auto-pi-lot / autopilot

Distributed behavioral experiments
https://docs.auto-pi-lot.com
Mozilla Public License 2.0

problems opening hdf5 file #190

Open · cxrodgers opened this issue 2 years ago

cxrodgers commented 2 years ago

I am somewhat frequently getting HDF5 errors when starting a session. Two threads hit exceptions at essentially the same moment here, so their tracebacks printed interleaved in the console; they're separated below for readability.

                    DEBUG    [registry.get] Found <class 'autopilot.tasks.paft.PAFT'> as subclass of <class 'autopilot.tasks.task.Task'> in module autopilot.tasks.paft   registry.py:111
                    DEBUG    [data.subject.Subject.DistractorTaskHigh220726_v8] error opening /home/mouse/autopilot/data/DistractorTaskHigh220726_v8.h5   subject.py:186
                    DEBUG    [data.subject.Subject.DistractorTaskHigh220726_v8] error opening /home/mouse/autopilot/data/DistractorTaskHigh220726_v8.h5   subject.py:210

Exception in thread Thread-7:
Traceback (most recent call last):
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.9/threading.py", line 950, in _bootstrap_inner
    self.run()
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.9/threading.py", line 888, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mouse/dev/autopilot/autopilot/data/subject.py", line 907, in _data_thread
    with self._h5f() as h5f:
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.9/contextlib.py", line 117, in __enter__
    return next(self.gen)
  File "/home/mouse/dev/autopilot/autopilot/data/subject.py", line 183, in _h5f
    h5f = tables.open_file(str(self.file), mode="r+")
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.9/site-packages/tables/file.py", line 300, in open_file
    return File(filename, mode, title, root_uep, filters, **kwargs)
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.9/site-packages/tables/file.py", line 750, in __init__
    self._g_new(filename, mode, **params)
  File "tables/hdf5extension.pyx", line 486, in tables.hdf5extension.File._g_new
tables.exceptions.HDF5ExtError: HDF5 error back trace

  File "H5F.c", line 620, in H5Fopen
    unable to open file
  File "H5Iint.c", line 1104, in H5I_inc_ref
    can't locate ID
  File "H5PLpath.c", line 578, in H5PL__path_table_iterate
    can't iterate over plugins in plugin path '(null)'
  File "H5PLpath.c", line 620, in H5PL__path_table_iterate_process_path
    can't open directory: /usr/local/hdf5/lib/plugin
  File "H5Iint.c", line 1104, in H5I_inc_ref
    can't locate ID
  File "H5VLnative_file.c", line 97, in H5VL__native_file_open
    unable to open file
  File "H5Fint.c", line 1862, in H5F_open
    file is already open for read-only

End of HDF5 error back trace

Unable to open/create file '/home/mouse/autopilot/data/DistractorTaskHigh220726_v8.h5'

Traceback (most recent call last):
  File "/home/mouse/dev/autopilot/autopilot/gui/widgets/terminal.py", line 442, in toggle_start
    self.start_fn(True, self.pilot, current_subject)
  File "/home/mouse/dev/autopilot/autopilot/agents/terminal.py", line 548, in toggle_start
    task = self.subjects[subject].prepare_run()
  File "/home/mouse/dev/autopilot/autopilot/data/subject.py", line 744, in prepare_run
    task_params['step'] = int(self.step)
  File "/home/mouse/dev/autopilot/autopilot/data/subject.py", line 366, in step
    return self.protocol.step
  File "/home/mouse/dev/autopilot/autopilot/data/subject.py", line 257, in protocol
    protocoldict['protocol'] = json.loads(protocol_node.readall())
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.9/site-packages/tables/nodes/filenode.py", line 246, in readinto
    b[:n] = self._node.read(start, stop).tobytes()
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.9/site-packages/tables/array.py", line 866, in read
    arr = self._read(start, stop, step, out)
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.9/site-packages/tables/array.py", line 823, in _read
    self._read_array(start, stop, step, arr)
  File "tables/hdf5extension.pyx", line 1537, in tables.hdf5extension.Array._read_array
tables.exceptions.HDF5ExtError: Problems reading the array data.

I feel like the problem might be here: https://github.com/auto-pi-lot/autopilot/blob/04b5968ba02c8a1413a27eb6a138b6a186b130f1/autopilot/data/subject.py#L189 If one thread has the file open for read/write and another one tries to open it for read, the try/except there (lines 188-194) will have the second thread open it for read/write as well. That can't work, right? Only one thread can have write access at a time.
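
For what it's worth, the mixed-mode conflict is easy to provoke even serially (a minimal sketch, not autopilot code; the exact exception depends on the PyTables/HDF5 versions, and when the two opens race from separate threads it can surface as the HDF5ExtError above instead):

```python
import tables

# make a subject-like file
with tables.open_file("subject.h5", mode="w") as f:
    f.create_array("/", "x", [1, 2, 3])

ro = tables.open_file("subject.h5", mode="r")       # first handle: read-only
try:
    rw = tables.open_file("subject.h5", mode="r+")  # second open: read/write on the same file
except Exception as e:
    # PyTables' open-file registry (or, in a race, HDF5 itself) rejects the mixed-mode open
    print(type(e).__name__, e)
finally:
    ro.close()
```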

sneakers-the-rat commented 2 years ago

hmm, there are a few things that seem to be going on here. the line numbers are different, which commit/branch is this from?

it should be possible to open multiple read-only handles, which looks like what should be happening when the current step is accessed. this looks like it happens during the prepare_run call, but it also looks like exceptions from multiple threads are overlapping in the output.

I can try to reproduce, just lmk where the task plugin is / which version of the code this is. otherwise, ya, puzzling; i'd need to trace the calls more explicitly with debug messages.

sneakers-the-rat commented 2 years ago

PS thanks for all these, sorry for the incomplete testing that's making them necessary, just time and effort crunch.

cxrodgers commented 2 years ago

You're right, it's a branch where I've changed other things. Next week I'll try to reproduce in the dev branch so I know whether it's my own changes that have caused this.


cxrodgers commented 2 years ago

Ok, I think I can more or less replicate this bug on the dev branch of auto-pi-lot/autopilot. To do this, I start the Terminal GUI, create a new Subject on a protocol for the Nafc task, and then click the green Start and red Stop buttons a few times (waiting a few seconds in between, but not triggering any pokes), so I'm just running empty sessions. Eventually I start seeing HDF5 errors go by in the terminal. More rarely, but not that rarely, I get a segmentation fault.

This is a fiddly bug and hard to reproduce; it doesn't happen every single time, but it happens often enough to be problematic. My intuition is that it has something to do with the fact that the HDF5 file is opened so many times (possibly every time the current protocol step is queried?), and that the _h5f function will open in either read-only or read-write mode depending on what else has an open handle at that moment. It's exacerbated when I add debugging statements here and there (for example, here: https://github.com/auto-pi-lot/autopilot/blob/04b5968ba02c8a1413a27eb6a138b6a186b130f1/autopilot/data/subject.py#L223). Possibly there's a race condition where multiple threads try to access the HDF5 file at overlapping times.
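
If it is a race, one blunt way to test that would be to funnel every open through a lock, something like this (a sketch of the idea with made-up names, not a patch against subject.py):

```python
import threading
from contextlib import contextmanager

import tables

_file_lock = threading.Lock()  # in real code, probably one lock per subject file

@contextmanager
def locked_h5f(path, mode="r+"):
    # hold the lock for the whole open/use/close cycle, so no two threads
    # can ever have the same file open in conflicting modes
    with _file_lock:
        h5f = tables.open_file(str(path), mode=mode)
        try:
            yield h5f
        finally:
            h5f.close()
```

If the errors disappear when everything goes through one lock, that would point pretty squarely at the mixed-mode race.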

I am wondering if it might be easier to keep the subject-specific parameters (like the current protocol step) in a different file from the session data. Then there's no competition between _data_thread writing session data and everywhere else loading subject-specific parameters.

Here's an example of the HDF5 error:

                    ERROR    [data.subject.Subject.NAFC_v15] Exception in graduation parameter specification, graduation is disabled.                              subject.py:832
                             got error: 'type'                                                                                                                                   
                             ╭──────────────────────────────────────────────── Traceback (most recent call last) ────────────────────────────────────────────────╮               
                             │                                                                                                                                   │               
                             │ /home/mouse/dev/autopilot/autopilot/data/subject.py:804 in _prepare_graduation                                                    │               
                             │                                                                                                                                   │               
                             │    801 │                                                                                                                          │               
                             │    802 │   def _prepare_graduation(self, task_params:dict, trial_tab:tables.table.Table) ->                                       │               
                             │        'Graduation':                                                                                                              │               
                             │    803 │   │   try:                                                                                                               │               
                             │ ❱  804 │   │   │   grad_type = task_params['graduation']['type']                                                                  │               
                             │    805 │   │   │   grad_params = task_params['graduation']['value'].copy()                                                        │               
                             │    806 │   │   │                                                                                                                  │               
                             │    807 │   │   │   # add other params asked for by the task class                                                                 │               
                             ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯               
                             KeyError: 'type'                                                                                                                                    
[22-08-17T16:43:39] DEBUG    [networking.node.Net_Node._T] MESSAGE SENT - ID: _T_23; TO: rpi01; SENDER: _T; KEY: START; FLAGS: {'NOREPEAT': True}; VALUE: {'bias_mode': False, 'correction': False, 'graduation': {}, 'punish_stim': False, 'req_reward': False, 'reward': 50, 'step_name': 'Nafc', 'stim': {'sounds': {'L': [{'amplitude': '.1', 'channel': '0', 'duration': '100', 'type': 'Noise'}], 'R': [{'amplitude': '.1', 'channel': '1', 'duration': '100', 'type': 'Noise'}]}, 'tag': 'Sounds', 'type': 'sounds'}, 'task_type': 'Nafc', 'subject': 'NAFC_v15', 'step': 0, 'current_trial': 1, 'session': 4, 'pilot': 'rpi01'}   node.py:333
                    DEBUG    [networking.node.Net_Node._T] MESSAGE SENT - ID: _T_24; TO: P_rpi01; SENDER: _T; KEY: START; FLAGS: {'NOREPEAT': True}; VALUE: {'bias_mode': False, 'correction': False, 'graduation': {}, 'punish_stim': False, 'req_reward': False, 'reward': 50, 'step_name': 'Nafc', 'stim': {'sounds': {'L': [{'amplitude': '.1', 'channel': '0', 'duration': '100', 'type': 'Noise'}], 'R': [{'amplitude': '.1', 'channel': '1', 'duration': '100', 'type': 'Noise'}]}, 'tag': 'Sounds', 'type': 'sounds'}, 'task_type': 'Nafc', 'subject': 'NAFC_v15', 'step': 0, 'current_trial': 1, 'session': 4, 'pilot': 'rpi01'}   node.py:333

Exception in thread Thread-40:
Traceback (most recent call last):
  File "/home/mouse/dev/autopilot/autopilot/data/subject.py", line 180, in _h5f
    h5f = tables.open_file(str(self.file), mode="r+")
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.9/site-packages/tables/file.py", line 300, in open_file
    return File(filename, mode, title, root_uep, filters, **kwargs)
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.9/site-packages/tables/file.py", line 750, in __init__
    self._g_new(filename, mode, **params)
  File "tables/hdf5extension.pyx", line 486, in tables.hdf5extension.File._g_new
tables.exceptions.HDF5ExtError: HDF5 error back trace

  File "H5F.c", line 620, in H5Fopen
    unable to open file
  File "H5VLcallback.c", line 3501, in H5VL_file_open
    failed to iterate over available VOL connector plugins
  File "H5PLpath.c", line 578, in H5PL__path_table_iterate
    can't iterate over plugins in plugin path '(null)'
  File "H5PLpath.c", line 620, in H5PL__path_table_iterate_process_path
    can't open directory: /usr/local/hdf5/lib/plugin
  File "H5VLcallback.c", line 3351, in H5VL__file_open
    open failed
  File "H5VLnative_file.c", line 97, in H5VL__native_file_open
    unable to open file
  File "H5Fint.c", line 1862, in H5F_open
    file is already open for read-only

End of HDF5 error back trace

Unable to open/create file '/home/mouse/autopilot/data/NAFC_v15.h5'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.9/threading.py", line 950, in _bootstrap_inner
    self.run()
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.9/threading.py", line 888, in run
    self._target(*self._args, **self._kwargs)
  File "/home/mouse/dev/autopilot/autopilot/data/subject.py", line 856, in _data_thread
    with self._h5f() as h5f:
  File "/home/mouse/miniconda3/envs/autopilot/lib/python3.9/contextlib.py", line 117, in __enter__
    return next(self.gen)
  File "/home/mouse/dev/autopilot/autopilot/data/subject.py", line 183, in _h5f
    h5f.flush()
UnboundLocalError: local variable 'h5f' referenced before assignment
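
For what it's worth, the UnboundLocalError at the end looks like what you'd get from a context manager shaped roughly like this (my guess at the shape, not the actual _h5f source):

```python
import logging
from contextlib import contextmanager

import tables

logger = logging.getLogger("subject")

@contextmanager
def _h5f_sketch(path):
    try:
        h5f = tables.open_file(str(path), mode="r+")
    except Exception:
        logger.debug(f"error opening {path}")  # the open failed, so h5f was never bound...
    h5f.flush()  # ...and this line then raises UnboundLocalError
    yield h5f
```

Re-raising from the except block, instead of logging and falling through, would at least surface the real HDF5 error rather than the confusing UnboundLocalError.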

Here's an example of the segmentation fault:

                    ERROR    [data.subject.Subject.NAFC_v15] Exception in graduation parameter specification, graduation is disabled.                              subject.py:832
                             got error: 'type'                                                                                                                                   
                             ╭──────────────────────────────────────────────── Traceback (most recent call last) ────────────────────────────────────────────────╮               
                             │                                                                                                                                   │               
                             │ /home/mouse/dev/autopilot/autopilot/data/subject.py:804 in _prepare_graduation                                                    │               
                             │                                                                                                                                   │               
                             │    801 │                                                                                                                          │               
                             │    802 │   def _prepare_graduation(self, task_params:dict, trial_tab:tables.table.Table) ->                                       │               
                             │        'Graduation':                                                                                                              │               
                             │    803 │   │   try:                                                                                                               │               
                             │ ❱  804 │   │   │   grad_type = task_params['graduation']['type']                                                                  │               
                             │    805 │   │   │   grad_params = task_params['graduation']['value'].copy()                                                        │               
                             │    806 │   │   │                                                                                                                  │               
                             │    807 │   │   │   # add other params asked for by the task class                                                                 │               
                             ╰───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯               
                             KeyError: 'type'                                                                                                                                    
Segmentation fault (core dumped)

cxrodgers commented 2 years ago

Still working on it, but I think I have a workaround. It's a substantial and breaking change to the data format, but it's something I've been meaning to implement in my own fork for a while anyway: the trial data for each session is now stored in its own hdf5 file. So there is the subject hdf5 file, which contains parameters, and a separate data hdf5 file for each session. This seems to work better for me so far, I would speculate because _data_thread no longer needs to write to the same file that the rest of autopilot is reading and writing.

This change also makes more sense for my workflow, because each session can have its own columns for trial data, which works well when the task changes over time (e.g., multiple very distinct stages of training with totally different parameters). It also makes it easier to deal with the inevitable session that needs to be deleted or renamed because something broke or the wrong mouse was clicked: just delete the file or rename it on disk.
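
Concretely, the layout I'm describing is something like this (a simplified sketch of the scheme; the names are made up):

```python
from pathlib import Path

def session_filename(data_dir: Path, subject: str, session: int) -> Path:
    # parameters stay in <subject>.h5; each session's trial data gets its own
    # file, so _data_thread never writes to the file everything else reads
    return data_dir / subject / f"{subject}_session{session:03d}.h5"

# e.g. /home/mouse/autopilot/data/NAFC_v15/NAFC_v15_session004.h5
print(session_filename(Path("/home/mouse/autopilot/data"), "NAFC_v15", 4))
```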

sneakers-the-rat commented 2 years ago

Sorry I have been AWOL on this, and that it's been a headache. I am trying to take some time off, and part of that is that I wanted to catch up with the issues you've raised here, but I keep getting pulled into other things. I think it would be nice to be able to do bidirectional multi-file routines: e.g., be able to 'split up' a single subject file into multiple files and then 'zip' them back up into a single one.

Wonder how you have been doing the training. The current subject model will make separate folders for data from multiple tasks within a single protocol, and then if the task changes, the data should be stashed and new tables created. There isn't a neat and tidy way of browsing and restoring that old data yet, which has been a longtime todo item, but the idea is that it would preserve full provenance: rather than deleting data, you'd be able to indicate whether it was junk.

The multi-file thing is interesting to me because 'zipping' multiple datasets is the direction I eventually wanted to head in anyway, e.g. being able to pull in output from Open Ephys at the end of a recording. And the problems with writing to a single hdf5 file are, as you note, a pain in the ass. So maybe we open these as 'temporary' or 'raw' data and then have an 'export' method to combine them all? Sorry for being redundant, running to do something else rn.
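
For the 'zip' direction, something like this might be most of what's needed (just sketching the idea with pytables' copy_children; all names hypothetical):

```python
from pathlib import Path

import tables

def zip_sessions(subject_file: Path, session_files: list[Path]) -> None:
    # copy each per-session file into its own group of the combined subject file
    with tables.open_file(str(subject_file), mode="a") as dest:
        for session_file in sorted(session_files):
            group = dest.create_group("/sessions", session_file.stem,
                                      createparents=True)
            with tables.open_file(str(session_file), mode="r") as src:
                src.copy_children(src.root, group, recursive=True)
```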

cxrodgers commented 2 years ago

No worries at all! I do hope you take some time off, and I hope you don't spend that time off working on my github issues. :) I mostly posted this because I wanted to let you know I have a workaround now, so you don't have to sweat this issue.

I see what you mean: the current model should handle multiple tasks. However, because everything is always changing, I usually just change my current task in place, which unfortunately defeats that mechanism. In the long term that shouldn't happen, but you know how it goes.

In general I like the idea of "many medium-size files" that can be zipped together as needed. The one-monolithic-file approach can be hard to maintain, especially as things change, and the zillions-of-tiny-files approach can be hard to organize.

I also like to keep a copy of all of the source files (or at least the task file and its parameters) at the time each session was run, alongside the data. Saving the files has saved me a couple of times in the past when I couldn't remember when I changed something critical. I see you have the commit hash stored with the session, which is a good idea, but I'm not always so good about committing changes before running the task.
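
A minimal version of that snapshotting could look like this (a hypothetical helper, not an autopilot API):

```python
import hashlib
import inspect
import shutil
from pathlib import Path

def snapshot_task_source(task_class: type, session_dir: Path) -> None:
    # copy the task's source file next to the session data, plus its hash,
    # so the exact code that ran is recoverable even with uncommitted changes
    src = Path(inspect.getfile(task_class))
    session_dir.mkdir(parents=True, exist_ok=True)
    dest = session_dir / src.name
    shutil.copy2(src, dest)
    digest = hashlib.sha256(dest.read_bytes()).hexdigest()
    (session_dir / f"{src.stem}.sha256").write_text(digest + "\n")
```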

jmdelahanty commented 2 years ago

No idea if this is appropriate or would be helpful at all, but maybe zarr could be a way to do this kind of thing? It seems to allow parallel read/write much better out of the box than HDF5 was designed to, though I don't actually know much about it for this kind of use case. The downside with zarr is that a store is a nested structure of many small files, so it can be hard to move around if you need to. Perhaps zarr could be an intermediate store that gets stitched together later on?
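
For a sense of what that could look like (a rough sketch against zarr's v2 API; whether it actually fits autopilot's write patterns is an open question):

```python
import numpy as np
import zarr

# each session gets its own group; zarr stores chunks as separate files on disk,
# so a writer appending trials doesn't hold one whole-file handle the way a
# single HDF5 file does
root = zarr.open_group("NAFC_v15.zarr", mode="a")
session = root.require_group("session_004")

if "response_time" not in session:
    session.create_dataset("response_time", shape=(0,), chunks=(128,), dtype="f8")

session["response_time"].append(np.array([0.45]))  # append one trial's datum
```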