flatironinstitute / dendro-old

Analyze neuroscience data in the cloud
https://flatironinstitute.github.io/dendro-docs/
Apache License 2.0

OSError: Can't synchronously read data (inflate() failed) #57

Closed: luiztauffer closed this issue 12 months ago

luiztauffer commented 1 year ago

Job id: 7ac261d9

Error traceback:

Starting kilosort2_5 processor
Opening remote input file
:::::::::::::::::::: PROCESSOR ELAPSED TIME: 0.948 s
Creating input recording
:::::::::::::::::::: PROCESSOR ELAPSED TIME: 1.720 s
Creating binary recording

write_binary_recording:  10%|#         | 95/917 [59:55<8:38:30, 37.85s/it]
Traceback (most recent call last):
  File "/app/main.py", line 187, in <module>
    app.run()
  File "/src/dendro/python/dendro/sdk/App.py", line 77, in run
    return self._run_job(job_id=JOB_ID, job_private_key=JOB_PRIVATE_KEY)
  File "/src/dendro/python/dendro/sdk/App.py", line 187, in _run_job
    processor_class.run(context)
  File "/app/main.py", line 85, in run
    recording_binary = make_int16_recording(recording, dirname='/tmp/int16_recording')
  File "/app/make_int16_recording.py", line 28, in make_int16_recording
    si.BinaryRecordingExtractor.write_recording(
  File "/home/miniconda3/lib/python3.9/site-packages/spikeinterface/core/binaryrecordingextractor.py", line 147, in write_recording
    write_binary_recording(recording, file_paths=file_paths, dtype=dtype, **job_kwargs)
  File "/home/miniconda3/lib/python3.9/site-packages/spikeinterface/core/core_tools.py", line 314, in write_binary_recording
    executor.run()
  File "/home/miniconda3/lib/python3.9/site-packages/spikeinterface/core/job_tools.py", line 376, in run
    res = self.func(segment_index, frame_start, frame_stop, worker_ctx)
  File "/home/miniconda3/lib/python3.9/site-packages/spikeinterface/core/core_tools.py", line 233, in _write_binary_chunk
    traces = recording.get_traces(
  File "/home/miniconda3/lib/python3.9/site-packages/spikeinterface/core/baserecording.py", line 278, in get_traces
    traces = rs.get_traces(start_frame=start_frame, end_frame=end_frame, channel_indices=channel_indices)
  File "/app/NwbRecording.py", line 80, in get_traces
    return self._electrical_series_data[start_frame:end_frame, channel_indices]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/home/miniconda3/lib/python3.9/site-packages/h5py/_hl/dataset.py", line 758, in __getitem__
    return self._fast_reader.read(args)
  File "h5py/_selector.pyx", line 376, in h5py._selector.Reader.read
OSError: Can't synchronously read data (inflate() failed)

@magland

magland commented 1 year ago

I've seen this before. It's a tough one because it seems to be internal to h5py. May be an issue with remfile somehow.

luiztauffer commented 1 year ago

@magland can we optionally use fsspec (remfile being the default) in order to test that?

magland commented 1 year ago

@magland can we optionally use fsspec (remfile being the default) in order to test that?

That's a good idea. It would be nice to isolate a reproducible example of the error outside of a dendro job.
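
Something along these lines might work as a starting point for a standalone repro (a rough sketch: the URL and dataset path are placeholders, and the read loop just forces every chunk to be fetched and decompressed through each backend):

# Rough sketch of a standalone repro, outside of a dendro job: stream the same
# electrical series chunk-by-chunk through remfile and through fsspec and see
# whether the inflate() error only shows up with remfile.
# NWB_URL and DATASET_PATH are placeholders; the actual ElectricalSeries
# location depends on the file.
import h5py
import fsspec
import remfile

NWB_URL = "https://example.org/path/to/file.nwb"      # placeholder
DATASET_PATH = "/acquisition/ElectricalSeries/data"   # placeholder

def read_all_chunks(h5f, step=30000):
    data = h5f[DATASET_PATH]
    for start in range(0, data.shape[0], step):
        _ = data[start:start + step, :]   # forces each chunk to be inflated

# remfile-backed read (the current default in dendro)
with h5py.File(remfile.File(NWB_URL), "r") as f:
    read_all_chunks(f)

# fsspec-backed read (the proposed alternative)
with fsspec.open(NWB_URL, "rb") as fp, h5py.File(fp, "r") as f:
    read_all_chunks(f)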

magland commented 1 year ago

I will add an option to use fsspec as an alternative to remfile (as discussed in our meeting).

magland commented 1 year ago

@luiztauffer when you have a chance, could you share with me a link to the project where this happened? Hopefully this is reproducible and we can track down the problem.

EDIT: nvm, I found it.

magland commented 1 year ago

@luiztauffer I realized something important in this example. The elapsed time was 3602.975 seconds, which almost certainly means that the issue was caused by an expired download url for an embargoed dandiset. I assume (hope) that this was running using a ks2.5 app that was built before I made modifications to enable auto-renewing of the download url.
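
For reference, the auto-renewal amounts to re-resolving a fresh presigned URL before the old one expires. A rough sketch of what that re-resolution looks like with the DANDI API client (the identifiers are placeholders, and this is not necessarily how the app implements it):

# Rough sketch: re-resolve a fresh presigned download URL for an asset in an
# embargoed dandiset. Identifiers are placeholders; the actual auto-renewal
# in the ks2.5 app may be implemented differently.
from dandi.dandiapi import DandiAPIClient

DANDISET_ID = "000000"                  # placeholder
ASSET_PATH = "sub-x/sub-x_ecephys.nwb"  # placeholder

def get_fresh_download_url(api_token: str) -> str:
    client = DandiAPIClient(token=api_token)   # embargoed data requires auth
    dandiset = client.get_dandiset(DANDISET_ID, "draft")
    asset = dandiset.get_asset_by_path(ASSET_PATH)
    # Resolves to a presigned S3 URL, which expires after a fixed window
    # (consistent with the ~3600 s elapsed time before the failure above).
    return asset.get_content_url(follow_redirects=1, strip_query=False)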

On a related topic for this dataset... the download was taking a very long time because the chunking of the elec series is very inefficient. The data need to be re-uploaded with the latest chunking settings in neuroconv.
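
If it helps when re-converting, here is a minimal sketch of wrapping the traces so they are written with time-contiguous chunks and compression (the chunk size is illustrative; the real settings should come from neuroconv's current defaults):

# Minimal sketch: wrap an (n_frames, n_channels) array so pynwb writes it with
# time-contiguous chunks and gzip compression. The chunk size is illustrative;
# neuroconv's current defaults should be used for the actual re-conversion.
import numpy as np
from pynwb import H5DataIO

def wrap_with_chunking(traces: np.ndarray, chunk_frames: int = 30000) -> H5DataIO:
    n_frames, n_channels = traces.shape
    return H5DataIO(
        data=traces,
        chunks=(min(chunk_frames, n_frames), n_channels),  # ~1 s of data per chunk
        compression="gzip",
    )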

luiztauffer commented 1 year ago

I assume (hope) that this was running using a ks2.5 app that was built before I made modifications to enable auto-renewing of the download url.

Ok let's hope so! I'll try that again later with the latest version of the App

On a related topic for this dataset... the download was taking a very long time because the chunking of the elec series is very inefficient. The data need to be re-uploaded with the latest chunking settings in neuroconv.

Yes, we should re-upload this with the improved chunking format, but in general we can't count on the chunking of every file being done efficiently. Can we have an eager_loading option for these apps as well? Would it make sense to have this as a feature of InputFile?

magland commented 1 year ago

Ok let's hope so! I'll try that again later with the latest version of the App

I wouldn't try it with this example. It took 1 hr to download only 10% of the data. I think we need to assume reasonable chunking.

Yes, we should re-upload this with the improved chunking format, but in general we can't count on the chunking of every file being done efficiently. Can we have an eager_loading option for these apps as well? Would it make sense to have this as a feature of InputFile?

What do you mean by eager_loading? Do you mean downloading the entire .nwb file up-front? We can do that, but do you think it should be a parameter of the processor?

luiztauffer commented 1 year ago

Yes, downloading the whole file instead of streaming; maybe eager is not the best word for it. And yes, a parameter of the processors, but the download code could be a feature of InputFile, if that makes sense?
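
To make the proposal concrete, a purely hypothetical sketch of what that could look like on the processor side (the parameter name and the InputFile methods are illustrative, not the current dendro SDK API):

# Purely hypothetical sketch of the proposal; the parameter name and the
# InputFile methods below are illustrative, not the current dendro SDK API.
def prepare_input(input_file, pre_download: bool = False) -> str:
    """Return something downstream code can open: a local path if the whole
    .nwb file was downloaded up-front, otherwise the remote URL for streaming."""
    if pre_download:
        local_path = "/tmp/input.nwb"
        input_file.download(local_path)    # hypothetical InputFile.download()
        return local_path
    return input_file.get_url()            # hypothetical InputFile.get_url()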

magland commented 1 year ago

Adding this note here. Another reason we don't want the eager (pre-download) option to be the default is that we want to be able to efficiently process a time segment of the dataset.
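
For example, with a streamed recording the window can be selected lazily, so only that window's chunks are ever downloaded (a sketch using spikeinterface's frame_slice; the helper name and the minute arguments are illustrative):

# Sketch: with a streamed (lazy) recording, restrict processing to a time
# window before any samples are read, so only that window is ever downloaded.
import spikeinterface as si

def slice_minutes(recording: si.BaseRecording, start_min: float, end_min: float) -> si.BaseRecording:
    fs = recording.get_sampling_frequency()
    return recording.frame_slice(
        start_frame=int(start_min * 60 * fs),
        end_frame=int(end_min * 60 * fs),
    )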

magland commented 1 year ago

@luiztauffer I tried to run this again with the pre-download option, and here's what happened:

The download took 2.7 hours (160 GB).

Then the next step was creating an int16 binary recording file by extracting the electrical series from the .nwb file. This was only 67% complete when the 5-hour total timeout for the job expired. The reason it was so slow, even with the file on disk, is that the chunking issue also applies to reading locally.

We can discuss this more, but I think the bottom line is that it's very important to ensure that the electrical series have sensible chunking.
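
A quick sanity check before launching a job is to look at the chunk layout of the electrical series (a sketch; the dataset path is a placeholder and depends on the file):

# Sketch: inspect the on-disk chunk layout of an electrical series before
# launching a job. A chunk shape that is very skewed toward one axis makes
# both remote streaming and local reads slow.
import h5py

def report_chunking(nwb_path: str, dataset_path: str = "/acquisition/ElectricalSeries/data") -> None:
    with h5py.File(nwb_path, "r") as f:
        data = f[dataset_path]
        print("shape:      ", data.shape)
        print("dtype:      ", data.dtype)
        print("chunks:     ", data.chunks)        # None means contiguous storage
        print("compression:", data.compression)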

luiztauffer commented 12 months ago

The problem seems to have been solved with proper chunking of the data at the conversion step!