Use ROS3 HDF5 driver or fsspec with local sparse cache for more efficient access.

yarikoptic commented 2 years ago

Relates to #264, which could possibly be avoided altogether by not fetching an .nwb file in full twice. It might also be of interest in the scope of the https://github.com/OpenSourceBrain/DANDIArchiveShowcase project that @anhknguyen96 is working on.

https://pynwb.readthedocs.io/en/stable/tutorials/advanced_io/streaming.html gives an example of how to use the ROS3 HDF5 driver to access a remote file in an S3 bucket (e.g. dandiarchive) without downloading it in full.
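For reference, a minimal sketch of that ROS3 approach, following the pattern in the streaming tutorial. The helper names `ros3_available` and `read_remote_nwb` are made up for illustration, and this assumes an h5py/HDF5 build that actually ships the ROS3 driver (not all of them do):

```python
def ros3_available():
    """Return True if the installed h5py build registers the ROS3 driver."""
    try:
        import h5py
    except ImportError:
        return False
    return "ros3" in h5py.registered_drivers()


def read_remote_nwb(s3_url):
    """Open an NWB file on S3 read-only through ROS3, without a full download.

    Hypothetical helper: returns (io, nwbfile); the caller should call
    io.close() when done, since dataset values are fetched lazily.
    """
    # Imported lazily so that ros3_available() works even without pynwb.
    from pynwb import NWBHDF5IO

    io = NWBHDF5IO(s3_url, mode="r", load_namespaces=True, driver="ros3")
    return io, io.read()


if __name__ == "__main__":
    print("ROS3 driver available:", ros3_available())
```

Usage would be something like `io, nwbfile = read_remote_nwb(url)` with the S3 URL of a dandiarchive asset, after checking `ros3_available()`.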

Another approach is HDF5-agnostic and uses fsspec, but it would require pynwb to be able to open from an existing file handle, which I am not sure is possible -- filed https://github.com/NeurodataWithoutBorders/pynwb/issues/1525 . (Well -- an alternative is a FUSE file system like the one provided by https://github.com/datalad/datalad-fuse/ for that file -- but it might be too ad hoc/heavy, although quite possible via FUSE'ing an entire bucket whenever a request comes in, and using a local cache with some garbage-collection routines to prune it down once in a while.)

anhknguyen96 commented 2 years ago

Hi, I think this could be a nice idea, although there are 2 concerns I have:

1. Read timeout (connection error) issue: as elaborated in this discussion, nwbinspector failed to fetch files via ROS3, possibly due to a connection issue. However, in this instance, I was trying to assess an entire dandiset (multiple files), so that might be a different picture.
2. From pynwb's streaming tutorial:

> The values of datasets are accessed lazily, just like when reading an NWB file stored locally. So, slicing into a dataset will require additional time to download the sliced data (and only the sliced data) to memory

If I understand this correctly, local files and streamed files are read in the same fashion, but the latter are only accessed in slices. If that's the case, in terms of usability, there might be added waiting time between visualizations of the slices, or even the possibility of a connection error? More critically, will all the information be presented in NWBE (acquisition/sweep series, as displayed in the example screenshot below), assuming it wouldn't exist locally if not accessed?

[Screenshot from 2022-08-05 11-08-14]

yarikoptic commented 2 years ago

FWIW, the streaming tutorial is being updated in https://github.com/NeurodataWithoutBorders/pynwb/pull/1526 to reflect a possible setup with fsspec and sparse caching. I think a setup such as the one demonstrated around

https://github.com/NeurodataWithoutBorders/pynwb/pull/1526/files#diff-7ba679ce5178386c42b194737582577de3ed40ddd4478e85592d304efbae262bR83

would be ideal for nwb-explorer: it caches locally (and eventually expires) the accessed parts of the files, giving fast performance for frequently accessed files without needing to download them in full (unless they are accessed in full or are just smaller than the default fsspec block size).
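A rough sketch of what such a setup could look like, modeled on the fsspec example in that tutorial update. The wrapper name `open_nwb_cached` and the `nwb-cache` folder are placeholders of mine, not anything nwb-explorer provides, and cache expiry is left out entirely:

```python
from contextlib import contextmanager


@contextmanager
def open_nwb_cached(url, cache_dir="nwb-cache"):
    """Yield an NWBFile read over HTTP through a local sparse cache.

    Hypothetical helper: only the parts of the remote file that are
    actually read get stored under cache_dir.
    """
    # Imported lazily so defining the helper does not require the packages.
    import fsspec
    import h5py
    from fsspec.implementations.cached import CachingFileSystem
    from pynwb import NWBHDF5IO

    fs = CachingFileSystem(
        fs=fsspec.filesystem("http"),
        cache_storage=cache_dir,  # local folder holding the cached parts
    )
    with fs.open(url, "rb") as f:
        with h5py.File(f, "r") as h5file:
            with NWBHDF5IO(file=h5file, mode="r", load_namespaces=True) as io:
                yield io.read()
```

It would be used as `with open_nwb_cached(url) as nwbfile: ...`; dataset values have to be read inside the `with` block, because the underlying file is closed on exit.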

> If I understand this correctly, local files and streamed files are read in the same fashion, but the latter are only accessed in slices.

Since "slice" can be a semantically meaningful term in neuroscience, let's call them "blocks". But overall -- correct.
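To make the "blocks" idea concrete, here is a small illustrative calculation of which fixed-size blocks a byte-range read would touch. The function name is made up, and the 5 MiB default is my assumption about fsspec's HTTP default block size, so check it against the installed version:

```python
# Assumed to mirror fsspec's default HTTP block size (5 MiB); verify locally.
DEFAULT_BLOCK_SIZE = 5 * 2**20


def blocks_for_range(start, stop, block_size=DEFAULT_BLOCK_SIZE):
    """Return indices of the fixed-size blocks covering bytes [start, stop)."""
    if stop <= start:
        return []  # empty read touches no blocks
    first = start // block_size
    last = (stop - 1) // block_size  # stop is exclusive
    return list(range(first, last + 1))


if __name__ == "__main__":
    # A 2-byte read straddling a block boundary needs two whole blocks.
    print(blocks_for_range(DEFAULT_BLOCK_SIZE - 1, DEFAULT_BLOCK_SIZE + 1))
```

So even a tiny read can pull in one or two full blocks, which is also why a file smaller than one block ends up downloaded in full.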

> If that's the case, in terms of usability, there might be added waiting time between visualizations of the slices, or even the possibility of a connection error?

Correct! But there is the possibility of a significant (x10, x100, ... ?) speed-up in the initial waiting time, while avoiding a lengthy or prohibitively large initial download.

After all, it could just be an option as well -- either download in full or provide cached or ROS3 access to the .nwb file.

> More critically, will all the information be presented in NWBE (acquisition/sweep series, as displayed in the example screenshot below), assuming it wouldn't exist locally if not accessed?

Best to ask @bendichter, but I guess they would come as requested.

bendichter commented 2 years ago

pynwb reads the entire structure of the HDF5 file and all attributes on the io.read() call. The only data that would be downloaded as requested would be the values of the datasets.
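In code terms, that split could look roughly like the sketch below: listing what is in `acquisition` only touches structure and metadata that `io.read()` already loaded, while slicing `.data` is what triggers a download on a streamed file. Both helper names are hypothetical, and this assumes each acquisition object exposes a `.data` attribute with a `.shape`, as a TimeSeries backed by an HDF5 dataset does:

```python
def summarize_acquisition(nwbfile):
    """Return {name: shape} for acquisition objects without reading values."""
    summary = {}
    for name, obj in nwbfile.acquisition.items():
        data = getattr(obj, "data", None)
        # .shape is dataset metadata; no dataset values are fetched here.
        summary[name] = getattr(data, "shape", None)
    return summary


def fetch_first_samples(nwbfile, name, n=10):
    """Slice the first n samples; on a streamed file this is the download."""
    return nwbfile.acquisition[name].data[:n]
```

So the acquisition/sweep series listing should be available right after `io.read()`, and only plotting a given series would wait on the network.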

filippomc commented 2 years ago

NWB Explorer is not really tested with all the streaming cases in mind yet, but it follows the pynwb idea of separating listing from data fetching. So we need to instruct NWB Explorer to use the proper API (https://github.com/MetaCell/nwb-explorer/blob/ca06d267d36eb179c34433b64483d3c80ee32d29/nwb_explorer/nwb_model_interpreter/nwb_reader.py#L113) and then launch io.read on a remote resource -- note that external URLs are currently downloaded locally by default: https://github.com/MetaCell/nwb-explorer/blob/ca06d267d36eb179c34433b64483d3c80ee32d29/nwb_explorer/nwb_data_manager.py#L45

MetaCell / nwb-explorer

Use ROS3 HDF5 driver or fsspec with local sparse cache for more efficient access. #307