yarikoptic opened 2 years ago
Hi, I think this could be a nice idea, although I have two concerns:
> The values of datasets are accessed lazily, just like when reading an NWB file stored locally. So, slicing into a dataset will require additional time to download the sliced data (and only the sliced data) to memory.
If I understand this correctly, local files and streamed files are read in the same fashion, but the latter are only accessed in slices. If that's the case, in terms of usability, there might be added waiting time in between visualization of the slices, or even the possibility of a connection error? More critically, will all the information be presented on NWBE (acquisition/sweep series, as displayed in the example screenshot below), since presumably they wouldn't exist locally if not accessed?
FWIW, the streaming tutorial is being updated in https://github.com/NeurodataWithoutBorders/pynwb/pull/1526 to reflect a possible setup with fsspec and sparse caching. I think a setup like the one demonstrated there would be ideal for nwb-explorer: locally caching (and eventually expiring) the accessed parts of the files would give fast performance for frequently accessed files, without needing to download them in full (unless they are accessed in full, or are simply smaller than the default fsspec block size).
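For illustration, a minimal sketch of that fsspec + sparse-caching pattern (the URL is a placeholder, and passing an open h5py.File to NWBHDF5IO is exactly the capability discussed in pynwb#1525):

```python
import fsspec
import h5py
from fsspec.implementations.cached import CachingFileSystem
from pynwb import NWBHDF5IO

# Wrap an HTTP filesystem in a sparsely caching one: only the byte
# ranges actually read are fetched and kept on local disk.
fs = CachingFileSystem(
    fs=fsspec.filesystem("http"),
    cache_storage="nwb-cache",  # local directory for the block cache
)

url = "https://dandiarchive.s3.amazonaws.com/.../example.nwb"  # placeholder URL
with fs.open(url, "rb") as f:
    with h5py.File(f, "r") as h5file:
        with NWBHDF5IO(file=h5file, load_namespaces=True) as io:
            nwbfile = io.read()  # metadata only; dataset values stay remote
```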
> If I understand this correctly, local files and streamed files are read in the same fashion, but the latter are only accessed in slices.
Since "slice" can be a semantically meaningful term in neuroscience, let's call them "blocks". But overall -- correct.
> If that's the case, in terms of usability, there might be added waiting time in between visualization of the slices, or even the possibility of a connection error?
Correct! But there is the possibility of a significant (x10, x100, ...?) speed-up in initial waiting time, while avoiding a lengthy or prohibitively large initial download. After all, it could also just be an option -- either download in full, or provide cached or ros3 access to the .nwb file.
> More critically, will all the information be presented on NWBE (acquisition/sweep series, as displayed in the example screenshot below), since presumably they wouldn't exist locally if not accessed?
Best to ask @bendichter, but I guess they would come as requested.
pynwb reads the entire structure of the HDF5 file and all attributes on the `io.read()` call. The only data that would be downloaded on request would be the values of the datasets.
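In code, that means something like the following (the series name here is just an assumption for illustration):

```python
nwbfile = io.read()  # reads the full HDF5 structure and all attributes

# Dataset values are left on the remote side until sliced.
series = nwbfile.acquisition["ElectricalSeries"]  # hypothetical series name
block = series.data[0:1000]  # only these values are downloaded, and only now
```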
NWB Explorer is not really tested with all the streaming cases in mind yet, but it follows the pynwb idea of separating listing from data fetching. So we need to instruct NWB Explorer to use the proper API (https://github.com/MetaCell/nwb-explorer/blob/ca06d267d36eb179c34433b64483d3c80ee32d29/nwb_explorer/nwb_model_interpreter/nwb_reader.py#L113) and then launch `io.read()` on a remote resource -- note that currently external URLs are downloaded locally by default: https://github.com/MetaCell/nwb-explorer/blob/ca06d267d36eb179c34433b64483d3c80ee32d29/nwb_explorer/nwb_data_manager.py#L45
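A hypothetical sketch of that change (the helper name and dispatch logic are assumptions, not the actual nwb-explorer code):

```python
from pynwb import NWBHDF5IO

def open_nwb(path_or_url: str) -> NWBHDF5IO:
    """Hypothetical helper: stream remote files instead of downloading them."""
    if path_or_url.startswith(("http://", "https://")):
        # requires h5py built against an HDF5 that includes the ros3 driver
        return NWBHDF5IO(path_or_url, mode="r", driver="ros3")
    return NWBHDF5IO(path_or_url, mode="r")
```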
Relates to #264, which could possibly be resolved by avoiding ever fetching an .nwb file in full twice. Also might be of interest in the scope of https://github.com/OpenSourceBrain/DANDIArchiveShowcase, which @anhknguyen96 is working on.
https://pynwb.readthedocs.io/en/stable/tutorials/advanced_io/streaming.html gives an example of how to use
ros3
HDF5 driver to access remote file on S3 bucket (e.g. dandiarchive) without downloading it in full.Another approach is HDF5 agnostic, using some fsspec but it would require pynwb to be able to open from an existing file handle which I am not sure if possible -- filed https://github.com/NeurodataWithoutBorders/pynwb/issues/1525 . (well -- alternative is a fuse file system like the one provided by https://github.com/datalad/datalad-fuse/ for that file -- but might be too ad-hoc/heavy although quite possible via FUSE'ing an entire bucket whenever request comes in, and using local cache with some garbage-collection routines to prune it down once in a while).