Closed bendichter closed 5 months ago
It does use zarr... but what @rly and I are proposing is that we have an h5py-like client that wraps the Zarr NWB. That way we can seamlessly use Zarr or HDF5 in the same pynwb -- or with other tools that expect an h5py client.
Ok so what would the type of the in-memory Dataset be? ZarrDataset, h5py.Dataset, or a new type?
> Ok so what would the type of the in-memory Dataset be? ZarrDataset, h5py.Dataset, or a new type?
It is a new type, lindi.LindiDataset
But it is duck typed the same as h5py.Dataset... and when it comes to slicing, it just passes that through to Zarr slicing (or h5py slicing, if HDF5 is what is being wrapped for that part of the NWB).
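To illustrate the delegation idea being described, here is a minimal sketch of a duck-typed dataset wrapper. The class name, and the use of a plain NumPy array as the backing store, are purely illustrative -- this is not the actual lindi API, just the pattern of exposing an h5py.Dataset-like surface while passing slicing through to whatever store is wrapped:

```python
import numpy as np

class DuckDataset:
    """Illustrative stand-in for a LindiDataset-style wrapper:
    exposes an h5py.Dataset-like interface (shape, dtype, slicing)
    but delegates all indexing to the wrapped backing array, which
    could be a zarr.Array, an h5py.Dataset, or here a NumPy array."""

    def __init__(self, backing):
        self._backing = backing

    @property
    def shape(self):
        return self._backing.shape

    @property
    def dtype(self):
        return self._backing.dtype

    def __getitem__(self, key):
        # Slicing is passed straight through to the wrapped store
        return self._backing[key]

ds = DuckDataset(np.arange(12).reshape(3, 4))
print(ds.shape)   # (3, 4)
print(ds[1, 2])   # 6
```

Code written against the h5py slicing interface would work unchanged against such a wrapper, regardless of which backend holds the bytes.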
We would need to modify pynwb to allow usage of lindi.LindiGroup and lindi.LindiDataset throughout
Wrapping for h5py is an interesting approach and seems useful, but, I think it may be easier to go through hdmf-zarr first to integrate with PyNWB to use Zarr directly.
> It is a new type, lindi.LindiDataset
Would `lindi.LindiDataset` inherit from `h5py.Dataset`?
> Wrapping for h5py is an interesting approach and seems useful, but, I think it may be easier to go through hdmf-zarr first to integrate with PyNWB to use Zarr directly.
The problem with using the Zarr Python client here is that it is not flexible enough to link to different types of data as part of the NWB. For example, consider an existing NWB that has a lot of moderate-sized datasets, plus one giant dataset with 1 million chunks (raw ephys). It's not practical to represent that as a Zarr. It needs to be a Zarr with a special annotation on that one dataset with a pointer to the original NWB on DANDI. This is what lindi.LindiClient can handle... with an h5py-compatible API.
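A kerchunk-style reference mapping shows what such an annotation could look like. This is only a sketch of the general idea, not the LINDI format itself; the URL, byte offsets, and dataset layout are all made up for illustration:

```python
import json

# A Zarr-like store expressed as references: metadata is stored inline,
# while each chunk of the giant dataset is a [url, offset, length]
# pointer into the original HDF5 file on DANDI, so nothing is copied.
# All names, URLs, and numbers here are hypothetical.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": json.dumps({"zarr_format": 2}),
        "acquisition/raw/.zarray": json.dumps(
            {"shape": [100], "chunks": [50], "dtype": "<i8",
             "compressor": None, "fill_value": 0, "order": "C",
             "filters": None, "zarr_format": 2}
        ),
        # each chunk: [source URL, byte offset, byte length]
        "acquisition/raw/0": ["https://example.org/original.nwb", 2048, 400],
        "acquisition/raw/1": ["https://example.org/original.nwb", 2448, 400],
    },
}

url, offset, length = refs["refs"]["acquisition/raw/0"]
print(url, offset, length)
```

A client resolving `acquisition/raw/0` would issue a ranged read against the original file rather than fetching a locally stored chunk.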
> Would `lindi.LindiDataset` inherit from `h5py.Dataset`?
Possibly... I'm not sure...
> For example, an existing NWB that has a lot of moderate-sized datasets, plus one giant dataset with 1 million chunks (raw ephys). It's not practical to represent that as a Zarr.
It looks like the Zarr community is dealing with this by implementing sharding. @magland, are you familiar with this codec specification for sharding in Zarr merged two months ago? Does this resolve the issue with millions of chunks or are there other issues you are hoping to address with LINDI?
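For context on what the sharding codec does, here is a sketch of Zarr v3 array metadata using it: many small inner chunks are packed into one storage object ("shard"), so a million-chunk array does not require a million separate objects. The field names follow my reading of the merged specification and the shapes are invented for illustration:

```python
# Hypothetical Zarr v3 metadata for a raw-ephys-sized array.
# The outer chunk_grid shape defines the shard (one storage object);
# the sharding codec's inner chunk_shape defines the units that are
# actually read and decompressed individually.
metadata = {
    "shape": [30_000_000, 384],          # e.g. samples x channels
    "chunk_grid": {
        "name": "regular",
        "configuration": {"chunk_shape": [1_000_000, 384]},  # shard
    },
    "codecs": [
        {
            "name": "sharding_indexed",
            "configuration": {
                "chunk_shape": [10_000, 384],  # inner chunk
                "codecs": [{"name": "bytes"}, {"name": "gzip"}],
            },
        }
    ],
}

shards_per_axis = metadata["shape"][0] // 1_000_000
print(shards_per_axis)  # 30 storage objects instead of 3000
```

Each shard also carries an index of its inner chunks, so a reader can still do ranged reads of individual inner chunks without downloading the whole shard.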
> It looks like the Zarr community is dealing with this by implementing sharding. @magland, are you familiar with this codec specification for sharding in Zarr merged two months ago? Does this resolve the issue with millions of chunks or are there other issues you are hoping to address with LINDI?
I hadn't seen that, thanks. I think sharding is the right way to go for newly created NWBs. However, I don't think it's practical to re-chunk and re-upload all the existing HDF5 NWB files, and that's where LINDI comes in.
@bendichter things are explained better now in the latest README... and the pynwb integration even works!
If it answers your question we can close this issue.
These lines in the README:
seem to indicate we will be using h5py to read the file, which confused me. The rest of the README workflow looks like we are doing something similar to kerchunk, so I would expect to use Zarr here.