NeurodataWithoutBorders / lindi

Linked Data Interface (LINDI) - cloud-friendly access to NWB data
BSD 3-Clause "New" or "Revised" License

HDF5 reader or Zarr reader? #17

Closed: bendichter closed this issue 5 months ago

bendichter commented 5 months ago

These lines in the README:

import pynwb

# Try to read using pynwb
# (This part does not work yet)
# `client` is the LINDI client constructed earlier in the README
with pynwb.NWBHDF5IO(file=client, mode="r") as io:
    nwbfile = io.read()
    print(nwbfile)

seem to indicate that we will be using h5py to read the file, which confused me. The rest of the README workflow looks like we are doing something similar to kerchunk, so I would expect Zarr to be used here.
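
(For context: kerchunk represents an existing HDF5 file as a JSON index mapping Zarr store keys to byte ranges in the original file. A minimal, hypothetical sketch of that format follows; all paths and offsets here are made up.)

# Hypothetical kerchunk-style reference mapping. Zarr store keys map
# either to inline metadata strings or to [url, offset, length] byte
# ranges inside the original remote HDF5 file.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        # each chunk key points at a byte range in the remote file
        "data/0": ["https://example.org/example.nwb", 4096, 51200],
    },
}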

magland commented 5 months ago

It does use Zarr... but what @rly and I are proposing is an h5py-like client that wraps the Zarr NWB. That way we can seamlessly use Zarr or HDF5 with the same pynwb code -- or with other tools that expect an h5py client.

bendichter commented 5 months ago

Ok so what would the type of the in-memory Dataset be? ZarrDataset, h5py.Dataset, or a new type?

magland commented 5 months ago

> Ok so what would the type of the in-memory Dataset be? ZarrDataset, h5py.Dataset, or a new type?

It is a new type, lindi.LindiDataset

But it is duck-typed the same as h5py.Dataset... and when it comes to slicing, it just passes that through to Zarr slicing (or h5py slicing, if HDF5 is what is being wrapped at that part of the NWB)

We would need to modify pynwb to allow usage of lindi.LindiGroup and lindi.LindiDataset throughout
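
(To make the duck typing concrete, here is a minimal sketch -- not the actual lindi implementation -- of a dataset wrapper that looks like h5py.Dataset and forwards slicing to whatever backend it wraps.)

# Minimal sketch of the duck-typing idea (not lindi's actual code):
# expose the h5py.Dataset attributes pynwb relies on and forward
# slicing to the wrapped backend array.
class WrappedDataset:
    def __init__(self, arr):
        # arr is a zarr.Array, or an h5py.Dataset for the parts of
        # the NWB that are still backed by HDF5
        self._arr = arr

    @property
    def shape(self):
        return self._arr.shape

    @property
    def dtype(self):
        return self._arr.dtype

    def __getitem__(self, selection):
        # slicing passes straight through to the backend
        return self._arr[selection]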

oruebel commented 5 months ago

Wrapping for h5py is an interesting approach and seems useful, but I think it may be easier to go through hdmf-zarr first to integrate with PyNWB and use Zarr directly.

oruebel commented 5 months ago

> It is a new type, lindi.LindiDataset

Would 'lindi.LindiDataset' inherit from 'h5py.Dataset'?

magland commented 5 months ago

> Wrapping for h5py is an interesting approach and seems useful, but I think it may be easier to go through hdmf-zarr first to integrate with PyNWB and use Zarr directly.

The problem with using the Zarr Python client here is that it is not flexible enough to link to different types of data as part of the NWB. For example, consider an existing NWB file that has a lot of moderate-sized datasets, plus one giant dataset with a million chunks (raw ephys). It's not practical to represent that as a Zarr. It needs to be a Zarr with a special annotation on that one dataset pointing to the original NWB on DANDI. This is what lindi.LindiClient can handle... with an h5py-compatible API.
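
(Purely illustrative -- the attribute name and layout below are hypothetical, not lindi's actual annotation format. The idea is that the Zarr metadata for the one giant dataset carries a pointer back to the original HDF5 file on DANDI, instead of copies of its million chunks.)

# Hypothetical .zattrs for the giant raw-ephys dataset; the key name
# and structure are invented for illustration. Chunks stay in the
# original HDF5 file on DANDI rather than being rewritten into the
# Zarr store.
zattrs = {
    "_external_hdf5_link": {
        "url": "https://example.org/dandi/example.nwb",
        "path": "/acquisition/ElectricalSeries/data",
    }
}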

> Would 'lindi.LindiDataset' inherit from 'h5py.Dataset'?

Possibly... I'm not sure...

bendichter commented 5 months ago

> For example, consider an existing NWB file that has a lot of moderate-sized datasets, plus one giant dataset with a million chunks (raw ephys). It's not practical to represent that as a Zarr.

It looks like the Zarr community is dealing with this by implementing sharding. @magland, are you familiar with the codec specification for sharding in Zarr that was merged two months ago? Does this resolve the issue with millions of chunks, or are there other issues you are hoping to address with LINDI?
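
(For reference, a sketch of what sharded array creation looks like. This assumes zarr-python v3, where zarr.create_array accepts a shards argument; the shapes below are arbitrary.)

import zarr  # assumes zarr-python >= 3, which supports sharding

# Many small chunks get packed into far fewer shard objects in
# storage: each (10_000, 1_000) shard holds 100 of the (1_000, 100)
# chunks, so the number of stored objects drops by 100x.
arr = zarr.create_array(
    store="data.zarr",
    shape=(100_000, 1_000),
    chunks=(1_000, 100),
    shards=(10_000, 1_000),
    dtype="int16",
)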

magland commented 5 months ago

> It looks like the Zarr community is dealing with this by implementing sharding. @magland, are you familiar with the codec specification for sharding in Zarr that was merged two months ago? Does this resolve the issue with millions of chunks, or are there other issues you are hoping to address with LINDI?

I hadn't seen that, thanks. I think sharding is the right way to go for newly created NWBs. However, I don't think it's practical to re-chunk and re-upload all the existing HDF5 NWB files, and that's where LINDI comes in.

magland commented 5 months ago

@bendichter things are explained better now in the latest README... and the pynwb integration even works!

If that answers your question, we can close this issue.