NeurodataWithoutBorders / lindi

Linked Data Interface (LINDI) - cloud-friendly access to NWB data
BSD 3-Clause "New" or "Revised" License

Specify zarr codec when creating a dataset #35

Closed magland closed 2 months ago

magland commented 3 months ago

The goal here is to allow writing datasets with Zarr-supported codecs that are not available with hdf5 (e.g., zstd and custom codecs). Obviously, the h5py API does not have such a mechanism.

I considered adding a zarr_compressor parameter to create_dataset(), but I thought it best not to let the interface of that function differ between LindiH5pyFile and h5py.File. So instead I added a new function: create_dataset_with_zarr_compressor(..., compressor=codec).

codecov-commenter commented 3 months ago

Codecov Report

Attention: Patch coverage is 47.61905%, with 22 lines in your changes missing coverage. Please review.

Project coverage is 81.53%. Comparing base (bc71a9d) to head (22fc137).

| Files | Patch % | Lines |
|---|---|---|
| ...indi/LindiH5pyFile/writers/LindiH5pyGroupWriter.py | 46.66% | 16 Missing :warning: |
| ...ndi/conversion/create_zarr_dataset_from_h5_data.py | 50.00% | 6 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main      #35      +/-   ##
==========================================
- Coverage   82.33%   81.53%    -0.80%
==========================================
  Files          25       25
  Lines        1715     1755      +40
==========================================
+ Hits         1412     1431      +19
- Misses        303      324      +21
```


rly commented 3 months ago

This looks reasonable to me, but how would this be used? It seems like this would be used for writes outside of PyNWB, which is fine, but I am not totally seeing the use case.

magland commented 3 months ago

> This looks reasonable to me, but how would this be used? It seems like this would be used for writes outside of PyNWB, which is fine, but I am not totally seeing the use case.

@rly I am thinking specifically about three codecs I would like to use. It's true that this would not be possible via PyNWB at this point (it would require manual creation of datasets), but perhaps at some point in the future this could be supported by PyNWB.

The three codecs are:

tag @bendichter

oruebel commented 3 months ago

> It's true that this would not be possible via pynwb at this point

With the ZarrIO backend you can specify dataset codecs via ZarrDataIO: https://github.com/hdmf-dev/hdmf-zarr/blob/4c9af4a56336da2c421531dfa4487a4b1ff7977a/src/hdmf_zarr/utils.py#L399 If I remember correctly, you are going through HDF5IO here, so you would need a way to pass the codec through via H5DataIO: https://github.com/hdmf-dev/hdmf/blob/d85d0cbc36d2e0fdb25e8fbea14d58ba7bf24a40/src/hdmf/backends/hdf5/h5_utils.py#L431 You could make a derived class of H5DataIO for LINDI.

magland commented 2 months ago

note: made a plan with ryan...

magland commented 2 months ago

After chatting with @rly , we decided not to introduce a new special function, but to allow the compression parameter (which is supported by h5py) to be either a string or a numcodecs Codec. In the case of a Codec, this is passed through to zarr as the compressor parameter. In the case of a string, the following logic is used:

```python
        elif isinstance(compression, str):
            if compression == 'gzip':
                if compression_opts is None:
                    level = 4  # default for h5py
                elif isinstance(compression_opts, int):
                    level = compression_opts
                else:
                    raise Exception(f'Unexpected type for compression_opts: {type(compression_opts)}')
                _zarr_compressor = numcodecs.GZip(level=level)
            else:
                raise Exception(f'Compression {compression} is not supported')
```

Then _zarr_compressor is passed through to zarr.
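To make the dispatch concrete, here is a dependency-free sketch of the logic described above. The function name `resolve_zarr_compressor` is hypothetical, and the `Codec`/`GZip` classes are minimal stand-ins for `numcodecs.abc.Codec` and `numcodecs.GZip` so the sketch runs without numcodecs installed; the real code passes actual numcodecs objects through to zarr.

```python
class Codec:
    """Stand-in for numcodecs.abc.Codec (hypothetical, for illustration)."""


class GZip(Codec):
    """Stand-in for numcodecs.GZip."""
    def __init__(self, level: int = 4):
        self.level = level


def resolve_zarr_compressor(compression, compression_opts=None):
    """Map an h5py-style `compression` argument to a zarr compressor.

    A Codec instance passes straight through; the string 'gzip' is
    translated to a GZip codec, honoring compression_opts as the level.
    """
    if isinstance(compression, Codec):
        # A numcodecs Codec is forwarded to zarr as the compressor as-is.
        if compression_opts is not None:
            raise Exception('compression_opts not supported with a Codec')
        return compression
    elif isinstance(compression, str):
        if compression == 'gzip':
            if compression_opts is None:
                level = 4  # default for h5py
            elif isinstance(compression_opts, int):
                level = compression_opts
            else:
                raise Exception(f'Unexpected type for compression_opts: {type(compression_opts)}')
            return GZip(level=level)
        else:
            raise Exception(f'Compression {compression} is not supported')
    raise Exception(f'Unexpected type for compression: {type(compression)}')
```

So `create_dataset(..., compression='gzip', compression_opts=9)` keeps its h5py meaning, while `create_dataset(..., compression=numcodecs.Zstd(level=5))` (or any other Codec) reaches zarr unchanged.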

rly commented 2 months ago

Looks good. We can add support for other h5py compressions, including from hdf5plugin, later as needed.