AllenNeuralDynamics / aind-ophys-utils


bergamo and mindscope tiffs: Determine file format HDF5/Zarr #19

Open arielleleon opened 12 months ago

arielleleon commented 12 months ago

TL;DR: From what I have researched for this issue, Zarr is HDF5 on steroids. It has an easy-to-use API and an interface that facilitates cloud reads and writes. It also offers 20 different compression codecs via numcodecs, whereas HDF5 only ships with six built-in compression filters. Zarr (like HDF5) is compatible with tools like Xarray for reading large datasets. With end users in mind, I would choose to package data in the Zarr format, and I am open to hearing from you all.

Zarr inherits its hierarchical structure (groups and datasets) from HDF5 and its easy-to-use array API from NumPy. Both HDF5 and Zarr support chunking and compression. Zarr offers 20 different compression codecs via numcodecs, with Blosc as the default compressor.
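A minimal sketch of what that write path looks like, assuming zarr-python 2.x and numcodecs; the movie shape, chunk size, and codec settings here are made up for illustration:

```python
import numpy as np
import zarr
from numcodecs import Blosc

# Hypothetical two-photon movie: 10,000 frames of 512 x 512 pixels.
movie = np.zeros((10_000, 512, 512), dtype=np.uint16)

# Chunk along the time axis and compress with Blosc, Zarr's default
# compressor family (zstd inside Blosc is an arbitrary choice here).
compressor = Blosc(cname="zstd", clevel=3, shuffle=Blosc.BITSHUFFLE)
z = zarr.open(
    "movie.zarr",
    mode="w",
    shape=movie.shape,
    chunks=(1_000, 512, 512),
    dtype=movie.dtype,
    compressor=compressor,
)
z[:] = movie  # NumPy-style assignment; chunked and compressed on write
```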

With Zarr, one can inspect how the data are packaged interactively (via the array's `.info` property) and in the `.zarray` JSON on disk. Zarr also writes a `.zattrs` JSON with all attributes associated with the dataset object, which is a nifty thing to be able to see.
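For example, continuing the sketch above (the attribute name is made up):

```python
import zarr

z = zarr.open("movie.zarr", mode="r+")
print(z.info)                    # shape, chunks, compressor, bytes stored, ratio
z.attrs["frame_rate_hz"] = 30.0  # hypothetical attribute; persisted to .zattrs
print(dict(z.attrs))
# On disk, movie.zarr/ now contains .zarray (array metadata) and .zattrs.
```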

From what I read, chunking over the time dimension is optimal. This will need to be tested; a rough benchmark sketch follows.
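One way to test it, again assuming zarr-python 2.x; the shapes, chunk layouts, and single-frame read pattern are placeholders for whatever access pattern we actually care about:

```python
import time
import numpy as np
import zarr

def time_frame_reads(chunks, n_frames=2_000, n_reads=100):
    """Time single-frame reads for a given chunk layout (rough sketch)."""
    z = zarr.open("bench.zarr", mode="w", shape=(n_frames, 512, 512),
                  chunks=chunks, dtype="uint16")
    z[:] = np.zeros((n_frames, 512, 512), dtype="uint16")
    t0 = time.perf_counter()
    for i in range(n_reads):
        _ = z[i]  # read one frame at a time
    return time.perf_counter() - t0

for chunks in [(1, 512, 512), (100, 512, 512), (1_000, 512, 512)]:
    print(chunks, f"{time_frame_reads(chunks):.3f} s")
```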

Zarr has an API that facilitates cloud uploads, which I think users will find incredibly useful.
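For example, writing straight to S3 via s3fs; the bucket name and prefix are placeholders, and valid AWS credentials are assumed:

```python
import s3fs
import zarr

# Hypothetical bucket/prefix; requires s3fs and AWS credentials.
fs = s3fs.S3FileSystem()
store = s3fs.S3Map(root="my-bucket/ophys/movie.zarr", s3=fs, check=False)

z = zarr.open(store, mode="w", shape=(10_000, 512, 512),
              chunks=(1_000, 512, 512), dtype="uint16")
z[0] = 0  # writes land directly in S3, one object per chunk
```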

The biggest downfall of HDF5 is its limited set of built-in compression filters. If we decide to use a third-party filter, we will have to ensure that every end user also has it installed.
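For contrast, a minimal h5py sketch: sticking to a built-in filter like gzip keeps the file readable everywhere, while anything fancier (e.g. a filter shipped via hdf5plugin) becomes an extra dependency on the reader's side:

```python
import h5py
import numpy as np

movie = np.zeros((1_000, 512, 512), dtype="uint16")

with h5py.File("movie.h5", "w") as f:
    # gzip is built into HDF5, so any reader can open this file;
    # a third-party filter would require the same plugin for every reader.
    f.create_dataset("data", data=movie, chunks=(100, 512, 512),
                     compression="gzip", compression_opts=4)
```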

Notes from David regarding Zarr:

No MATLAB support yet (see MATLAB implementation of Zarr · Issue #16 · zarr-developers/community on github.com). I expect this will come in time, so I am not too concerned.

Shouldn't be an issue unless there is algorithm development in this space.