NeurodataWithoutBorders / pynwb

A Python API for working with Neurodata stored in the NWB Format
https://pynwb.readthedocs.io

Evaluate Zarr as a possible alternate array data / storage backend #230

Closed oruebel closed 1 year ago

oruebel commented 6 years ago

Possible Future Enhancement

This ticket is mainly a note about a possible future enhancement, i.e., something to look at down the road. I just came across the Zarr library ( http://zarr.readthedocs.io/en/latest/index.html ). The library implements an interface that looks similar to h5py but is designed more generally for chunked, compressed, N-dimensional arrays, with the ability to serialize data to files on disk, inside a Zip file, or on S3. As such, it might be a candidate for implementing alternate storage backends in the future. It looks like Zarr is still in the early stages of development.

Problem/Use Case

This ticket will be mainly relevant to future discussion of possible additional storage backends alongside HDF5.

jakirkham commented 5 years ago

Thanks for raising this issue. It would be great to discuss your experiences trying Zarr. Opened issue ( https://github.com/zarr-developers/zarr/issues/333 ) to collect any feedback and/or suggestions you may have.

oruebel commented 5 years ago

@jakirkham I tried dropping in Zarr as a backend a while ago. Getting the base primitives to work (i.e., Groups, Datasets, Attributes) is not too complicated, because you can pretty much copy the HDF5IO backend in FORM and replace h5py. The part that gets trickier is how to deal with links and with datasets of object and region references. As far as I can tell, those concepts are not supported by Zarr. That is basically where I stopped, because I didn't have the time to implement those features in a cross-platform way.

jakirkham commented 5 years ago

Could you please explain a bit more about why those features are needed for NWB?

oruebel commented 5 years ago

NWB:N stores data from complex neurophysiology experiments. As such, there are large collections of different kinds of data and metadata that are related to each other. Links to other objects are critical for letting users identify, e.g., related metadata, and for avoiding duplicate storage of the same data. Datasets of object references serve a similar purpose but allow us to reference multiple objects, e.g., which TimeSeries belong to a sweep or epoch. Region references then allow us to transparently reference subsets of datasets, which is critical to support annotation of data (e.g., epochs), referencing of subsets of metadata (e.g., select electrodes), and ragged arrays, e.g., to create variable-length data vectors in tables. Ultimately, linking data allows us to make relationships between data explicit, avoid data duplication, and model complex relationships between data that a purely hierarchical structure can't express. For details on where links, object references, and region references are used in NWB:N, please see the format specification: https://nwb-schema.readthedocs.io/en/latest/format.html
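To make these concepts concrete, here is a hedged sketch of object and region references in h5py (these are the HDF5 features Zarr would need an equivalent for; the file name, paths, and data here are made up for illustration):

```python
import h5py

# In-memory HDF5 file (core driver, nothing written to disk)
f = h5py.File("example.h5", "w", driver="core", backing_store=False)
ts = f.create_dataset("acquisition/ts1", data=[1.0, 2.0, 3.0, 4.0])

# Dataset of object references, e.g. "which TimeSeries belong to an epoch"
refs = f.create_dataset("epoch_members", (1,), dtype=h5py.ref_dtype)
refs[0] = ts.ref
member = f[refs[0]]   # dereference back to the TimeSeries dataset

# Region reference to a subset of the data, e.g. an epoch's time window
region = ts.regionref[1:3]
subset = ts[region]   # just the selected elements
```

Links (hard/soft/external) are a third HDF5 feature in this family; none of these have a native counterpart in Zarr, which is why a cross-backend solution is needed.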

jakirkham commented 5 years ago

Thanks for the details. Have raised issue ( https://github.com/zarr-developers/zarr/issues/389 ) to discuss how best to implement this in Zarr. Feel free to chime in over there.

oruebel commented 5 years ago

@jakirkham thanks for creating the Zarr developer issue. If this can be done, I think it would be a great opportunity to also compare the different Zarr backends and HDF5 for different NWB:N use cases.

oruebel commented 5 years ago

See also #629

chrisroat commented 4 years ago

I think this is a very interesting discussion. While I've become a user of many of the tools people on this thread are developing, I'm not down in the trenches, so please forgive if my comments here are naive.

It has been pointed out that for full zarr<->NWB integration, zarr would need links and references. If I take a step back, this feels like it's putting extra pressure and complexity on the raw data format. Is the use of zarr attributes, which seems to be what is happening in https://github.com/hdmf-dev/hdmf/pull/98, going to be sufficient? It seems like HDMF is a nice middle layer that coordinates things like this, though I could imagine something general like some of the proposals in https://github.com/zarr-developers/zarr-specs/issues/49

jakirkham commented 4 years ago

If working with Zarr attributes would be sufficient, that seems like a good path forward for getting something working today. Suspect even if links were added in Zarr 3.0 it would be an extension protocol (so not guaranteed to be implemented). Whereas attributes already make sense to keep in Zarr 3.0. So there's a good chance an attribute based solution will work seamlessly between Zarr 2 and 3.

jakirkham commented 4 years ago

@oruebel @chrisroat, I wonder if you could stop by the Zarr meeting in 2 weeks? This would be an interesting use case to discuss and timely as well since we are working on the Zarr v3 spec. Details in issue ( https://github.com/zarr-developers/community/issues/1 ). Please check the latest comment for agenda, call link, and meeting time.

cc @thewtex (who may be interested in this or know others who would be interested in this as well)

chrisroat commented 4 years ago

Yeah, I could stop in. It's on the calendar. Tagging @bendichter from NWB.

I think MATLAB would be an important use case, from my brief time in neuro now. https://github.com/zarr-developers/community/issues/16#issuecomment-610058976

thewtex commented 4 years ago

@jakirkham I will plan on attending the next meeting, too.

@mgrauer

jmdelahanty commented 3 years ago

Hello! I just learned about the Zarr library and it looks pretty neat! I was wondering how I might help work on something like this. Are there any active branches looking into Zarr?

oruebel commented 3 years ago

Are there any active branches looking into Zarr?

@jmdelahanty PR https://github.com/hdmf-dev/hdmf/pull/98 on HDMF implements a Zarr backend, and https://github.com/NeurodataWithoutBorders/pynwb/pull/1018 is the corresponding PR on PyNWB to set up the Zarr backend for NWB. The PRs are a bit stale, but since they mostly add new functionality, I don't think syncing them with the current dev branches should be too hard. The latest state (before I had to put those PRs to the side) was that the code was working and mostly fully functional. The main piece missing in the HDMF PR is to work out some of the details with links when converting from HDF5 to Zarr (and vice versa). I.e., with the code from the PRs I was able to read/write NWB data to/from Zarr as well as convert NWB files from HDF5 to Zarr via the export function (but, as I said, there are a few corner cases with links that need to be worked out). Dealing with links is in general one of the main hurdles, as Zarr does not support links and object references natively, so the PRs implement a custom solution for links (essentially storing definitions of links as JSON and adding reserved attributes to help distinguish between regular datasets/groups and links).

In terms of publications related to this, the following papers may be of interest:

Another approach to working with Zarr is to create JSON metadata files that allow Zarr to read from HDF5 files directly. @bendichter had experimented with this at some point and was able to stream NWB data from S3 via Zarr. However, since HDF5 now (as of h5py 3.2) also has an S3 driver, we have been focusing more on that approach for reading data from S3. Support for h5py 3.x (and S3 read support) will be part of the upcoming main HDMF/PyNWB releases.

I was wondering how I might help work on something like this.

What are your main interests and use cases for NWB+Zarr? So far this work has been mainly exploratory, to evaluate what we can do with Zarr. However, it has not yet reached a production-ready state, among other reasons because of: 1) the overhead of supporting multiple storage backends; 2) the lack of support for Zarr in other languages (e.g., supporting Zarr in MatNWB will be tricky); 3) a lack of funding for Zarr integration; and 4) a lack of clear use cases that require Zarr and would justify the effort needed to make this production-ready. Happy to chat if this is something you are interested in diving into more.

oruebel commented 1 year ago

This is being addressed in https://github.com/hdmf-dev/hdmf-zarr

jmdelahanty commented 1 year ago

I somehow never saw your reply to me so I'm sorry I missed out on seeing this come to fruition, but congratulations!! Very cool!