Bears-R-Us / arkouda

Arkouda (αρκούδα): Interactive Data Analytics at Supercomputing Scale :bear:

support for Zarr I/O #687

Closed rabernat closed 2 years ago

rabernat commented 3 years ago

Thanks for creating and sharing this amazing package!

I represent the Pangeo community, a group of scientists and software devs working on big-data geoscience research. We currently use Dask heavily for distributed parallelism but would like to evaluate Arkouda to see if it meets our needs.

Our community has been involved in developing a new array storage library called Zarr. Zarr emerged as an alternative to HDF5 with better support for parallel writes and cloud object storage. Its performance compared to HDF5 is favorable in certain scenarios. It now has implementations in Python, Java, C++, Julia, and JavaScript. Much of our existing data is in Zarr.

I was wondering if Zarr was on your radar and if you would consider supporting it as a distributed I/O format in Arkouda.

Thanks again for your work on open source. I sincerely appreciate your time and effort.

ronawho commented 3 years ago

Hi @rabernat, I'm not on the core Arkouda team so I can't speak to the support aspect, but I find this interesting since I've been looking at HDF5 performance in https://github.com/mhmerrill/arkouda/issues/632. Are there C bindings for Zarr? Arkouda is backed by Chapel and Chapel has good interoperability with C, but C++ interop is trickier.

Pinging @mhmerrill and @reuster986 in case they haven't seen this issue.

rabernat commented 3 years ago

C bindings for Zarr are currently in development by @DennisHeimbigner of Unidata. Perhaps he can give us an update on his progress?

ronawho commented 3 years ago

Are the Python and Java implementations native? I was originally guessing they were calling out to C bindings, but that doesn't make sense if the C bindings are still in development.

rabernat commented 3 years ago

They are native. Part of the appeal of Zarr is that the spec is simple enough that it is relatively easy to reimplement from any modern language that supports blosc compression.
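To illustrate how small the on-disk format is, here is a minimal sketch of a Zarr v2 round trip using only the Python standard library. It assumes the simplest possible case: a 2-D `<i4` array, C order, no compressor, and no filters. The helper names (`write_array`, `read_array`, `roundtrip`) are made up for this example and are not part of any Zarr library; a real implementation would also need to handle compressors such as blosc, other dtypes, and nested groups.

```python
import json
import struct
import tempfile
from pathlib import Path


def write_array(root: Path, rows, chunk_shape):
    """Write a 2-D list of int32 values as an uncompressed Zarr v2 array."""
    n_rows, n_cols = len(rows), len(rows[0])
    c_rows, c_cols = chunk_shape
    # Array metadata lives in a JSON document named ".zarray".
    meta = {
        "zarr_format": 2,
        "shape": [n_rows, n_cols],
        "chunks": [c_rows, c_cols],
        "dtype": "<i4",
        "compressor": None,
        "fill_value": 0,
        "order": "C",
        "filters": None,
    }
    root.mkdir(parents=True, exist_ok=True)
    (root / ".zarray").write_text(json.dumps(meta))
    # Each chunk is stored under the key "<row index>.<col index>".
    for ci in range(-(-n_rows // c_rows)):
        for cj in range(-(-n_cols // c_cols)):
            values = []
            for i in range(ci * c_rows, (ci + 1) * c_rows):
                for j in range(cj * c_cols, (cj + 1) * c_cols):
                    # Edge chunks are stored at full size, padded with fill_value.
                    inside = i < n_rows and j < n_cols
                    values.append(rows[i][j] if inside else meta["fill_value"])
            payload = struct.pack(f"<{len(values)}i", *values)
            (root / f"{ci}.{cj}").write_bytes(payload)


def read_array(root: Path):
    """Read back an array written by write_array (same restrictions apply)."""
    meta = json.loads((root / ".zarray").read_text())
    assert meta["zarr_format"] == 2 and meta["dtype"] == "<i4"
    assert meta["compressor"] is None and meta["order"] == "C"
    n_rows, n_cols = meta["shape"]
    c_rows, c_cols = meta["chunks"]
    out = [[meta["fill_value"]] * n_cols for _ in range(n_rows)]
    for ci in range(-(-n_rows // c_rows)):
        for cj in range(-(-n_cols // c_cols)):
            chunk_file = root / f"{ci}.{cj}"
            if not chunk_file.exists():
                continue  # a missing chunk means "all fill_value"
            values = struct.unpack(f"<{c_rows * c_cols}i", chunk_file.read_bytes())
            for k, value in enumerate(values):
                i, j = ci * c_rows + k // c_cols, cj * c_cols + k % c_cols
                if i < n_rows and j < n_cols:  # skip edge padding
                    out[i][j] = value
    return out


def roundtrip(rows, chunk_shape):
    """Write rows to a temporary directory store and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "example.zarr"
        write_array(root, rows, chunk_shape)
        return read_array(root)
```

Because every chunk is an independent file (or object), different workers can read or write different chunks without coordinating, which is where the parallel-write and object-storage benefits come from.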

DennisHeimbigner commented 3 years ago

The next netcdf-c (https://www.unidata.ucar.edu/software/netcdf/) release (version 4.8.0) is due out shortly. It supports reading and writing zarr through the netcdf-c API, which is a C-language library. With respect to Chapel (assuming it refers to https://chapel-lang.org/), there would be a couple of issues. One problem is that you would need to wrap the netcdf-c API for Chapel. This is probably not too difficult, since wrappers for the netcdf API already exist for a large number of languages. Perhaps the bigger problem is that, for historical reasons, the netcdf-c library currently does not support threading. However, this is a high priority for Unidata, especially for the zarr implementation. You can find some documentation on the netcdf zarr implementation (aka NCZarr) in this document:

https://github.com/Unidata/netcdf/blob/master/NUG/nczarr.md

rabernat commented 3 years ago

Thanks @DennisHeimbigner for your quick reply!

ronawho commented 3 years ago

> They are native. Part of the appeal of Zarr is that the spec is simple enough that it is relatively easy to reimplement from any modern language that supports blosc compression.

Oh, that's interesting. I wonder if it'd be easy enough to add a native Chapel port that called out to c-blosc or something.

ronawho commented 3 years ago

@DennisHeimbigner yeah, it's https://chapel-lang.org/.

Chapel has wrappers for NetCDF and HDF5 C libraries. The HDF5 ones are used in Arkouda for HDF5 support and we make sure that only one thread per node calls into HDF5 because of the lack of threading support. This limits performance at low node counts, but performance does usually scale as you add more nodes.

For NetCDF I think the Chapel wrappers were created a year or two ago for 4.6.1, but I don't think it'd be too hard to update to a newer version.
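The "only one caller at a time" rule described above for HDF5 can be sketched in a few lines. This is a Python analogy under stated assumptions, not Arkouda's actual Chapel code: `FakeReader` is a made-up stand-in for a C library that is not thread-safe, and `guarded_read` and `read_all` are hypothetical helpers. The lock models a single node; in the real setting, throughput grows by adding nodes, each with its own single caller.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeReader:
    """Hypothetical stand-in for a non-thread-safe I/O library."""

    def __init__(self):
        self.active = 0      # callers currently inside read()
        self.max_active = 0  # highest concurrency ever observed

    def read(self, block_id):
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        time.sleep(0.001)  # pretend to do some I/O
        self.active -= 1
        return block_id * 2


# One lock per process: at most one thread may be inside the library.
_io_lock = threading.Lock()


def guarded_read(reader, block_id):
    """Call into the library while holding the process-wide lock."""
    with _io_lock:
        return reader.read(block_id)


def read_all(reader, block_ids, workers=4):
    """Issue reads from several threads; the lock serializes the library calls."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda b: guarded_read(reader, b), block_ids))
```

With the lock in place, the worker threads still overlap any work done outside the library, but the library itself only ever sees one caller, which matches the trade-off described above: limited per-node I/O parallelism in exchange for safety.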

DennisHeimbigner commented 3 years ago

OK. Probably the only major change is to the filter support functions, to bring them more in line with the underlying HDF5 filter support. To use the zarr support, all you should need to do is:

  1. Get the Amazon aws-sdk-cpp library installed so you can use the Amazon S3 support.
  2. Make sure that Chapel can pass a URL as the path argument to nc_open and nc_create. You can see some examples of the URL format in the document to which I referred.

If you want to try to use it in advance of the 4.8.0 release, you can try pulling and building this branch into a directory called xarray using this command:

git clone https://github.com/DennisHeimbigner/netcdf-c.git --depth=1 -b nczarr_xarray.dmh xarray

This will produce a shallow clone. Be careful not to overwrite an existing directory named xarray.

This version supports the xarray _ARRAY_DIMENSIONS convention. If you do not need this, then you can pull the current netcdf-c master from GitHub.

rabernat commented 3 years ago

Dennis, I'm pretty sure this community is focused on HPC, not cloud. Is the AWS SDK required if you just want to do Zarr on disk?

mhmerrill commented 3 years ago

@rabernat we have been discussing how to support multiple I/O formats without making the codebase overly complex. This might result in some sort of I/O abstraction layer for Arkouda. Others have already asked for the Parquet format as well. We currently only have 4 people on the core team. That said, we have accepted PRs from others like @ronawho. I am open to discussion and development to support other I/O formats, but it will probably require help.

DennisHeimbigner commented 3 years ago

The sdk is only required if you plan to use the cloud. I forgot to mention that using zip files requires access to the libzip library (https://libzip.org/), but that is an easy build. You mentioned Parquet; is there any known mapping between Parquet and zarr or Parquet and S3?

rabernat commented 3 years ago

Parquet is a sharded columnar storage format appropriate for tabular data. It can be used on cloud storage or disk. It shares some of the "cloud optimized" properties of Zarr, but it is a separate format.

mhmerrill commented 3 years ago

@rabernat maybe this is a new feature/support request for the Chapel team and/or the HPE AI team. We originally chose HDF5 because it was supported both in Python and Chapel. What do you think @bradcray @ben-albrecht ?

bradcray commented 3 years ago

@mhmerrill / @rabernat: From the Chapel perspective, we're open to adding support for I/O interfaces and formats that our current users want or that would help grow our community (where growing Arkouda's community would also grow ours). If Zarr meets those criteria, we'd certainly be open to taking it on. Opening a feature request on Chapel's GitHub issues page proposing what's desired and what it would take would probably be the best way to make the request.

To @DennisHeimbigner's point, lack of thread safety is obviously not ideal, but we've run into that with other I/O libraries, so it's not a showstopper either; it just means that we (or the user) have to be careful not to call into the routines in parallel (I assume?).

Of course, accessing a library supporting a C interface from Chapel doesn't require any support from the Chapel team in that we have features designed to help users call from Chapel to C directly (such as c2chapel or extern blocks and declarations). That said, there still may be reasons to make the feature request such as (a) creating a more Chapeltastic port of the interface rather than a C-like transliteration; (b) leveraging the enthusiasm of the open-source community (where currently a large number of students are descending looking for projects that would prove them to be good GSoC candidates). However, a typical first step to getting any nice interface to a library in Chapel is to get the C-level interface working first, and then wrapping it.

stress-tess commented 2 years ago

It seems like we've decided this issue is best suited for the Chapel team and should be moved to Chapel's GitHub issues page if you still want it worked on.

For that reason, I'm going to recommend this issue be closed.

mhmerrill commented 2 years ago

I concur with closing until Chapel support is there.

ronawho commented 2 years ago

FWIW, even if this is something the core Chapel team works on, I would expect we'd add support to Arkouda first, much like what happened with the recent Parquet support. That said, it doesn't really matter to me whose repo the feature request is opened against.

@rabernat just checking in, is this something you're still interested in?

rabernat commented 2 years ago

Thanks for checking! Yes we are still interested, but I wouldn't characterize this as especially urgent.

There's a bit of a 🐔 vs 🥚 situation here. Our community doesn't currently use Arkouda, but we are curious about it. Implementing Zarr support would make it easier for us to explore Arkouda. However, since we are not currently active Arkouda users, we are unlikely to become vocal advocates for this feature among your user base. Therefore, it will probably appear to you that this is not a high priority.