asascience-open / nextgen-dmac

Public repository describing the prototyping efforts and direction of the Next-Gen DMAC project, "Reaching for the Cloud: Architecting a Cloud-Native Service-Based Ecosystem for DMAC"
MIT License

Using NetCDF4 instead of Zarr format for Cloud-Optimized Data #11

Closed: rsignell-usgs closed this issue 1 year ago

rsignell-usgs commented 2 years ago

I was looking at the ingest document and saw that rechunking to Zarr format was mentioned.

The Zarr format was invented to address a perceived problem that the NetCDF4 format was not cloud-performant.

We have since learned that it's not the NetCDF4 format but the NetCDF4 Library that is not cloud-performant.

We can now read collections of NetCDF4 files as a single virtual dataset using the Zarr library in Python with access speeds identical to reading Zarr format data.

Since NetCDF4 files are friendlier to R and other non-Python users, it's preferable to distribute cloud-optimized data as collections of NetCDF4 files. Python users can still access the collection of NetCDF4 files as a single dataset (lazily, of course) in Xarray from an Intake catalog using fsspec's ReferenceFileSystem (just as if it were in Zarr format).
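For the Python side, here's a minimal sketch of that access pattern, assuming a kerchunk-style reference JSON has already been generated for the collection (the bucket, file name, and options below are placeholders):

```python
import fsspec
import xarray as xr

# Hypothetical reference JSON describing a collection of NetCDF4 files on S3
refs_url = "s3://example-bucket/references/model_collection.json"

# ReferenceFileSystem presents the NetCDF4 chunks as a virtual Zarr store;
# no data are copied, the byte ranges come from the original files.
fs = fsspec.filesystem(
    "reference",
    fo=refs_url,
    target_options={"anon": True},
    remote_protocol="s3",
    remote_options={"anon": True},
)

# xarray opens the whole collection as one lazy dataset via the Zarr engine
ds = xr.open_dataset(
    fs.get_mapper(""),
    engine="zarr",
    backend_kwargs={"consolidated": False},
    chunks={},  # dask-backed, so nothing is read until you compute
)
print(ds)
```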

It's still important to consider rechunking the NetCDF4 files for optimal performance for a variety of use cases.
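As a hedged illustration of what that rechunking can look like in practice (the file name, variable name, and chunk shapes below are all placeholders; chunk sizes must not exceed the actual dimension lengths), one could rewrite a NetCDF4 file with internal chunking that favors time-series access:

```python
import xarray as xr

# Hypothetical input file with a "temp" variable dimensioned (ocean_time, lat, lon)
ds = xr.open_dataset("ocean_his_0001.nc")

# Rewrite with internal HDF5 chunking that favors time-series extraction
# at a point (many time steps, small spatial tile), plus light compression.
encoding = {
    "temp": {"chunksizes": (672, 16, 16), "zlib": True, "complevel": 4},
}
ds.to_netcdf("ocean_his_0001_rechunked.nc", engine="netcdf4", encoding=encoding)
```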

Probably worth some discussion. :)

jonmjoyce commented 2 years ago

The Data Conversion step mentioned there is meant to be an example or a possible step in the ingest process. I've noted that converting everything to Zarr is likely a no-go as a requirement, but some data owners may want to provide Zarr data. If instead one wants to rechunk NetCDF files, then that is what would occur in that step.

We also need to keep in mind that we'll need to support a variety of file types, not just grids. How can we find common ground among those? I think it will come down to consistent metadata standards to describe those objects.

In terms of making data analysis-ready, what is the common ground that we need to look for? Is a bucket of netCDF/Zarr files (à la NODD) sufficient? If so, what conventions do we need to adhere to in order to make these buckets interoperable?

jonmjoyce commented 2 years ago

In reference to NODD, I think we also need to explore workflows where data has already been uploaded to the cloud but we want to make it discoverable and interoperable in DMAC without making copies of the data.

mwengren commented 2 years ago

Thanks for the input @rsignell-usgs! In a way this is reassuring to read; however, given that Zarr has been all the rage for cloud data optimization over the last 4+ years, it's probably not cut and dried which format to use for different use cases.

Are there advantages to using Zarr rather than netCDF if, for argument's sake, you're willing to overlook R compatibility? How difficult is it to re-chunk a large netCDF dataset and generate the referenceFileSystem metadata, versus converting to Zarr and re-chunking in the process? Maybe it isn't difficult, I'm not sure. And what about other APIs' compatibility with referenceFileSystem metadata: can the OGC EDR API read it, for example, as it does Zarr?

We can do some experimentation as part of this project. Maybe we should convert a few good candidate datasets to both Zarr and netCDF/RFS, put them both on an open-access cloud object store, update them in real time (as with forecast model output), and see which attracts more use and/or can be used in more downstream products that we either build ourselves or others do.
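For the Zarr half of that experiment, the write is straightforward once a chunking is chosen; a rough sketch with placeholder paths, dimension names, and chunk sizes (the netCDF/RFS half would reuse the kerchunk workflow sketched later in this thread):

```python
import xarray as xr

# Placeholder source files and target bucket
ds = xr.open_mfdataset("forecast_*.nc", combine="by_coords")

# Decide the chunking once, then write a Zarr copy to object storage
ds = ds.chunk({"time": 24, "lat": 256, "lon": 256})
ds.to_zarr(
    "s3://example-bucket/forecast.zarr",
    mode="w",
    consolidated=True,
    storage_options={"anon": False},
)
```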

Also, should we take into account analyses that have already been done on the pros/cons of different cloud-optimized formats? For example, the ESIP Cloud Computing cluster guide, though maybe that is already outdated?

rsignell-usgs commented 1 year ago

@mwengren, the kerchunk/referenceFileSystem approach uses the Python Zarr library to read the collection of NetCDF files, so yes, any Python tool that reads the Zarr format can read collections of NetCDF4 files (or GRIB files) via referenceFileSystem.
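To make that concrete, a small sketch at the zarr-library level (no xarray involved), with a hypothetical reference JSON and variable name; the same pattern applies whether the references were built from NetCDF4 or GRIB files:

```python
import fsspec
import zarr

# Hypothetical kerchunk reference file describing a GRIB or NetCDF4 collection
mapper = fsspec.get_mapper(
    "reference://",
    fo="s3://example-bucket/references/gfs.json",
    target_options={"anon": True},
    remote_protocol="s3",
    remote_options={"anon": True},
)

# Plain zarr sees the virtual dataset as an ordinary Zarr group
grp = zarr.open(mapper, mode="r")
print(grp.tree())              # browse the variable hierarchy
print(grp["t2m"][0, :3, :3])   # hypothetical variable name
```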

mpiannucci commented 1 year ago

I am going to take a stab at quickly prototyping kerchunking of GFS and OFS data from NODD. A few steps I will take:

  1. Kerchunk a single model run output of GFS (or GFS Wave)
  2. Implement something smart that can automatically deduplicate variables, making sure that the variables kept are from the newest model runs. This is easy with xarray, but I need to figure out the flow with kerchunk. This is how we create the TDS "Best" dataset.
  3. Test serving up the data with zarrdap or restful-grids (our code-sprint hack router extending xpublish)

Beyond this, the next steps would be creating Intake catalogs for downstream consumers to read from. But I'm going to start at a high level to work through the functionality we rely on from TDS in our production systems.
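For reference, a rough sketch of what steps 1 and 2 above could look like with kerchunk's SingleHdf5ToZarr and MultiZarrToZarr, using NetCDF-style OFS files as the example (GRIB output would go through kerchunk.grib2.scan_grib instead). The bucket, paths, and dimension name are placeholders, and the "keep the newest run" deduplication logic would still need to be layered on top:

```python
import json
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

fs = fsspec.filesystem("s3", anon=True)

# Placeholder NODD-style paths to one model run's NetCDF output
urls = ["s3://" + p for p in fs.glob("s3://example-nodd-bucket/tbofs/*.nc")]

# Step 1: build a reference set for each file
refs = []
for u in urls:
    with fs.open(u) as f:
        refs.append(SingleHdf5ToZarr(f, u, inline_threshold=300).translate())

# Step 2 (simplified): concatenate along the time dimension into one virtual
# dataset; real "Best"-style deduplication of overlapping forecast hours
# would need extra logic on top of this.
combined = MultiZarrToZarr(
    refs,
    concat_dims=["ocean_time"],
    remote_protocol="s3",
    remote_options={"anon": True},
).translate()

with open("tbofs_combined.json", "w") as f:
    json.dump(combined, f)
```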

rsignell-usgs commented 1 year ago

@mpiannucci , this sounds great!

BTW, I have a message in my inbox about upcoming development efforts on zarrdap. I'm going to ask them if they are okay discussing the effort out in the open at https://github.com/NCEI-NOAAGov/zarrdap/issues

rsignell-usgs commented 1 year ago

@mpiannucci here's the notebook we started on together yesterday, which kerchunks a single TBOFS file: https://jupyter.qhub.esipfed.org/hub/user-redirect/lab/tree/shared/users/rsignell/notebooks/COAWST/rpsgroup_kerchunk.ipynb