ecmwf / cfgrib

A Python interface to map GRIB files to the NetCDF Common Data Model following the CF Convention using ecCodes
Apache License 2.0

Idea: Read Grib2 directly from object storage using the Zarr library? #189

Open · rsignell-usgs opened this issue 3 years ago

rsignell-usgs commented 3 years ago

Cfgrib Enthusiasts,

Earlier this week we (@martindurant and @ajelenak) published a blog post about accessing netCDF4/HDF5 files in a cloud-performant way using the Zarr library.

On a Pangeo call on Wednesday, @rabernat suggested trying the same approach with remote grib2 files.

Seems like the two things we would need are:

1. a way to scan a grib2 file and record the byte offset and length of each message in a reference file (as fsspec-reference-maker does for HDF5 chunks), and
2. a way to decode the raw bytes of a grib2 message into an array on the Zarr side (e.g. a numcodecs codec).

If we could get this going, it would be pretty amazing. There is a lot of forecast model output being pushed to the cloud in grib2 format, and we could even create a "best time series" dataset that uses chunks from multiple forecast grib files!

What do you grib2 gurus think of this idea?

mhaberler commented 3 years ago

Seconded - I believe efficient remote access to subsets of gribs is key, and that already works fine with netCDF and OPeNDAP; if you use xarray + dask you get lazy loading on top.

This is the reason I currently convert gribs to netCDF: to make them range-accessible.
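
A minimal sketch of that lazy pattern, assuming a hypothetical OPeNDAP URL and variable name:

```python
import xarray as xr

# Hypothetical OPeNDAP endpoint and variable name, for illustration only.
# Passing chunks= wraps the variables in dask arrays, so no data moves
# until a subset is actually computed.
ds = xr.open_dataset(
    "https://example.com/thredds/dodsC/forecast.nc",
    chunks={"time": 1},
)
field = ds["air_temperature"].isel(time=0)  # still lazy, nothing fetched
values = field.values                       # triggers the remote read
```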

rsignell-usgs commented 3 years ago

GRIB2 gurus, on this wgrib2 page it's stated that:

> complex packing with spatial differencing is a good trade off between speed and size

Are there just a few grib2 compression/filter options that are commonly used?

martindurant commented 3 years ago

Having worked on fsspec-reference-maker for the HDF5 case, I second the proposal of generating reference files for GRIB2 data, to allow referencing whole structured sets of grib files as a single zarr dataset, and so allow parallel access and logical subselection.

This appears to be the list of possible data representation types in GRIB2: https://www.nco.ncep.noaa.gov/pmb/docs/grib2/grib2_doc/grib2_table5-0.shtml

Each of those links to further details, and each type has many options. It's rather complicated! Coordinates can be functional (linear or geometric) or explicit. I am thinking the code here (cfgrib) would be needed to unpack the coordinates so they can be stored directly in the reference file.
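
For concreteness, a hedged sketch of what such a reference mapping could look like, in the fsspec-reference-maker style (all names, shapes and byte ranges below are made up):

```python
import fsspec

# Zarr metadata is stored inline; the chunk key points at a byte range
# inside a remote grib2 file ([url, offset, length]).  In practice the
# chunk would also need a grib-aware codec registered with numcodecs,
# since "compressor": null cannot decode grib-packed bytes.
references = {
    ".zgroup": '{"zarr_format": 2}',
    "t2m/.zarray": (
        '{"shape": [1059, 1799], "chunks": [1059, 1799], "dtype": "<f8", '
        '"compressor": null, "filters": null, "fill_value": null, '
        '"order": "C", "zarr_format": 2}'
    ),
    "t2m/0.0": ["s3://some-bucket/model.grib2", 922, 1478221],
}

# The mapping can then be mounted as a filesystem and handed to zarr.
fs = fsspec.filesystem("reference", fo=references, remote_protocol="s3")
```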

It may be that GRIB is complex enough that pointing zarr directly at the binary pixel data is not feasible, but the case is similar to the many-HDF5-files one that zarr could address, except with the original loader (i.e., this project) doing the work at the chunk level. cc @DPeterK

DPeterK commented 3 years ago

Fully in support of this proposal. The importance of GRIB2 as a file format in the weather community should not be underestimated either: it's the WMO's standard file format for storing and sharing weather data.

It's been a while since I've done anything with GRIB2, but I think it should definitely be possible to represent the contents of GRIB2 files in Zarr. The structure of the file format itself is also pretty simple: a file is made up of one or more messages, where each message is a 2D horizontal field (lat/lon, say), and all the fields in the file can be stacked to make nD arrays of one or more weather phenomena (such as air temperature). Each message in turn has a well-described metadata header composed of seven sections, followed by the data points for the message -- my point being that byte-range requests should be programmatically achievable given the internal structure of GRIB2 files.
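
That structure is enough to locate whole messages without any GRIB library at all: Section 0 (the fixed 16-byte indicator section) starts with the magic bytes "GRIB" and ends with the total message length as a big-endian 64-bit integer. A minimal sketch, assuming a local edition-2 file with a hypothetical name:

```python
import struct

def scan_messages(path):
    """Yield (offset, length) for each GRIB2 message in a local file."""
    with open(path, "rb") as f:
        data = f.read()
    pos = 0
    while True:
        pos = data.find(b"GRIB", pos)
        if pos == -1 or pos + 16 > len(data):
            return
        edition = data[pos + 7]           # byte 8 of Section 0
        if edition == 2:
            (length,) = struct.unpack(">Q", data[pos + 8:pos + 16])
            yield pos, length
            pos += length                 # jump to the next message
        else:
            pos += 4                      # GRIB1 or a false positive

for offset, length in scan_messages("example.grib2"):  # hypothetical file
    print(f"message at byte {offset}, {length} bytes long")
```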

The challenge, to some extent, is as @martindurant notes - the metadata header is quite hard to interpret. Each of the seven sections has a well-known standard length and is filled with number:value pairs. However, the meaning of each value changes depending on earlier values, so determining the meaning of each value requires interrogation of lookup tables. For example, the grid definition section of the metadata header (section 3) defines how the horizontal grid of the message should be interpreted or constructed, and the type of horizontal grid (lat/lon, rotated lat/lon, Mercator etc...) is defined by the value of one of the numbers early on in the header section.

Hope this helps! (And isn't just saying what everyone already knows...) Happy to provide more information and help out with this idea if that would be valuable. And thanks @martindurant for the CC!

martindurant commented 3 years ago

So, @DPeterK, you might just be the best person to attempt this... Assuming the metadata is identical for a particular variable of the dataset, can one assume that the byte offsets to the pixel data are identical too? The HRRR files have a regular path scheme like "s3://noaa-hrrr-pds/TMP/2 m HGHT/20210113/2200/016", and I assume we can infer the file's coordinate position in an ensemble dataset from the path alone.

So the question is: can we reliably make a mapping of filepath/offset/size to a binary block (each containing a single 2D slice), and know at what index position that block should sit within a single dataset containing all the files for that variable? Of course, I am assuming that we can decode the binary blob into an in-memory array, which may mean implementing JPEG, etc., in numcodecs - another problem. Looking through the cfgrib code, it seems that everything is in place up until the call to eccodes.codes_grib_new_from_file, which is opaque and expects a local file pointer.
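
On the path question, a hedged sketch of how the position along a time axis might be inferred, assuming the path layout quoted above (the regex and helper are illustrative, not a documented scheme):

```python
import re
from datetime import datetime, timedelta

# Matches ".../<YYYYMMDD>/<HHMM cycle>/<forecast hour>" path endings.
PATH_RE = re.compile(r"/(\d{8})/(\d{4})/(\d{3})$")

def time_index(path, valid_times):
    """Map a file path to its index along a precomputed valid-time axis."""
    date, cycle, lead = PATH_RE.search(path).groups()
    run = datetime.strptime(date + cycle, "%Y%m%d%H%M")
    return valid_times.index(run + timedelta(hours=int(lead)))

# e.g. time_index("s3://noaa-hrrr-pds/TMP/2 m HGHT/20210113/2200/016", axis)
# would place that file 16 hours after the 2021-01-13 22:00 model run.
```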

cdibble commented 3 years ago

I want to voice support for this proposal. I would find it very useful to be able to open GRIB files stored in AWS S3 or from an in-memory bytes object.

TomAugspurger commented 2 years ago

What are the actual work items here? I see that @martindurant has https://github.com/fsspec/kerchunk/pull/27 and @rsignell-usgs has https://nbviewer.org/gist/rsignell-usgs/ae7bfb16e8a4049bb1ca0379805c4dc2. Is there anything more to do to support grib2 through this reference filesystem approach?

martindurant commented 2 years ago

Yes, I would say that grib2 is "done" for kerchunk in a usable way.

Originally we had hoped to be able to extract just the binary array from each "message" (part of a grib2 file) rather than having to download a whole message and still rely on cfgrib to read it. That would be nice, but I'm not aware of anyone working on it.
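
For anyone landing here, the usable workflow looks roughly like the sketch below (the HRRR URL is illustrative, and scan_grib's exact signature and return type have shifted between kerchunk versions):

```python
import fsspec
import xarray as xr
from kerchunk.grib2 import scan_grib

# Scan a remote grib2 file once; recent kerchunk versions return a list
# of reference sets, one per GRIB message group.
refs = scan_grib(
    "s3://noaa-hrrr-bdp-pds/hrrr.20210113/conus/hrrr.t22z.wrfsfcf16.grib2",
    storage_options={"anon": True},
)

# Mount one reference set; at read time the actual bytes are served by
# S3 range requests and decoded through cfgrib/eccodes.
fs = fsspec.filesystem(
    "reference", fo=refs[0],
    remote_protocol="s3", remote_options={"anon": True},
)
ds = xr.open_dataset(
    fs.get_mapper(""), engine="zarr",
    backend_kwargs={"consolidated": False},
)
```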

TomAugspurger commented 2 years ago

Gotcha, thanks. So currently building the reference file with kerchunk requires downloading the grib file locally, but once you have the offsets you can read the bytes directly with HTTP range requests?

I agree that reading messages from a remote URL using cfgrib would still be useful (https://github.com/ecmwf/cfgrib/issues/188) but I don't know how much work that would be.

martindurant commented 2 years ago

> So currently building the reference file with kerchunk requires downloading the grib file locally

Yes.

> once you have the offsets you can read the bytes directly with HTTP range requests

You fetch the bytes range of the whole message, containing attributes, coordinates and the actual data. The first two of these ought to pack very nicely, and the majority of the bytes should be the bytes you actually want. However, when reading the message with cfgrib, it will unpack the coordinates into potentially large arrays in memory; those are immediately dropped after read.
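
In other words, the runtime read is a single range request per message, something like this sketch (URL, offset and length are placeholders that would come from the reference file):

```python
import fsspec

fs = fsspec.filesystem("https")
url = "https://example.com/forecast.grib2"   # hypothetical
offset, length = 922, 1_478_221              # from the reference file

# One range request fetches the whole message: header sections,
# coordinates and the packed data values.
msg = fs.cat_file(url, start=offset, end=offset + length)

# eccodes can then decode the message from memory, e.g. via
# eccodes.codes_new_from_message(msg), without any local file.
```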

DPeterK commented 2 years ago

@martindurant I definitely missed a GitHub notification... a year ago 😑 FWIW, then: I think you can make the assumption of constant byte offsets for particular variables of the same dataset. The header of each section of a GRIB message should have a consistent length, as all key/value pairs in the section must be present, with an empty/null value set if that element isn't specified. And assuming that all variables of the dataset have the same size, the data bytes should be the same length too - but the risk is that they aren't, for some reason: a different datatype, an extra coordinate, or extra metadata - precisely the kind of things that make interacting with data challenging in the first place...

One way you could make the assumption more robust is to read the first key/value pair of each section header, which gives the total length (including the header) of the current section. That way you'd know the precise byte range of the data for each section of each message. Doing this would allow you to make that mapping between files, offsets and bytes of data - albeit on a per-file basis.
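
Since every section after Section 0 begins with its own 4-byte big-endian length followed by a 1-byte section number, that walk is only a few lines; a sketch, given the bytes of one message:

```python
import struct

def section_offsets(message):
    """Yield (section_number, offset, length) for one GRIB2 message.

    Section 0 is a fixed 16 bytes; the message ends with b"7777".
    Section 7 is the data section whose byte range we ultimately want.
    """
    pos = 16
    while message[pos:pos + 4] != b"7777":
        (length,) = struct.unpack(">I", message[pos:pos + 4])
        number = message[pos + 4]
        yield number, pos, length
        pos += length
```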

martindurant commented 2 years ago

@DPeterK - the way I am calling cfgrib, I am not touching any of the internals of the message, so I cannot get the real file offsets of anything except whole messages. I don't understand the cfgrib API well enough to know whether I can do what you are saying - but it would be great! We download and scan every grib file anyway, and storing such offsets is exactly what kerchunk is for. If we can get the offset, size and encoding of the data portion of a message, then we don't need cfgrib at all at runtime, and things get much simpler and more efficient.
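
If that ever lands, the runtime piece could be as small as a numcodecs codec that turns one message's bytes into an array; a hedged sketch using eccodes (the class name and reshape logic are illustrative, not kerchunk's actual codec):

```python
import eccodes
import numpy as np
from numcodecs.abc import Codec
from numcodecs.registry import register_codec

class GribMessageCodec(Codec):
    """Decode the raw bytes of a single GRIB2 message into a 2D array."""

    codec_id = "grib_message_sketch"  # hypothetical identifier

    def decode(self, buf, out=None):
        handle = eccodes.codes_new_from_message(bytes(buf))
        try:
            values = eccodes.codes_get_array(handle, "values")
            ny = eccodes.codes_get(handle, "Ny")
            nx = eccodes.codes_get(handle, "Nx")
            return np.asarray(values).reshape(ny, nx)
        finally:
            eccodes.codes_release(handle)

    def encode(self, buf):
        raise NotImplementedError("read-only codec")

register_codec(GribMessageCodec)
```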