ecmwf / cfgrib

A Python interface to map GRIB files to the NetCDF Common Data Model following the CF Convention using ecCodes
Apache License 2.0

Only read payload buffer #343

Open martindurant opened 1 year ago

martindurant commented 1 year ago

Is your feature request related to a problem? Please describe.

No response

Describe the solution you'd like

Following from https://github.com/ecmwf/cfgrib/pull/341#issuecomment-1589541971

kerchunk is a library for extracting the constituent data buffers from various data storage formats and organising them into zarr datasets, potentially across many input files. By "extract", I mean: find the exact byte range and write it to a "references" file, so that the logical zarr dataset created does not need to duplicate any of the data, which remains in situ.
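To make the idea of a "references" file concrete, here is a minimal sketch. It assumes the kerchunk version 1 layout, in which each chunk key maps to `[url, offset, length]`; the URL, variable name, and byte offsets are invented for illustration.

```python
import json

# Hypothetical values: the URL, variable name, and offsets are made up.
refs = {
    "version": 1,
    "refs": {
        # Small metadata is stored inline as a JSON string.
        ".zgroup": json.dumps({"zarr_format": 2}),
        # A chunk is only a pointer into the original file: [url, offset, length].
        "t2m/0.0.0": ["s3://example-bucket/forecast.grib2", 0, 1_000_000],
        "t2m/1.0.0": ["s3://example-bucket/forecast.grib2", 1_000_000, 1_000_200],
    },
}


def byte_range(entry):
    """Return the (start, end) byte range of a chunk reference; end is exclusive."""
    _url, offset, length = entry
    return offset, offset + length


# The references serialise to plain JSON; no array data is copied.
references_json = json.dumps(refs, indent=2)
```

Reading a chunk then amounts to one ranged request for `byte_range(...)` against the original file.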

Kerchunk's GRIB2 support currently regards each grib message as a "chunk" in this sense, and a whole message is loaded and decoded by eccodes for each chunk - we save the start/end of each message. Actually, the coordinates of all of the chunks have already been considered at this point, and the location of each chunk in the overall dataset determined, so the coordinates portion of the message (as opposed to the actual variable payload) is unnecessary.

We would like, if possible, to extract the byte range of the actual payload rather than a whole message. I appreciate that since #341, the coordinates no longer will be constructed for each chunk, but it would be nice not to even download the bytes that define it. This may also allow simpler decoding of the payload.
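As a rough sketch of what "the byte range of the actual payload" could mean for GRIB2: assuming the standard edition 2 layout (a 16-byte Indicator Section, then sections that each start with a 4-byte big-endian length and a 1-byte section number, then the `7777` end marker), the Data Section (section 7) can be located by walking the section headers. The function name is hypothetical and this is not how kerchunk or cfgrib currently work. Note also that decoding section 7 on its own is probably not enough, since the packing parameters live in section 5 and any bitmap in section 6.

```python
import struct


def grib2_data_sections(msg: bytes):
    """Return (offset, length) for each Data Section (section 7) in one GRIB2 message.

    Offsets are relative to the start of the message and cover the whole
    section, including its 5-byte header.
    """
    if msg[:4] != b"GRIB" or msg[7] != 2:
        raise ValueError("not a GRIB edition 2 message")
    # Octets 9-16 of the Indicator Section hold the total message length.
    total_length = struct.unpack(">Q", msg[8:16])[0]
    pos = 16  # the Indicator Section is always 16 bytes long
    found = []
    while pos < total_length - 4:
        length, number = struct.unpack(">IB", msg[pos:pos + 5])
        if length < 5:
            raise ValueError("corrupt section header")
        if number == 7:
            found.append((pos, length))
        pos += length
    if msg[pos:pos + 4] != b"7777":
        raise ValueError("missing end section")
    return found
```

A message may repeat sections 2 to 7 for several fields, which is why the sketch returns a list rather than a single range.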

Describe alternatives you've considered

No response

Additional context

No response

Organisation

anaconda, fsspec, zarr, pangeo

martindurant commented 1 year ago

The current kerchunk grib decoder: https://github.com/fsspec/kerchunk/blob/main/kerchunk/codecs.py#L87 (eccodes, not cfgrib)

iainrussell commented 1 year ago

Hi Martin,

Sorry for the delay in getting back to you, very busy :)

To be honest, I'm not sure that cfgrib will help you here. kerchunk's current implementation using eccodes directly looks ok to me with one exception, which I'll come to!

cfgrib will always try to generate lat/lons, as its purpose is to create a hypercube that includes the geographical information where possible. I'm not sure what extra value cfgrib gives you over the current eccodes-based implementation, even if we removed the geometry.

As for the problem I see with your current implementation, it looks like you are using the default eccodes missing value. This is set, unfortunately, to 9999, which means that any missing values in the GRIB will be returned as 9999. This of course could clash with valid values in the data. The missingValue key in eccodes is actually writable. What we do in cfgrib is to set it like this:

self["missingValue"] = np.finfo(np.float32).max

Now, when you ask for the values, any missing values will be returned as np.finfo(np.float32).max, which should not clash with any data.
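For a decoder that calls eccodes directly, the same idea might look like the sketch below. The helper names `mask_missing` and `read_values` are hypothetical, and `read_values` is untested here: it assumes the ecCodes Python bindings are installed and uses only `codes_new_from_message`, `codes_set`, `codes_get_values`, and `codes_release`.

```python
import numpy as np

# Sentinel used by cfgrib according to the comment above, kept as a Python float.
MISSING = float(np.finfo(np.float32).max)


def mask_missing(values, sentinel=MISSING):
    """Replace sentinel entries with NaN so they cannot be mistaken for data."""
    values = np.asarray(values, dtype="float64")
    return np.where(values == sentinel, np.nan, values)


def read_values(message: bytes):
    """Decode one GRIB message with ecCodes, overriding the default missing value."""
    import eccodes  # imported lazily; needs the ecCodes Python bindings

    gid = eccodes.codes_new_from_message(message)
    try:
        # Must be set before the values are requested.
        eccodes.codes_set(gid, "missingValue", MISSING)
        return mask_missing(eccodes.codes_get_values(gid))
    finally:
        eccodes.codes_release(gid)
```

With this sentinel, a genuine data value of 9999 passes through untouched, while missing points come back as NaN.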

iainrussell commented 1 year ago

By the way, 'kerchunk' is a fantastic name :)

martindurant commented 1 year ago

> cfgrib will always try to generate lat/lons

I should say, I am basically ignorant of what eccodes does. I don't even know if it produces the geometry proactively, but I would imagine "no", since cfgrib now has the option to intercept it. We could probably establish the case by looking at memory monitoring. I also don't know whether the geometry and other metadata definitions ever make up a significant fraction of the bytes of a message on disk (as opposed to the actual variable values, the payload).

Thanks for the tip about missing values. We can fix that. (@emfdavid , in case you have come across this)

> By the way, 'kerchunk' is a fantastic name

Not everyone agrees, but I'm glad you like it.

iainrussell commented 1 year ago

Hi Martin,

In case it helps, here's what happens with the geometry: a GRIB message contains only a scant description of the geometry. For a regular lat/lon grid for example, it contains the N/S/E/W bounds plus the lat/lon increments in degrees (plus a scanning mode, but that's a detail). So it's literally just a few bytes on disk. cfgrib then asks the ecCodes library for the list of latitudes and longitudes for the grid; ecCodes then computes them from the description I mentioned a moment ago. So any cost is not in terms of disk access, it is a little computational power to compute the lists of lats and lons, plus the memory to store them.
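As an illustration only (this is not ecCodes code), expanding such a description for a regular lat/lon grid could be sketched as follows. The function name is hypothetical, and the sketch ignores the scanning mode by assuming rows run north to south and columns west to east.

```python
def regular_ll_axes(north, south, west, east, dlat, dlon):
    """Expand a regular lat/lon grid description into latitude and longitude axes."""
    nlat = round((north - south) / dlat) + 1
    nlon = round((east - west) / dlon) + 1
    lats = [north - i * dlat for i in range(nlat)]  # north to south
    lons = [west + j * dlon for j in range(nlon)]   # west to east
    return lats, lons


# A global 1-degree grid: six numbers in, 181 latitudes and 360 longitudes out.
lats, lons = regular_ll_axes(90, -90, 0, 359, 1, 1)
```

On a global 0.25-degree grid the same six numbers describe 721 x 1440 = 1,038,240 points, which shows why the cost is in computation and memory rather than in bytes on disk.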

However, the more I think about it, the more I can see that many computations do not require the lats and lons (e.g. simply computing a monthly mean across all points regardless of their location), so I can see the option to turn geographical coordinate generation off as being generally useful. I will look into it! In the meantime, I wish you a good weekend!

As for 'kerchunk', I guess whether people like the name depends on whether they played a certain marble-based game in their childhood...!