Open martindurant opened 1 year ago
The current kerchunk grib decoder: https://github.com/fsspec/kerchunk/blob/main/kerchunk/codecs.py#L87 (eccodes, not cfgrib)
Hi Martin,
Sorry for the delay in getting back to you, very busy :)
To be honest, I'm not sure that cfgrib will help you here. kerchunk's current implementation using eccodes directly looks ok to me with one exception, which I'll come to!
cfgrib will always try to generate lat/lons, as its purpose is to create a hypercube that includes the geographical information where possible. I'm not sure what extra value cfgrib gives you over the current eccodes-based implementation, even if we removed the geometry.
As for the problem I see with your current implementation, it looks like you are using the default eccodes missing value. This is set, unfortunately, to 9999, which means that any missing values in the GRIB will be returned as 9999. This of course could clash with valid values in the data. The missingValue key in eccodes is actually writable. What we do in cfgrib is to set it like this:
self["missingValue"] = np.finfo(np.float32).max
Now, when you ask for the values, any missing values will be returned as np.finfo(np.float32).max, which should not clash with any data.
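To make the clash concrete, here is a minimal NumPy-only sketch. It makes no ecCodes calls; `mask_missing` and the sample arrays are hypothetical, and the 9999 default is taken from the comment above. It only illustrates why a sentinel inside the data range is ambiguous while `np.finfo(np.float32).max` is not.

```python
import numpy as np

DEFAULT_MISSING = np.float32(9999.0)      # default sentinel, as described above
SAFE_MISSING = np.finfo(np.float32).max   # value cfgrib writes to missingValue


def mask_missing(values, missing_value):
    """Hypothetical helper: hide every entry equal to the sentinel."""
    values = np.asarray(values, dtype=np.float32)
    return np.ma.masked_equal(values, np.float32(missing_value))


# A field in which 9999.0 is a legitimate data value (made-up numbers).
true_data = np.array([9998.0, 9999.0, 10000.0], dtype=np.float32)
is_missing = np.array([False, False, True])

# What a decoder would hand back under each choice of sentinel.
decoded_default = np.where(is_missing, DEFAULT_MISSING, true_data)
decoded_safe = np.where(is_missing, SAFE_MISSING, true_data)

masked_default = mask_missing(decoded_default, DEFAULT_MISSING)
masked_safe = mask_missing(decoded_safe, SAFE_MISSING)

# With 9999 the valid middle point is masked as well; with float32 max
# only the genuinely missing point is masked.
print(masked_default.mask.tolist())
print(masked_safe.mask.tolist())
```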
By the way, 'kerchunk' is a fantastic name :)
> cfgrib will always try to generate lat/lons
I should say, I am basically ignorant of what eccodes does. I don't even know if it produces the geometry proactively, but I would imagine "no", since cfgrib has the option now to intercept it. We could probably establish the case by looking at memory monitoring. I also don't know whether the geometry and other metadata definitions ever make up a significant fraction of the bytes of a message on-disk (as opposed to the actual variable values, the payload).
Thanks for the tip about missing values. We can fix that. (@emfdavid, in case you have come across this)
> By the way, 'kerchunk' is a fantastic name
Not everyone agrees, but I'm glad you like it.
Hi Martin,
In case it helps, here's what happens with the geometry: a GRIB message contains only a scant description of the geometry. For a regular lat/lon grid for example, it contains the N/S/E/W bounds plus the lat/lon increments in degrees (plus a scanning mode, but that's a detail). So it's literally just a few bytes on disk. cfgrib then asks the ecCodes library for the list of latitudes and longitudes for the grid; ecCodes then computes them from the description I mentioned a moment ago. So any cost is not in terms of disk access, it is a little computational power to compute the lists of lats and lons, plus the memory to store them.
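As a rough illustration of that trade-off, the sketch below expands a regular lat/lon description into coordinate arrays with NumPy. It is not ecCodes' algorithm: `regular_latlon_axes` is a hypothetical helper that assumes north-to-south, west-to-east scanning with no wrap-around, and the 1-degree global grid is made up.

```python
import numpy as np


def regular_latlon_axes(north, south, west, east, dlat, dlon):
    """Hypothetical helper: expand bounds plus increments into 1-D axes.

    Assumes rows run north to south and columns west to east; real GRIB
    scanning modes can differ.
    """
    nlat = int(round((north - south) / dlat)) + 1
    nlon = int(round((east - west) / dlon)) + 1
    lats = np.linspace(north, south, nlat)
    lons = np.linspace(west, east, nlon)
    return lats, lons


# Six numbers describe the grid on disk...
lats, lons = regular_latlon_axes(90.0, -90.0, 0.0, 359.0, 1.0, 1.0)

# ...but the expanded per-point coordinates cost compute and memory.
lat2d, lon2d = np.meshgrid(lats, lons, indexing="ij")
print(lat2d.shape, lat2d.nbytes + lon2d.nbytes, "bytes for the 2-D coordinates")
```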
However, the more I think about it, the more I can see that many computations do not require the lats and lons (e.g. simply computing a monthly mean across all points regardless of their location), so I can see the option to turn geographical coordinate generation off as being generally useful. I will look into it! In the meantime, I wish you a good weekend!
As for 'kerchunk', I guess whether people like the name depends on whether they played a certain marble-based game in their childhood...!
Is your feature request related to a problem? Please describe.
No response
Describe the solution you'd like
Following from https://github.com/ecmwf/cfgrib/pull/341#issuecomment-1589541971
kerchunk is a library for extracting the constituent data buffers from various data storage formats, and organising them into zarr datasets, potentially across many input files. By "extract", I mean: find the exact byte-range and write this to a "references" file, so that the logical zarr dataset created does not need to duplicate any of the data, which remains in-situ.
Kerchunk's GRIB2 support currently regards each grib message as a "chunk" in this sense, and a whole message is loaded and decoded by eccodes for each chunk - we save the start/end of each message. Actually, the coordinates of all of the chunks have already been considered at this point, and the location of each chunk in the overall dataset determined, so the coordinates portion of the message (as opposed to the actual variable payload) is unnecessary.
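For readers unfamiliar with the idea, here is a minimal pure-Python sketch of "one message, one chunk" referencing. It is not kerchunk's actual scanner: `scan_grib2_messages`, `make_refs`, `fake_message`, the variable name `t2m`, the chunk-key pattern and the URL are all hypothetical. It relies only on the GRIB2 indicator section (the `GRIB` magic, the edition number in octet 8, the total message length in octets 9-16) and the `7777` end marker, and it omits the zarr metadata entries a real references file would also need.

```python
import struct


def scan_grib2_messages(buf):
    """Yield (offset, length) for each GRIB edition 2 message found in buf."""
    pos = 0
    while True:
        start = buf.find(b"GRIB", pos)
        if start < 0 or start + 16 > len(buf):
            return
        edition = buf[start + 7]
        (length,) = struct.unpack_from(">Q", buf, start + 8)
        end = start + length
        # Accept only edition 2 messages that end with the "7777" marker.
        if edition != 2 or length < 20 or end > len(buf) or buf[end - 4:end] != b"7777":
            pos = start + 4
            continue
        yield start, length
        pos = end


def make_refs(url, buf, var="t2m"):
    """Hypothetical helper: one [url, offset, length] reference per message."""
    refs = {}
    for i, (offset, length) in enumerate(scan_grib2_messages(buf)):
        refs[f"{var}/{i}.0.0"] = [url, offset, length]
    return {"version": 1, "refs": refs}


def fake_message(payload):
    """Build a synthetic message: 16-byte indicator, opaque body, end marker."""
    body = payload + b"7777"
    header = b"GRIB" + b"\x00\x00" + b"\x00" + b"\x02" + struct.pack(">Q", 16 + len(body))
    return header + body


buf = b"junk" + fake_message(b"a" * 10) + fake_message(b"b" * 5)
messages = list(scan_grib2_messages(buf))
refs = make_refs("s3://example-bucket/example.grib2", buf)
print(messages)
print(refs["refs"])
```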
We would like, if possible, to extract the byte range of the actual payload rather than a whole message. I appreciate that since #341, the coordinates no longer will be constructed for each chunk, but it would be nice not to even download the bytes that define it. This may also allow simpler decoding of the payload.
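A sketch of what locating the payload could look like, assuming the standard GRIB2 layout in which sections 1 to 7 each begin with a 4-byte big-endian length and a 1-byte section number, and section 7 holds the packed data. `find_data_section`, `section` and `build_message` are hypothetical names, the message is synthetic, and only the first section 7 is returned even though a GRIB2 message may repeat sections. Note that decoding the payload on its own would still need the packing parameters from section 5 and any bitmap from section 6, so those would have to be captured at scan time.

```python
import struct


def find_data_section(msg):
    """Return (offset, length) of the section 7 data bytes within one message."""
    pos = 16  # skip the 16-byte indicator section
    while msg[pos:pos + 4] != b"7777":
        if pos + 5 > len(msg):
            raise ValueError("ran past the end of the message")
        (sec_len,) = struct.unpack_from(">I", msg, pos)
        sec_num = msg[pos + 4]
        if sec_len < 5:
            raise ValueError("invalid section length")
        if sec_num == 7:
            # The packed values start after the 5-byte section header.
            return pos + 5, sec_len - 5
        pos += sec_len
    raise ValueError("no data section found")


def section(num, body):
    """Build a synthetic section: length, section number, opaque body."""
    return struct.pack(">IB", len(body) + 5, num) + body


def build_message(sections):
    """Wrap synthetic sections in an indicator section and end marker."""
    body = b"".join(sections) + b"7777"
    header = b"GRIB" + b"\x00\x00" + b"\x00" + b"\x02" + struct.pack(">Q", 16 + len(body))
    return header + body


msg = build_message([
    section(1, b"i" * 16),   # identification (placeholder bytes)
    section(3, b"g" * 67),   # grid definition, i.e. the geometry
    section(4, b"p" * 29),   # product definition
    section(5, b"r" * 16),   # data representation
    section(6, b"\xff"),     # bitmap indicator
    section(7, b"PAYLOAD"),  # packed data values
])
offset, length = find_data_section(msg)
print(offset, length, msg[offset:offset + length])
```

With a range like this, a reference could point at `offset` and `length` relative to the message start instead of at the whole message.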
Describe alternatives you've considered
No response
Additional context
No response
Organisation
anaconda, fsspec, zarr, pangeo