Open keltonhalbert opened 1 year ago
If it's helpful, I uploaded the grib2 file to a google drive. It's just under 10 MB in size.
Sorry to ping again - I know I posted the issue on a Friday, so my apologies there. Just wanting to know if there are any ideas about where and how kerchunk is missing a variable in the dataset.
I see
210.1:6501899:d=2021041500:UGRD:10 m above ground:anl:
210.2:6501899:d=2021041500:VGRD:10 m above ground:anl:
but kerchunk is only matching one grib message. I'm not certain what the codes in those two lines mean, but I wonder whether they are components of the SAME message - there are indeed 294 messages in the file. Investigating...
I fear I may need a GRIB expert to figure this one out. The values returned in variable u10 are indeed the same values as found by xarray/cfgrib. The offset and message size are correct, but I don't know how the other component of the vector in the same grib message is found.
Thanks for taking a look at this, I really appreciate it! I'm by no means a grib expert, but I'm trying to track down what makes this encoding different. So far I've stumbled across some grib documentation that mentions velocity fields encoded in sub-messages...
I haven't made much progress beyond this, but I'm trying to poke around more documentation with grib2 and cfgrib to see how this is handled. I'll report back if I find anything else useful.
That does seem like it's talking about the same thing, but I don't see how this is handled in cfgrib/eccodes .
I may have at least narrowed down a rough idea of where in cfgrib this happens, but as far as how to activate/enable it, I'm still trying to untangle.
In eccodes/cfgrib, this is referred to as a multi-field grib file. Within cfgrib/messages.py, there are some functions and conditionals that interact with multi_enabled
- presumably, this is how cfgrib/xarray manages supporting this feature. Still trying to reckon with exactly where and how this should be called/enabled/detected by kerchunk.
Alright, I made some progress here. I have no idea if this is the right way to go about this, but I basically used the logic of the FileStream class in cfgrib.messages, and I can successfully print out all variables.
def main():
import eccodes
import cfgrib
from cfgrib import messages
grbfile = "./rap_252_20210415_0000_000.grb2"
fstream = messages.FileStream(grbfile)
with open(grbfile, 'rb') as f:
with messages.multi_enabled(f):
valid_message_found = False
while True:
try:
msg = fstream.message_from_file(f)
print(msg["cfVarName"])
valid_message_found = True
except EOFError:
if not valid_message_found:
raise EOFError("No valid message found")
break
I'll spare dumping the whole output, but now I see v/v10 variables in the output!
hindex
gh
t
r
w
u
v
t
r
w
u
v
unknown
gh
sp
orog
t
unknown
unknown
sdwe
sde
t2m
pt
sh2
d2m
unknown
papt
r2
u10
v10
prate
unknown
acpcp
sdwe
unknown
unknown
ssrun
bgrun
So, it seems like the weird quirk of multi_enabled
is required for cfgrib to handle the parsing of those messages. I'm not sure if you guys would prefer trying to get to an even lower level of handling this, and I don't know if the caveats in the documentation are deal breakers:
#
# MULTI-FIELD support is very tricky. Random access via the index needs multi support to be off.
#
...
#
# Explicitly reset the multi_support global state that gets confused by random access
#
# @alexamici: I'm note sure this is thread-safe. See :#141
#
However, if there are no objections to changing the iteration logic of scan_grib, I can modify it to follow this procedure and make a pull request.
if there are no objections to changing the iteration logic of scan_grib
I am not opposed to that, except that I had hoped to only need eccodes and not also cfgrib during the access phase. Maybe that doesn't matter, so long as we can still interpret a block of bytes as a message.
The GRIB codec would need to be updated to do something similar. The codec reads whole messages at a time, because we don't know how to decode the interior buffers of a GRIB (sub)message unfortunately. That would mean that we need to tell the codec which submessage we want - assuming this is consistent across all input files/messages - and would end up temporarily loading the bytes for both variables when trying to access either one.
MULTI-FIELD support is very tricky. Random access via the index needs multi support to be off.
is interesting, because kerchunk essentially has its own index and always does random access. As above, if only the "where is the buffer" and "decode this bytes buffer" were accessible calls in the eccodes API, life would be much easier for everyone.
If the desire is to keep things purely in terms of eccodes, then I'll double check and see how doable that is before moving forward with the cfgrib parser. Most of the calls under the hood are to eccodes, so it may be possible to rig it to work directly. This was kind of my brute-force "make it work" approach. Let me see what I can do.
Has there been any progress on this issue? I'm running into the same problem with data in the NAM s3 bucket. I'm unable to access see the v10 data using kerchunk, but the u10 message seems okay.
Eli
@emfdavid , I don't suppose you've come across this kind of thinkg in your grib travels?
Hi @imcslatte - apologies for the slow response. Unfortunately due to time constraints and other responsibilities, I wasn't able to revisit this problem in order to fix it using the eccodes API directly. I have a temporary fix listed above that uses the cfgrib API, which is built on top of eccodes.
I might be able to find some time in the next week or two to investigate fixing this directly from eccodes, now that I've been reminded of this... as I also have some upcoming projects with RAP data where having this issue fixed will be helpful :). If you want to implement the stopgap fix locally in your environment, I'd be happy to help.
I have not seen sub messages in the HRRR or GFS/GEFS grib2 files. Yet another new wrinkle! My adventures in parsing grib files are documented here.
@keltonhalbert @martindurant I was wondering if part of the problem might be that the submessage IDs are decimals and not integers? If they are typed as integer the ids 210.1 and 210.2 would be the same.
Just a thought.
@imcslatte , ooh, so
274:8535142:d=2021041500:VVEL:150-120 mb above ground:anl:
275:8573559:d=2021041500:TMP:180-150 mb above ground:anl:
276:8604398:d=2021041500:RH:180-150 mb above ground:anl:
277.1:8634622:d=2021041500:UGRD:180-150 mb above ground:anl: <== this gets handled
277.2:8634622:d=2021041500:VGRD:180-150 mb above ground:anl: <== this gets dropped
278:8697064:d=2021041500:VVEL:180-150 mb above ground:anl:
279:8734600:d=2021041500:4LFTX:180-0 mb above ground:anl:
280:8767961:d=2021041500:CAPE:180-0 mb above ground:anl:
hoping this is the simple problem @martindurant !
I am not sure that we make any use of the ID value. I think the problem is, that the two arrays are bundled in the same grib message, and I don't know how to tell the cfgrib API "load the second sub-message". If I had a spare year, perhaps I could dig into the internals of grib to understand this, but for now I'll have to rely on people like @mpiannucci !
Yeah the main problem is how GRIB sub-messages get handled. Per the eccodes documentation, sub-messages are not a feature that is recommended to be used, but NCEP has been doing this for RAP and NAM products (not HRRR or GFS though) for a while .
The reason it works in cfgrib but not kerchunk is because cfgrib incorporates special eccodes logic in order to account for grib sub-messages, where kerchunk does not implement said logic. As far as I can tell, the bulk of the cfgrib calls to eccodes that kerchunk needs to duplicate are contained within the messages file.
#
# MULTI-FIELD support is very tricky. Random access via the index needs multi support to be off.
#
eccodes.codes_grib_multi_support_off()
@contextlib.contextmanager
def multi_enabled(file: T.IO[bytes]) -> T.Iterator[None]:
"""Context manager that enables MULTI-FIELD support in ecCodes from a clean state"""
eccodes.codes_grib_multi_support_on()
#
# Explicitly reset the multi_support global state that gets confused by random access
#
# @alexamici: I'm note sure this is thread-safe. See :#141
#
eccodes.codes_grib_multi_support_reset_file(file)
try:
yield
except Exception:
eccodes.codes_grib_multi_support_off()
raise
eccodes.codes_grib_multi_support_off()
The multi_enabled
function gets called when iterating over the number of messages in a file:
@attr.attrs(auto_attribs=True)
class Message(abc.MutableField):
"""Dictionary-line interface to access Message headers."""
codes_id: int
encoding: str = "ascii"
errors: str = attr.attrib(
default="warn", validator=attr.validators.in_(["ignore", "warn", "raise"])
)
@classmethod
def from_file(cls, file, offset=None, **kwargs):
# type: (T.IO[bytes], T.Optional[OffsetType], T.Any) -> Message
field_in_message = 0
if isinstance(offset, tuple):
offset, field_in_message = offset
if offset is not None:
file.seek(offset)
codes_id = None
if field_in_message == 0:
codes_id = eccodes.codes_grib_new_from_file(file)
else:
# MULTI-FIELD is enabled only when accessing additional fields
with multi_enabled(file):
for _ in range(field_in_message + 1):
codes_id = eccodes.codes_grib_new_from_file(file)
if codes_id is None:
raise EOFError("End of file: %r" % file)
return cls(codes_id=codes_id, **kwargs)
Of particular note is checking if field_in_message == 0
, which appears to come from wherever the offset
tuple is passed from.
Now, as far as addressing this issue, my understanding is the following:
multi_enabled
within kerchunk. Presumably, that starts with coming up with a means of detecting the presence of multi-field grib messages from the eccodes API, and then inserting it accordingly.I feel pretty confident in being able to come up with a solution, but not confident that it'll be the correct solution. I guess the only way to know for sure is to try something in a fork of the repository. So, my question becomes, do the developers/maintainers have any input before diving head first into a naive solution?
As another potentially helpful datapoint: I tried using kerchunk on ERA5 single level grib files and encounter the same issue (downloaded from ECMWF (https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels) to our network share, but still painfully slow to load, so I thought I would give kerchunk a try).
Here, the grib file contains 4 wind variables, u100
, v100
, u10
and v10
, of which only u100
is picked up by kerchunk.
Furthermore, the file contains a time
dimension of 24 hours, which is recognized by cfgrib, but not by kerchunk:
'time/.zarray': '{"chunks":[],"compressor":null,"dtype":"<i8","fill_value":null,"filters":null,"order":"C","shape":[],"zarr_format":2}',
.
This might be related to https://github.com/fsspec/kerchunk/issues/150, although here time
is missing instead of steps
as this is a reanalysis product.
I would love to help, but unfortunately, I am also not a grib expert. Also happy to open a separate issue is that helps.
I am also not a grib expert
Is anyone? :)
@TAdeJong I am currently facing the same issue where Kerchunk is not able to "detect" the 24 different hours within the file. Have you found a solution to the problem?
@mpiannucci - have you had any chance to work with sub-messages?
I am pretty sure that the copernicus climate store exports single level data in a grib1 file, gribberish cant even read it because it doesnt contain all the grib2 sections when scanning through, when i downloaded a 24 hour file with u100, v100, u10, v10. I may be wrong though, i havent worked with euro data much.
I am also struggling with ECMWF data from the new CDS beta API due to a missing time
dimension. In hopes of getting any "grib expert" a head-start on this, here's a minimal example with a nice tiny 4MB GRIB, also available via Google Drive.
import fsspec
import xarray as xr
from kerchunk.grib2 import GribToZarr
DOWNLOADED_GRIB = "/tmp/downloaded.grib"
def download_grib():
"""You don't actually need to run this. Instead download the grib to the
DOWNLOADED_GRIB. location. This is included for completeness.
"""
import os
import cdsapi
# os.environ["CDSAPI_KEY"] = "xxxxxxxx"
dataset = "reanalysis-era5-single-levels"
request = {
"product_type": ["reanalysis"],
"variable": ["2m_temperature"],
"year": ["2024"],
"month": ["01"],
"day": ["01", "02"],
"time": ["00:00"],
"data_format": "grib",
"area": [90, 0, -90, 360],
}
c = cdsapi.Client(url="https://cds-beta.climate.copernicus.eu/api")
c.retrieve(dataset, request, DOWNLOADED_GRIB)
# download_grib()
ds0 = xr.open_dataset(DOWNLOADED_GRIB, engine="cfgrib")
print("Loaded with xarray:")
print(ds0)
print()
refs_for_messages = GribToZarr(DOWNLOADED_GRIB).translate()
assert len(refs_for_messages) == 1
ref = refs_for_messages[0]
fs = fsspec.filesystem("reference", fo=ref)
m = fs.get_mapper("")
ds = xr.open_zarr(m, consolidated=False).load()
print("Loaded with kerchunk:")
print(ds)
Loaded with xarray:
<xarray.Dataset> Size: 8MB
Dimensions: (time: 2, latitude: 721, longitude: 1440)
Coordinates:
number int64 8B ...
* time (time) datetime64[ns] 16B 2024-01-01 2024-01-02
step timedelta64[ns] 8B ...
surface float64 8B ...
* latitude (latitude) float64 6kB 90.0 89.75 89.5 ... -89.5 -89.75 -90.0
* longitude (longitude) float64 12kB 0.0 0.25 0.5 0.75 ... 359.2 359.5 359.8
valid_time (time) datetime64[ns] 16B ...
Data variables:
t2m (time, latitude, longitude) float32 8MB ...
Attributes:
GRIB_edition: 1
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_subCentre: 0
Conventions: CF-1.7
institution: European Centre for Medium-Range Weather Forecasts
history: 2024-07-31T13:59 GRIB to CDM+CF via cfgrib-0.9.1...
Loaded with kerchunk:
<xarray.Dataset> Size: 8MB
Dimensions: (latitude: 721, longitude: 1440)
Coordinates:
* latitude (latitude) float64 6kB 90.0 89.75 89.5 ... -89.5 -89.75 -90.0
* longitude (longitude) float64 12kB 0.0 0.25 0.5 0.75 ... 359.2 359.5 359.8
number int64 8B 0
step timedelta64[ns] 8B 00:00:00
surface float64 8B 0.0
time datetime64[ns] 8B 2024-01-01
valid_time datetime64[ns] 8B 2024-01-01
Data variables:
t2m (latitude, longitude) float64 8MB 246.6 246.6 ... 248.0 248.0
Attributes:
GRIB_centre: ecmf
GRIB_centreDescription: European Centre for Medium-Range Weather Forecasts
GRIB_edition: 1
GRIB_subCentre: 0
institution: European Centre for Medium-Range Weather Forecasts
Note how for the same file when loaded with xarray, time
is a dimension of length 2, while with kerchunk it's a coordinate of length 1.
@mpiannucci , interestingly, gribberish doesn't parse the given file at all, but gives
ValueError: No valid GRIB messages found
This might be an interesting test case for you. I'd say there's a better chance, in the long run, of getting the sub-messages out of gribberish than eccodes (which requires the global "multi" state for reading this).
Well, I found some time to try and give this a good, long wrestle and I have come to the conclusion that the problem exists within eccodes.codes_new_from_message
. See this issue I opened with eccodes-python for more detail.
In short, eccodes.codes_grib_multi_support_on()
does not appear to have any influence on eccodes.codes_new_from_message
, meaning that streaming bytes with multi-field enabled grib2 messages may not be possible. I am hoping I am either 1) incredibly wrong, and the folks at ECMWF can clarify how to handle this with eccodes, or 2) maybe this can be fixed/addressed and make everyone happy. However, I'm just not sure where to go from here.
Some progress that I made is that scan_grib can now at least recognize the presence of the v10 arrays within a multi-field message, and can be tested in my personal fork. This was only achievable by using the fsspec file handler and passing it to eccodes.codes_new_from_file
. The catch is, however, that when decoding/reading the arrays from grib2, the u10 and v10 arrays are identical because Kerchunk doesn't know where the v10 array starts. Presumably, this is because things get handled by codecs.py/GRIBCodec
, which expects to be given a grib2 message as a buffer (just like the current version of scan_grib).
It appears to me that changing GRIBCodec to take a fsspec file is undesirable for multiple reasons, especially since byte streaming is kind of the whole point... but I'm at a loss for what else to do if the problem can't be remedied from eccodes side. I'm certainly out of my depth and grasping at straws to make sense of things, so if anyone with more knowledge and insight into the awfulness of the grib2 world can provide a means of encoding the appropriate array offset for the multi-field arrays, please please PLEASE speak up!
Moving forward, I see a few options...
If we get stuck with option 3, we need to figure out how to preserve what is effectively byte streaming, while tricking eccodes into thinking it's getting a file. Unfortunately, you cannot just pass eccodes a BytesIO object to achieve the desired behavior. I don't know the core of fsspec very well, but I know it was intended to handle remote file streaming, so perhaps the GRIBCodec class needs to be refactored to use fsspec rather than bytes? Input from the maintainers is appreciated...
Edit: One more idea, is that if we know how to brute-force the byte offsets to read the appropriate array, that could work. Unfortunately, there are no grib2 fields/IDs that broadcast the presence of multi-field messages, at least in the eccodes API and certainly not encoded in the file metadata that I can tell.
so perhaps the GRIBCodec class needs to be refactored to use fsspec rather than bytes?
From what you have linked, the C code in eccodes wants to use a file descriptor, i.e., real local open file. That means, fsspec can't do it (except to copy the bytes to a local temporary location, RAMdisk or something). How strange that they should have a completely different way to handle a bytes buffer versus a file!
Unfortunately, there are no grib2 fields/IDs that broadcast the presence of multi-field messages
You said you had code to detect this? I don't see from the commit. But if knowing beforehand is enough, we can store the fact in the init parameters for the grib codec.
@martindurant Sorry for the confusion regarding the commit - that code doesn't explicitly detect the presence of multi-field messages, but by relying on eccodes.codes_new_from_file
with the fsspec handler in the version of scan_grib
in my commit, there is a way to brute-force the detection.
In short, if you keep track of the mid
and offset
variables with something like mid_last
and offset_last
, you can detect a multi-message entry because the mid
will remain the same between loop iterations, but offset
will change. Hopefully that makes some sense? That's the only way I could figure out to detect it so far.
That said, clearly eccodes.codes_new_from_file
is able to detect multi-field grib messages on its own. Perhaps digging into that logic could help provide a means of doing the same for scan_grib
and GRIBCodec
?
It sounds like, we can find multi cases, then, but only if we copy messages to disk first. That's totally doable and what scan_grib did do once upon a time. For decoding at read time, well it would work but be pretty annoying! We could advise users to ensure that their temporary storage is memory-based.
In short, if you keep track of the mid and offset variables with something like mid_last and offset_last, you can detect a multi-message entry because the mid will remain the same between loop iterations, but offset will change. Hopefully that makes some sense? That's the only way I could figure out to detect it so far.
Interesting! Would you mind showing this for an example multi file? I wonder if with the two offsets we have enough information to make two non-multi messages for eccodes at runtime.
I'm encountering an interesting issue where the results of scan_grib differ from interacting with a file via xarray/cfgrib. Particularly, it is not detecting certain variables. Installation information:
In this particular case, the missing variable is the 10 meter V wind component. Using scan_grib:
As you can see, there is no
v10
variable output. Here's the printed output directly from scan_grib, showing that the data is missing here too:However, when I use xarray/cfgrib, the
v10
variable is present and accounted for:Output from wgrib2:
Any suggestions on what I may be doing wrong, or where the issue might lie? Happy to make a PR if there's a bug, just not sure if 1) there is one and 2) where to start addressing it.