Closed meridionaljet closed 1 year ago
Fixed the failing code format check
Patch coverage: 100.00
% and project coverage change: +0.03
:tada:
Comparison is base (
2b2e190
) 95.62% compared to head (001f003
) 95.65%.
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
Many thanks @meridionaljet , this is a really nice improvement - I just added a couple of comments above, then I think we're close to merging it in!
Requested tweaks by @iainrussell have been implemented
Thank you @meridionaljet ! I really like this improvement, and the fact that you added documentation, a test, and also a way to disable it in case of it being used as part of a long-running server. Thanks also for being patient with my suggestions and taking them on board, I think this solution is nice because it works 'out of the box' and does not have the risk of corrupted xarrays if the incoming GRIB file has multiple geometries. Thanks again!
For kerchunk's use, we would really most like to simple not calculate coordinates at all, as we can store them elsewhere. If it were possible, then, to just skip the bytes that define the geometry to the actual measurements in a given message, all the better. Do you think this is possible?
Hi @martindurant, could you create a new issue for this use case please? It would be good to see an example of a GRIB file and how you would like the resulting xarray to look. It's not clear if you want to remove all the coordinates, including the time and vertical dimensions, and if this is for performance, memory or aesthetics. So if if it really would be useful, pop it in another issue and we can discuss there! Cheers, Iain
Edit: cfgrib 0.9.11.0 incorporating these changes has now been released! :grinning:
~This pull_request greatly increases the speed of our workflow. However, installing from source is somewhat of a hassle. @iainrussell, I see you are recently doing work on this repository again. Are there plans for a new release soon? It would greatly help us, and I am sure a lot of other people using grib files and xarray 😄 .~
~(I couldn't really think of another place to ask this, so I hope this way is OK.)~
This is an updated implementation of #338 , addressing a massive performance bottleneck when opening a GRIB file as an
xarray
dataset. Currently,cfgrib
callscfgrib.dataset.build_geography_coordinates()
for every parameter in the index when creating a dataset. Each call requireseccodes
'sgrib_get_array
to be called, which reads coordinate arrays from disk. This is prohibitively expensive for large files with many records, and almost always unnecessary since GRIB files typically have identical grids for each record.This pull request introduces automatic caching of geographic coordinate data by default when calling
cfgrib.open_dataset()
orcfgrib.open_datasets()
. The caching logic is embedded intocfgrib.dataset.build_variable_components()
, utilizing the md5sum of the Grid Definition Section of the GRIB file (thanks @iainrussell for that suggestion).This approach reduces the
cfgrib.open_dataset()
time for a 262MB HRRR file from NCEP from 3.4 seconds to 45 milliseconds on my machine. If the full 400MB HRRR file with 43 different hypercube types is opened withcfgrib.open_datasets()
, the time taken is reduced from 38 seconds to 2 seconds. This thus results in a speedup of 1-2 orders of magnitude, depending on the size of the file and the number of unique hypercubes.The only possible negative side effect that I can see is a small one: the cache must be implemented globally and thus can theoretically grow unboundedly in a long-lived application wherein
cfgrib
opens many different grid geometries. I have thus included a mechanism for the user to opt out of coordinate caching by passingcache_geo_coords=False
tobackend_kwargs
. Practically, this should be a rare need, since the total data size would cause memory issues for a typical user long before the coordinate cache would, and most workflows read a small number of unique grid geometries.The speedup offered here releases a significant bottleneck in data processing workflows using
xarray
andcfgrib
, especially for large files, makingxarray
dataset creation for GRIB almost as cheap as it is for other data formats likeNetCDF
andzarr
.