[New Implementation] Dramatically speed up dataset creation by caching geographic coordinates

meridionaljet commented 1 year ago

This is an updated implementation of #338, addressing a massive performance bottleneck when opening a GRIB file as an xarray dataset. Currently, cfgrib calls cfgrib.dataset.build_geography_coordinates() for every parameter in the index when creating a dataset. Each call in turn calls ecCodes' grib_get_array, which reads coordinate arrays from disk. This is prohibitively expensive for large files with many records, and almost always unnecessary, since GRIB files typically have identical grids for each record.

This pull request introduces automatic caching of geographic coordinate data by default when calling cfgrib.open_dataset() or cfgrib.open_datasets(). The caching logic is embedded into cfgrib.dataset.build_variable_components(), utilizing the md5sum of the Grid Definition Section of the GRIB file (thanks @iainrussell for that suggestion).

This approach reduces the cfgrib.open_dataset() time for a 262MB HRRR file from NCEP from 3.4 seconds to 45 milliseconds on my machine. If the full 400MB HRRR file with 43 different hypercube types is opened with cfgrib.open_datasets(), the time taken is reduced from 38 seconds to 2 seconds. This is a speedup of 1-2 orders of magnitude, depending on the size of the file and the number of unique hypercubes.
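For reference, the speedup factors implied by the timings quoted above can be worked out directly. This is only arithmetic on the numbers given in this description; the helper name `speedup` is made up for the example.

```python
import math


def speedup(before_s, after_s):
    """Ratio of the old wall-clock time to the new one."""
    return before_s / after_s


# Timings quoted in the description, in seconds.
single = speedup(3.4, 0.045)  # open_dataset, 262MB HRRR file
multi = speedup(38.0, 2.0)    # open_datasets, 400MB HRRR file

print(f"open_dataset:  {single:.1f}x ({math.log10(single):.2f} orders of magnitude)")
print(f"open_datasets: {multi:.1f}x ({math.log10(multi):.2f} orders of magnitude)")
```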

The only possible negative side effect that I can see is a small one: the cache must be implemented globally and thus can theoretically grow unboundedly in a long-lived application wherein cfgrib opens many different grid geometries. I have thus included a mechanism for the user to opt out of coordinate caching by passing cache_geo_coords=False to backend_kwargs. Practically, this should be a rare need, since the total data size would cause memory issues for a typical user long before the coordinate cache would, and most workflows read a small number of unique grid geometries.
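A minimal sketch of the opt-out through xarray, assuming cfgrib is installed as an xarray backend. The `cache_geo_coords` keyword is the one named above; the helper name `open_without_geo_cache` is hypothetical.

```python
# Keyword taken from the description above; everything else is illustrative.
BACKEND_KWARGS = {"cache_geo_coords": False}


def open_without_geo_cache(path):
    """Open a GRIB file with coordinate caching disabled."""
    import xarray as xr  # imported lazily so the sketch loads without xarray

    return xr.open_dataset(path, engine="cfgrib", backend_kwargs=BACKEND_KWARGS)
```

A long-running service that sees many different grid geometries could call `open_without_geo_cache("forecast.grib2")` (file name is a placeholder) and accept the slower open in exchange for not growing the global cache.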

The speedup offered here removes a significant bottleneck in data processing workflows using xarray and cfgrib, especially for large files, making xarray dataset creation for GRIB almost as cheap as it is for other data formats like NetCDF and zarr.

meridionaljet commented 1 year ago

Fixed the failing code format check

codecov-commenter commented 1 year ago

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.03 :tada:

Comparison is base (2b2e190) 95.62% compared to head (001f003) 95.65%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #341      +/-   ##
==========================================
+ Coverage   95.62%   95.65%   +0.03%
==========================================
  Files          26       26
  Lines        2056     2073      +17
  Branches      236      238       +2
==========================================
+ Hits         1966     1983      +17
  Misses         59       59
  Partials       31       31
```

| [Impacted Files](https://app.codecov.io/gh/ecmwf/cfgrib/pull/341?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ecmwf) | Coverage Δ | |
|---|---|---|
| [cfgrib/xarray\_plugin.py](https://app.codecov.io/gh/ecmwf/cfgrib/pull/341?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ecmwf#diff-Y2ZncmliL3hhcnJheV9wbHVnaW4ucHk=) | `88.40% <ø> (ø)` | |
| [cfgrib/dataset.py](https://app.codecov.io/gh/ecmwf/cfgrib/pull/341?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ecmwf#diff-Y2ZncmliL2RhdGFzZXQucHk=) | `98.45% <100.00%> (+0.05%)` | :arrow_up: |
| [tests/test\_40\_xarray\_store.py](https://app.codecov.io/gh/ecmwf/cfgrib/pull/341?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ecmwf#diff-dGVzdHMvdGVzdF80MF94YXJyYXlfc3RvcmUucHk=) | `100.00% <100.00%> (ø)` | |


iainrussell commented 1 year ago

Many thanks @meridionaljet, this is a really nice improvement. I just added a couple of comments above, then I think we're close to merging it in!

meridionaljet commented 1 year ago

Requested tweaks by @iainrussell have been implemented

iainrussell commented 1 year ago

Thank you @meridionaljet! I really like this improvement, and the fact that you added documentation, a test, and also a way to disable it in case it is used as part of a long-running server. Thanks also for being patient with my suggestions and taking them on board. I think this solution is nice because it works 'out of the box' and does not carry the risk of corrupted xarrays if the incoming GRIB file has multiple geometries. Thanks again!

martindurant commented 1 year ago

For kerchunk's use, we would really most like to simply not calculate coordinates at all, as we can store them elsewhere. If it were possible, then, to skip past the bytes that define the geometry and go straight to the actual measurements in a given message, all the better. Do you think this is possible?

iainrussell commented 1 year ago

Hi @martindurant, could you create a new issue for this use case please? It would be good to see an example of a GRIB file and how you would like the resulting xarray to look. It's not clear if you want to remove all the coordinates, including the time and vertical dimensions, and if this is for performance, memory or aesthetics. So if it really would be useful, pop it in another issue and we can discuss it there! Cheers, Iain

TAdeJong commented 10 months ago

Edit: cfgrib 0.9.11.0 incorporating these changes has now been released! :grinning:

~This pull request greatly increases the speed of our workflow. However, installing from source is somewhat of a hassle. @iainrussell, I see you have recently been working on this repository again. Are there plans for a new release soon? It would greatly help us, and I am sure a lot of other people using GRIB files and xarray 😄.~

~(I couldn't really think of another place to ask this, so I hope this way is OK.)~
