ecmwf / cfgrib

A Python interface to map GRIB files to the NetCDF Common Data Model following the CF Convention using ecCodes
Apache License 2.0

Dramatically speed up dataset creation by pre-computing geographic coordinates #338

Closed: meridionaljet closed this 1 year ago

meridionaljet commented 1 year ago

This addresses a massive performance bottleneck when opening a GRIB file as an xarray dataset. Currently, cfgrib calls cfgrib.dataset.build_geography_coordinates() for every parameter in the index when creating a dataset. Each call invokes ecCodes' grib_get_array, which reads the coordinate arrays from disk. This is prohibitively expensive for large files with many records, and it is often unnecessary, since such files typically have an identical grid for every record.

This pull request introduces the option for the user to include pre-computed geographic coordinate data in backend_kwargs when calling cfgrib.open_dataset() or cfgrib.open_datasets() in cases where the user knows that their GRIB file has identical geometry for each parameter. The utility function cfgrib.dataset.get_first_geo_coords() has been added, which computes coordinates for the first record in the index. The user can then use the output as follows:

```python
geocoords = cfgrib.dataset.get_first_geo_coords(filename)
ds = cfgrib.open_dataset(filename, backend_kwargs=dict(precomputed_geo_coords=geocoords))
```

geocoords could be serialized and reused for multiple files with identical geometry, requiring zero calls to build_geography_coordinates() once it has been computed the first time.
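As a minimal sketch of that reuse pattern (the file names here are hypothetical, and pickling assumes the coordinate object returned by get_first_geo_coords() is picklable):

```python
# Hypothetical sketch of caching geocoords across files with identical grids;
# file names are placeholders, and picklability of the returned object is assumed.
import pickle

import cfgrib

# Compute the geographic coordinates once, from the first file.
geocoords = cfgrib.dataset.get_first_geo_coords("hrrr_f00.grib2")
with open("hrrr_geocoords.pkl", "wb") as f:
    pickle.dump(geocoords, f)

# Later (possibly in another process), reuse them for any file known
# to share the same grid, skipping build_geography_coordinates() entirely.
with open("hrrr_geocoords.pkl", "rb") as f:
    geocoords = pickle.load(f)

for path in ("hrrr_f01.grib2", "hrrr_f02.grib2"):
    ds = cfgrib.open_dataset(path, backend_kwargs=dict(precomputed_geo_coords=geocoords))
```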

This approach reduces the cfgrib.open_dataset() time for a 262 MB HRRR file from NCEP from 3.4 seconds to 45 milliseconds on my machine. If the full 400 MB HRRR file with 43 different hypercube types is opened with cfgrib.open_datasets(), the time drops from 38 seconds to 2 seconds. The result is a speedup of one to two orders of magnitude, depending on the size of the file and the number of unique hypercubes.

A basic test has been added to the test suite as part of this pull request.

I have not contributed to cfgrib before, but as far as I can tell, adding this optional feature cannot break anything unless the user applies it to a file for which it is inappropriate, and such files are uncommon. The speedup removes a significant bottleneck in data processing workflows using xarray and cfgrib, especially for large files, making xarray dataset creation for GRIB almost as cheap as it is for other data formats such as NetCDF and Zarr.

FussyDuck commented 1 year ago

CLA assistant check
All committers have signed the CLA.

sandorkertesz commented 1 year ago

Thank you for your excellent work, it is quite an impressive improvement.

However, as you can see, not all the tests have passed. I suggest you move your test into test_40_xarray_store.py, which already deals with the backend_kwargs options; then you do not need

```python
from cfgrib import open_dataset
```

and can simply use

```python
xarray_store.open_dataset(...)
```

to open your data.
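For instance, the relocated test might look roughly like this (a sketch only; the sample file name and the equality assertion are assumptions modeled on the existing tests, not the PR's actual test):

```python
# Hypothetical test in tests/test_40_xarray_store.py; the sample file name
# and assertions are assumptions, not the PR's actual test code.
import os

from cfgrib import dataset, xarray_store

SAMPLE_DATA_FOLDER = os.path.join(os.path.dirname(__file__), "sample-data")
TEST_DATA = os.path.join(SAMPLE_DATA_FOLDER, "era5-levels-members.grib")


def test_open_dataset_precomputed_geo_coords() -> None:
    geocoords = dataset.get_first_geo_coords(TEST_DATA)
    res = xarray_store.open_dataset(
        TEST_DATA, backend_kwargs=dict(precomputed_geo_coords=geocoords)
    )
    # The precomputed-coordinate path should reproduce the default result.
    expected = xarray_store.open_dataset(TEST_DATA)
    assert res.equals(expected)
```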

codecov-commenter commented 1 year ago

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.03% :tada:

Comparison is base (3951355) 95.44% compared to head (1ce00c1) 95.47%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #338      +/-   ##
==========================================
+ Coverage   95.44%   95.47%   +0.03%
==========================================
  Files          26       26
  Lines        2040     2057      +17
  Branches      236      237       +1
==========================================
+ Hits         1947     1964      +17
  Misses         62       62
  Partials       31       31
```

| [Impacted Files](https://codecov.io/gh/ecmwf/cfgrib/pull/338?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ecmwf) | Coverage Δ | |
|---|---|---|
| [cfgrib/xarray_plugin.py](https://codecov.io/gh/ecmwf/cfgrib/pull/338?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ecmwf#diff-Y2ZncmliL3hhcnJheV9wbHVnaW4ucHk=) | `88.40% <ø> (ø)` | |
| [cfgrib/dataset.py](https://codecov.io/gh/ecmwf/cfgrib/pull/338?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ecmwf#diff-Y2ZncmliL2RhdGFzZXQucHk=) | `98.45% <100.00%> (+0.05%)` | :arrow_up: |
| [tests/test_40_xarray_store.py](https://codecov.io/gh/ecmwf/cfgrib/pull/338?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=ecmwf#diff-dGVzdHMvdGVzdF80MF94YXJyYXlfc3RvcmUucHk=) | `100.00% <100.00%> (ø)` | |


meridionaljet commented 1 year ago

Hello - any chance of merging this soon? Looks like new conflicts with master are being introduced as this sits.

tlmquintino commented 1 year ago

> Hello - any chance of merging this soon? Looks like new conflicts with master are being introduced as this sits.

Hi @meridionaljet, have the concerns from @sandorkertesz been addressed? I see that there is currently a conflict; can you please fix it? This PR may need to be merged into the develop branch first, but I leave that for @sandorkertesz to decide.

meridionaljet commented 1 year ago

> > Hello - any chance of merging this soon? Looks like new conflicts with master are being introduced as this sits.
>
> Hi @meridionaljet, have the concerns from @sandorkertesz been addressed? I see that there is currently a conflict; can you please fix it? This PR may need to be merged into the develop branch first, but I leave that for @sandorkertesz to decide.

Yes, the commits on April 18th addressed what @sandorkertesz had said. I have just fixed the new conflict caused by an update to master, so this should be good to go now unless you want to adjust the implementation.

iainrussell commented 1 year ago

Hi @meridionaljet,

Thank you for your contribution, it clearly makes a big difference in these cases! My concern is that it opens up the possibility for inexperienced users to use this feature and end up with an incorrect dataset: this could happen if the GRIB file contains fields with different geometries but the user uses the feature anyway, unaware that they will break things, and I'm not sure the resulting error would be easy for them to track down.

I'd be interested to know what you think about an alternative approach, which would be to do this automatically: store the geometry for the first field, then, for each subsequent field, check whether the geometry section is the same as the first one (for GRIB2, this can be checked via the key 'md5Section3'; for GRIB1 it would be 'md5Section2'). If they are the same, re-use the stored geometry. We can then decide whether to always compare against the first field, or to store the geometry of the last field whose geometry changed. This approach would work without the user having to do anything and would not, I think, be prone to accidental errors. The only disadvantage is that you could not store the geometry offline, but I'd be worried about users applying the 'wrong' geometry files if they stored them offline anyway.
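For illustration, the check Iain describes might look roughly like this standalone sketch (not the PR's code: compute_geo_coords is a hypothetical stand-in for cfgrib's coordinate building, and only the GRIB2 key is handled):

```python
# Standalone sketch of the md5-based geometry check described above;
# compute_geo_coords is a hypothetical stand-in for cfgrib's coordinate code.
import eccodes


def geo_coords_per_message(path, compute_geo_coords):
    """Yield geographic coordinates for each GRIB message in the file,
    recomputing only when the grid definition section changes."""
    cached_md5 = None
    cached_coords = None
    with open(path, "rb") as f:
        while True:
            gid = eccodes.codes_grib_new_from_file(f)
            if gid is None:
                break
            try:
                # 'md5Section3' hashes the GRIB2 grid definition section;
                # a GRIB1 message would need 'md5Section2' instead.
                md5 = eccodes.codes_get(gid, "md5Section3")
                if md5 != cached_md5:
                    cached_coords = compute_geo_coords(gid)
                    cached_md5 = md5
                yield cached_coords
            finally:
                eccodes.codes_release(gid)
```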

Cheers, Iain

meridionaljet commented 1 year ago

> Hi @meridionaljet,
>
> Thank you for your contribution, it clearly makes a big difference in these cases! My concern is that it opens up the possibility for inexperienced users to use this feature and end up with an incorrect dataset: this could happen if the GRIB file contains fields with different geometries but the user uses the feature anyway, unaware that they will break things, and I'm not sure the resulting error would be easy for them to track down.
>
> I'd be interested to know what you think about an alternative approach, which would be to do this automatically: store the geometry for the first field, then, for each subsequent field, check whether the geometry section is the same as the first one (for GRIB2, this can be checked via the key 'md5Section3'; for GRIB1 it would be 'md5Section2'). If they are the same, re-use the stored geometry. We can then decide whether to always compare against the first field, or to store the geometry of the last field whose geometry changed. This approach would work without the user having to do anything and would not, I think, be prone to accidental errors. The only disadvantage is that you could not store the geometry offline, but I'd be worried about users applying the 'wrong' geometry files if they stored them offline anyway.
>
> Cheers, Iain

I took a look at the feasibility of this. For cfgrib.open_dataset(), it is straightforward to implement a cache inside cfgrib.dataset.build_dataset_components().

However, for cfgrib.open_datasets(), build_dataset_components() is called multiple times, and the execution path of open_datasets() runs through xarray_store.open_dataset(). This would require the cache to be implemented globally, which is easy enough, but in turn would require some state management to keep the cache from growing without bound in long-lived applications that open many different grid geometries. One possibility is to use functools.lru_cache to restrict the number of entries in the global cache, though lru_cache knows nothing about the memory size of each entry, so it is difficult to pick an appropriate limit.
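A rough sketch of such a size-bounded global cache, keyed by the geometry hash (all names and the entry bound here are illustrative, not a proposed API):

```python
# Illustrative size-bounded global geometry cache; the key would be something
# like the md5Section3 hash, and build_coords stands in for
# build_geography_coordinates(). Names and the bound are placeholders.
import collections
import threading

_MAX_ENTRIES = 8  # arbitrary bound; entry memory sizes are unknown, as noted above
_geo_cache: "collections.OrderedDict[str, object]" = collections.OrderedDict()
_geo_cache_lock = threading.Lock()


def get_cached_geo_coords(geometry_md5, build_coords):
    with _geo_cache_lock:
        if geometry_md5 in _geo_cache:
            _geo_cache.move_to_end(geometry_md5)  # mark as most recently used
            return _geo_cache[geometry_md5]
    # Build outside the lock; at worst two threads duplicate the work once.
    coords = build_coords()
    with _geo_cache_lock:
        _geo_cache[geometry_md5] = coords
        _geo_cache.move_to_end(geometry_md5)
        while len(_geo_cache) > _MAX_ENTRIES:
            _geo_cache.popitem(last=False)  # evict the least recently used entry
    return coords
```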

That said, we could reasonably assume that the geometry alone is small enough not to be a concern for 99.9% of workflows. I suppose we could also make geometry caching an optional feature, enabled or disabled by the user via a backend_kwarg.

In light of these observations, what do you think about this approach?

meridionaljet commented 1 year ago

@iainrussell @tlmquintino @sandorkertesz I have posted an alternative (and I believe better) implementation at #341.

This is different enough that I felt it warranted a separate PR to avoid confusion. Please let me know what you think!