Closed. meridionaljet closed this pull request 1 year ago.
Thank you for your excellent work, it is quite an impressive improvement.
However, as you can see, not all the tests have passed. I suggest you move your test into test_40_xarray_store.py, which already deals with the backend_kwargs options; then you do not need `from cfgrib import open_dataset` and can simply use `xarray_store.open_dataset(...)` to open your data.
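A minimal sketch of what the relocated test could look like (the sample file name, the `get_first_geo_coords()` call signature, and the `"geocoords"` backend_kwargs key are placeholders based on this PR's description, not the final API):

```python
import os

from cfgrib import dataset, xarray_store

SAMPLE_DATA_FOLDER = os.path.join(os.path.dirname(__file__), "sample-data")
TEST_DATA = os.path.join(SAMPLE_DATA_FOLDER, "era5-levels-members.grib")


def test_open_dataset_precomputed_geocoords() -> None:
    # hypothetical: compute the geometry once from the first record ...
    geocoords = dataset.get_first_geo_coords(TEST_DATA)
    # ... then pass it back via backend_kwargs, using xarray_store.open_dataset
    # instead of importing open_dataset from cfgrib directly
    res = xarray_store.open_dataset(TEST_DATA, backend_kwargs={"geocoords": geocoords})
    assert "latitude" in res.coords
```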
Patch coverage: 100.00% and project coverage change: +0.03% :tada:
Comparison is base (3951355) 95.44% compared to head (1ce00c1) 95.47%.
Hello - any chance of merging this soon? Looks like new conflicts with master are being introduced as this sits.
Hi @meridionaljet , have the concerns from @sandorkertesz been addressed? I see that there is currently a conflict. Can you please fix it? This PR may need to be first merged to develop branch, but I leave that for @sandorkertesz to decide.
Yes, the commits on April 18th addressed what @sandorkertesz had said. I just fixed the new conflict due to an update to master - should be good to go now unless you guys want to adjust the implementation.
Hi @meridionaljet,
Thank you for your contribution, it clearly makes a big difference in these cases! I have a concern that it opens up possibilities for inexperienced users to use this feature and end up with a dataset that is incorrect (it could happen if the GRIB file contains fields with different geometry, but the user uses this feature anyway, unaware that they will break things, and I'm not sure the error would be easy for them to track down).
I'd be interested to know what you think about an alternative approach, which would be to do this automatically: store the geometry for the first field, then, for each subsequent field, check if the geometry section is the same as the first one (for GRIB2, this can be checked via the key 'md5Section3'; for GRIB1 it would be 'md5Section2'). If they are the same, then re-use the stored geometry. We can then decide whether to always compare against the first field, or if we store the geometry of the last field whose geometry changed. This approach would work without the user having to do anything, and would not, I think, be prone to accidental errors. The only disadvantage would be that you could not store the geometry offline, but I'd be worried about users using 'wrong' geometry files if they stored it offline anyway.
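Roughly, in standalone form the check could look like this sketch (plain eccodes calls rather than cfgrib internals; apart from the section-MD5 keys mentioned above, the names are illustrative):

```python
import eccodes


def iter_geometries(path):
    """Yield (latitudes, longitudes) for each message, recomputing them only
    when the geometry section's MD5 differs from the previously seen one."""
    cached_md5, cached_geometry = None, None
    with open(path, "rb") as f:
        while True:
            gid = eccodes.codes_grib_new_from_file(f)
            if gid is None:
                break
            # GRIB2 stores the grid in section 3, GRIB1 in section 2
            md5_key = "md5Section3" if eccodes.codes_get(gid, "edition") == 2 else "md5Section2"
            md5 = eccodes.codes_get(gid, md5_key)
            if md5 != cached_md5:
                lats = eccodes.codes_get_array(gid, "latitudes")
                lons = eccodes.codes_get_array(gid, "longitudes")
                cached_md5, cached_geometry = md5, (lats, lons)
            eccodes.codes_release(gid)
            yield cached_geometry
```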
Cheers, Iain
I took a look at the feasibility of this. For `cfgrib.open_dataset()`, it is straightforward to implement a cache inside `cfgrib.dataset.build_dataset_components()`.
However, for `cfgrib.open_datasets()`, `build_dataset_components()` is called multiple times, and the execution path of `open_datasets()` seems to run through `xarray_store.open_dataset()`. This would require the cache to be implemented globally, which is easy enough, but in turn would require some sort of state management to avoid the cache growing unboundedly in long-lived applications that open many different grid geometries. A possible idea is to use `functools.lru_cache` somehow to restrict the number of entries in the global cache, though it has no knowledge of the memory size of each entry, so it is difficult to pick an appropriate limit.
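As a rough sketch (the names and the choice of cache key are mine, not cfgrib code), a bounded global cache could look something like this; note that `maxsize` caps the number of entries, not their memory footprint:

```python
import functools

import eccodes


@functools.lru_cache(maxsize=16)  # caps the number of cached geometries, not their byte size
def cached_geography_coordinates(path, offset, geometry_md5):
    """Recompute coordinates only on a cache miss. geometry_md5 is part of the
    key so that a file rewritten in place cannot return stale coordinates."""
    with open(path, "rb") as f:
        f.seek(offset)
        gid = eccodes.codes_grib_new_from_file(f)
        try:
            lats = eccodes.codes_get_array(gid, "latitudes")
            lons = eccodes.codes_get_array(gid, "longitudes")
        finally:
            eccodes.codes_release(gid)
    return lats, lons
```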
That said, we could reasonably assume that the geometry alone is small enough not to be a concern for 99.9% of workflows. I suppose we could also make geometry caching an optional feature enabled or disabled by the user as a backend_kwarg.
In light of these observations, what do you think about this approach?
@iainrussell @tlmquintino @sandorkertesz I have posted an alternative (and I believe better) implementation at #341.
This is different enough that I felt it warranted a separate PR to avoid confusion. Please let me know what you think!
This addresses a massive performance bottleneck when opening a GRIB file as an `xarray` dataset. Currently, `cfgrib` calls `cfgrib.dataset.build_geography_coordinates()` for every parameter in the index when creating a dataset. Each call requires eccodes' `grib_get_array` to be called, which reads coordinate arrays from disk. This is prohibitively expensive for large files with many records, and often unnecessary, since such files typically have identical grids for each record.
This pull request introduces the option for the user to include pre-computed geographic coordinate data in `backend_kwargs` when calling `cfgrib.open_dataset()` or `cfgrib.open_datasets()` in cases where the user knows that their GRIB file has identical geometry for each parameter. The utility function `cfgrib.dataset.get_first_geo_coords()` has been added, which computes coordinates for the first record in the index. The user can then use the output as follows:
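(Sketch reconstructing the intended usage; the exact `get_first_geo_coords()` signature and the `"geocoords"` backend_kwargs key are assumptions based on this description, not the final API.)

```python
import cfgrib
from cfgrib import dataset

path = "hrrr.t00z.wrfsfcf00.grib2"  # any GRIB file whose records share one grid

# compute the geographic coordinates once, from the first record in the index
geocoords = dataset.get_first_geo_coords(path)

# reuse them for every parameter instead of rebuilding the grid from disk
ds = cfgrib.open_dataset(path, backend_kwargs={"geocoords": geocoords})
```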
`geocoords` could be serialized and reused by the user for multiple files with identical geometry, requiring zero calls to `build_geography_coordinates()` once computed for the first time.
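For example, continuing from the snippet above (file names hypothetical, assuming `geocoords` pickles cleanly):

```python
import pickle

import cfgrib

# save the geometry once ...
with open("hrrr_geocoords.pkl", "wb") as f:
    pickle.dump(geocoords, f)

# ... and later reuse it for any file that shares the same grid
with open("hrrr_geocoords.pkl", "rb") as f:
    geocoords = pickle.load(f)
ds = cfgrib.open_dataset("hrrr.t01z.wrfsfcf00.grib2", backend_kwargs={"geocoords": geocoords})
```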
This approach reduces the `cfgrib.open_dataset()` time for a 262MB HRRR file from NCEP from 3.4 seconds to 45 milliseconds on my machine. If the full 400MB HRRR file with 43 different hypercube types is opened with `cfgrib.open_datasets()`, the time taken is reduced from 38 seconds to 2 seconds. This is a speedup of 1-2 orders of magnitude, depending on the size of the file and the number of unique hypercubes.
A basic test has been added to the test suite as part of this pull request.
I have not contributed to `cfgrib` before, but as far as I can tell, adding this optional feature cannot break anything unless the user employs it on a file for which it is inappropriate, and such files are uncommon. The speedup relieves a significant bottleneck in data processing workflows using `xarray` and `cfgrib`, especially for large files, making `xarray` dataset creation for GRIB almost as cheap as it is for other data formats like NetCDF and Zarr.