ecmwf / cfgrib

A Python interface to map GRIB files to the NetCDF Common Data Model following the CF Convention using ecCodes
Apache License 2.0

Restrictions on reusing an index file are too restrictive #350

Open Metamess opened 1 year ago

Metamess commented 1 year ago

Is your feature request related to a problem? Please describe.

Very frequently, an index file created for a GRIB file is considered invalid by cfgrib when that GRIB file is opened again in a later call. Users get the message "Ignoring index file {path} incompatible with GRIB file", which does not make clear what exactly caused the index file to be incompatible. In fact, in many if not most cases the index file is still perfectly valid; it just isn't recognized as such.

Furthermore, on calls to open_datasets this message can appear many times as multiple calls to dataset.py::open_fileindex occur, and every time it prompts a new in-memory index to be generated, greatly slowing down the process of opening the file.

Describe the solution you'd like

A change to the way the compatibility of the index file is assessed, to make it more accommodating. Ideally, an index file should remain valid if the GRIB file is moved, and regardless of whether the path is relative or absolute. The index file should also remain valid as long as the required index_keys are a subset of the index_keys present in the index file, and if the "errors" mode requested is at least as restrictive as the one of the index file.

Implementation details:

Currently, there are the following checks to consider the validity of an index file (in messages.py::FileIndex.from_indexpath_or_filestream):

Additionally, unless a specific index path is provided, the index path is created based on the path to the GRIB file plus the short_hash of the index_keys. This means that in the majority of cases, the equality of the index_keys is effectively already verified, and the re-use of index files is also limited to calls with equal index_keys.
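To make the effect of the hash concrete, here is a minimal sketch of how a default index path of the form {path}.{short_hash}.idx ties the file name to the exact list of index_keys. The helper name `default_index_path` is hypothetical, and hashing the `repr` of the key list with MD5 is my assumption about how the hash is derived, not a statement of what cfgrib does internally.

```python
import hashlib

DEFAULT_INDEXPATH = "{path}.{short_hash}.idx"


def default_index_path(grib_path, index_keys, template=DEFAULT_INDEXPATH):
    """Build an index path whose name depends on the list of index keys.

    Hypothetical helper: the hashing scheme is an assumption for illustration.
    """
    # Any change to the keys (or their order) yields a different hash,
    # and therefore a different index file name.
    full_hash = hashlib.md5(repr(list(index_keys)).encode("utf-8")).hexdigest()
    return template.format(path=grib_path, hash=full_hash, short_hash=full_hash[:5])
```

With a scheme like this, a call that needs one extra key looks for a differently named file, so an existing index is never even considered for re-use.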

Instead of checking equality of the FileStream instances, we should check:

The 'mtime' and index protocol version checks can likely remain as-is. However, if an index file is found at the expected location with an mtime earlier than the mtime of the GRIB file, that would imply that given this check, that index file will never be valid again. This means it would be better to just recreate it and overwrite the old index file, instead of skipping it and not storing a new, valid index file.
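A rough sketch of the "recreate and overwrite" behavior described above, under stated assumptions: the index is stored with `pickle`, `build_index` is a hypothetical callable that scans the GRIB file, and `open_index` is not an existing cfgrib function.

```python
import os
import pickle


def open_index(grib_path, index_path, build_index):
    """Load a cached index, or rebuild and overwrite it when missing or stale."""
    if os.path.exists(index_path):
        # An index older than the GRIB file can never become valid again.
        if os.path.getmtime(index_path) >= os.path.getmtime(grib_path):
            with open(index_path, "rb") as f:
                return pickle.load(f)
    # Missing or stale: rebuild, then overwrite so the next call can re-use it.
    index = build_index(grib_path)
    with open(index_path, "wb") as f:
        pickle.dump(index, f)
    return index
```

The point of the design is that a stale index costs one rebuild, instead of one rebuild on every subsequent call.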

Additionally, and separately from the previously mentioned changes, the equality requirement on the index_keys could be relaxed. However, as this is a potentially more breaking change, and is (probably?) a far less frequent issue, this part is optional and could be disregarded:

Instead of checking the equality of the list of index keys, it should be checked if the currently required list is a subset of the index keys of the existing index file. More index_keys on the existing index just means it could differentiate more messages than required, but it can fully serve the needs of the current call.
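As a sketch, the relaxed check amounts to a subset test rather than an equality test. The function name `index_keys_compatible` is hypothetical and only illustrates the idea.

```python
def index_keys_compatible(required_keys, existing_keys):
    """Return True if an existing index can serve a call needing required_keys."""
    # Extra keys on the existing index are harmless; missing ones are not.
    return set(required_keys) <= set(existing_keys)
```

For example, an index built with `["paramId", "level", "step"]` could serve a call that only needs `["paramId", "level"]`, but not the other way around.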

To truly accommodate the use of such subsets of index keys, the default index path format would need to change, as it uses the hash of index_keys. This is actually the greatest barrier to this change, as there is no clear replacement. It is debatable how often two index files are really requested for the same file with differing sets of index_keys. That same consideration also argues against the necessity of any change related to index_keys.

A possible solution could be to drop the hash part of the index filename entirely, and just use {path}.idx as the default instead. Then, if the check for a match of index_keys finds that the current call requires a key that is not present in the existing index file, a new index file should be created with the union of the keys of the current call and the existing file. This new index file would then replace the existing file at {path}.idx, effectively removing the need to keep multiple index files for various sets of index_keys.
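A small sketch of the key-merging step, assuming index keys are plain lists of strings. The name `merged_index_keys` is hypothetical; the returned flag says whether the index at {path}.idx would need to be rebuilt.

```python
def merged_index_keys(existing_keys, required_keys):
    """Return (keys, needs_rebuild) for an index that must serve required_keys."""
    # Keep the existing order and append only the keys that are missing.
    missing = [key for key in required_keys if key not in existing_keys]
    if not missing:
        return list(existing_keys), False
    return list(existing_keys) + missing, True
```

When `needs_rebuild` is False the existing index is re-used as-is; otherwise a new index is built with the merged keys and written over the old file.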

If a user would, for whatever reason, like to keep separate index files based on different sets of index_keys, they can still get this behavior by passing the current default index path {path}.{short_hash}.idx as index_path parameter, which would result in the same behavior as the current implementation even if the aforementioned relaxation on index_keys requirements was implemented.

Describe alternatives you've considered

It would be ideal to be able to use a hash of the target GRIB file, instead of things like the path and the mtime. But hashing very large GRIB files would take prohibitively long, making this approach infeasible.

Additional context

No response

Organisation

No response

Metamess commented 1 year ago

I am willing to try my hand at making a PR to implement this change, after some input on the proposed choices!

iainrussell commented 1 year ago

Hi @Metamess, thank you for your well considered suggestions! I'm in the process of writing a similarly considered response, but meetings etc are getting in the way! Basically I agree with most of what you say, but I'm trying to formulate what I think is the best solution. Should be able to post it soon.

iainrussell commented 1 year ago

My first thoughts on this, in bullet-point format:

So, reflecting on the above, taking backwards-compatibility into consideration, my current thinking is this:

Metamess commented 1 year ago

Edited to add: I apologize for the fact that my replies seem to turn into essays 😅

Thanks for the response @iainrussell ! I think we can make this work. Especially if we loosen the requirements related to the FileStream match, I agree overwriting in case of an incompatible index file should always be fine. Since you proposed to go for a version without changing the way index keys work, I will make that my first attempt; though I still think we could solve it all in one go. Let me know what you think!

steph-ben commented 1 year ago

Thanks for your messages. I encounter this slowness as well, when trying to load a large number of GRIB files in xarray.

Maybe a dumb question: why doesn't cfgrib rely on the ecCodes codes_index_create() function?

From my limited usage, it seems to perform well and handle heterogeneous files, and the index can be pre-generated using grib_index_build.

pedroaugustosmribeiro commented 10 months ago

I agree