Open Metamess opened 1 year ago
I am willing to try my hand at making a PR to implement this change, after some input on the proposed choices!
Hi @Metamess, thank you for your well considered suggestions! I'm in the process of writing a similarly considered response, but meetings etc are getting in the way! Basically I agree with most of what you say, but I'm trying to formulate what I think is the best solution. Should be able to post it soon.
My first thoughts on this, in bullet-point format:
So, reflecting on the above, taking backwards-compatibility into consideration, my current thinking is this:
Edited to add: I apologize for the fact that my replies seem to turn into essays 😅
Thanks for the response @iainrussell ! I think we can make this work. Especially if we loosen the requirements related to the FileStream match, I agree overwriting in case of an incompatible index file should always be fine. Since you proposed to go for a version without changing the way index keys work, I will make that my first attempt; though I still think we could solve it all in one go. Let me know what you think!
Effectively, we could remove the FileStream object from the index file. (...) What do you think, is there a flaw in that logic?
I agree that we don't need to use it for matching at all. The default behavior will find the index file because it is in the same folder with the same basename as the GRIB file that is being opened. Users providing their own indexpath remain responsible for providing a path to a matching index file, just like now. If someone messes up and renames an index file to make it match with the wrong GRIB file, that's on them. (I do assume attempting to use an invalid index file will lead to breaking errors somewhere along the way?).
However, the FileIndex and FieldsetIndex classes do make use of their fieldset attribute (the FileStream instance) to operate, so I don't think we can just remove the FileStream completely. Since (in this new setup) the files may have been moved since the index file was pickled, we would need a new FileStream instance every time we create the FileIndex instance from a file. Luckily, one is already provided in each call that creates a FileIndex instance; simply always setting fieldset to the given FileStream instance would be enough, and I believe that approach would remain compatible with current index files.
If we decide to change the protocol version number (thus breaking backward compatibility anyway), perhaps we could even set the fieldset attribute to None before pickling, saving the need to store (and read in) a FileStream that is never going to be used anyway?
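A minimal sketch of both ideas (the FileIndex here is a simplified stand-in, not cfgrib's actual class; the pickling details are assumptions for illustration):

```python
import pickle


class FileIndex:
    """Simplified stand-in for cfgrib's FileIndex, for illustration only."""

    def __init__(self, fieldset, index_keys):
        self.fieldset = fieldset  # a FileStream in cfgrib
        self.index_keys = index_keys

    def to_indexpath(self, indexpath):
        # Drop the (soon to be stale) FileStream before pickling:
        # it is never reused on load, so storing it is wasted work.
        state = dict(self.__dict__, fieldset=None)
        with open(indexpath, "wb") as f:
            pickle.dump(state, f)

    @classmethod
    def from_indexpath(cls, indexpath, filestream):
        with open(indexpath, "rb") as f:
            state = pickle.load(f)
        self = cls.__new__(cls)
        self.__dict__.update(state)
        # Always rebind to the FileStream of the *current* call, so the
        # index keeps working even if the GRIB file moved since pickling.
        self.fieldset = filestream
        return self
```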
All of this ignores the slightly larger issue of using the index keys as a measure of compatibility. (...) it raises more questions (e.g. when to overwrite the index file if we change the indexing keys)
If I understand correctly, the index keys serve the purpose of enabling us to differentiate between the GRIB messages contained in the same file. The goal there is that the set of index keys is enough to uniquely identify each message (no two messages have the same values for every key). Once that is achieved, any extra keys will not change that original mapping, only add more ways to filter messages. If it is ever not the case that the index keys can uniquely identify messages (like in #344), the index is effectively insufficient, and extra keys would be required anyway for the index to achieve its goal.
This would mean that adding more keys to the index can only 'improve' the index file: if fewer keys were sufficient, more keys will be too, and if they were not sufficient, we should overwrite it with an index using more keys (all the original ones plus any new ones). Hence we wouldn't even need different indexes with different sets of index keys, as we only ever need one index file per GRIB file. As such, we also wouldn't need a hash in the name of the index file, and can just use "{path}.idx" as the default indexpath.
Though if I am wrong here, and there is a scenario where adding more keys leads to different and undesired behavior, we might indeed want to leave the index keys alone for now. (I would be interested to hear about that case though!)
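The subset argument can be stated as two small hypothetical helpers (not existing cfgrib functions):

```python
def index_keys_compatible(required_keys, stored_keys):
    """An existing index is usable if it was built with at least the keys
    the current call needs; extra keys only add ways to filter messages."""
    return set(required_keys) <= set(stored_keys)


def keys_for_rebuild(required_keys, stored_keys):
    """If the stored index is insufficient, rebuild with the union of keys
    so a single index file serves both old and new callers."""
    return sorted(set(required_keys) | set(stored_keys))
```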
Perhaps the protocol version should be included in the hash for the filename (...) it’s a one-off for the greater good
Wouldn't this mean index files can never be backward or forward compatible though? Say protocol version 4 is released, but it can work with version 3 index files; if the version number is part of the hash, it would never manage to match with a compatible version 3 index file (unless it would specifically check for their existence, which I don't think would be the way to go?).
I don't know how common the use case of continued switching between cfgrib versions would be, but to me it sounds pretty rare, and it might move into "supply your own index path" territory. But, technically, any change to the index filename scheme would suffice to prevent overwrites, it needn't be the addition of the protocol version in the keys-hash. Using backwards/forwards compatibility would then always require some form of file search for matches though.
Say we were to do this, perhaps we could add the protocol version to the file extension (my_file.grib.923a8.idx2, assuming protocol version numbers are integers), or add the hash of the protocol version separately (my_file.923a8.af1136.idx) if you want to more neatly allow for arbitrary protocol version values.
FWIW, if we do merely drop the hash (as proposed at the previous point) and call this new approach protocol version 3, we would at least avoid overwriting version 2 index files, even without mentioning the version in the filename.
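For illustration, the two naming schemes could look like this (short_hash and the hash lengths here are made-up stand-ins, not cfgrib's actual hashing scheme):

```python
import hashlib


def short_hash(value, length=5):
    # Stand-in for cfgrib's index-key hashing; the exact scheme differs.
    return hashlib.sha1(repr(value).encode()).hexdigest()[:length]


def indexpath_version_in_extension(path, index_keys, protocol_version):
    # e.g. my_file.grib.923a8.idx2 -- assumes integer version numbers
    return f"{path}.{short_hash(index_keys)}.idx{protocol_version}"


def indexpath_version_hashed(path, index_keys, protocol_version):
    # e.g. my_file.grib.923a8.af1136.idx -- allows arbitrary version values
    return f"{path}.{short_hash(index_keys)}.{short_hash(protocol_version, 6)}.idx"
```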
if we don’t increase the protocol version, we should still be able to make the new version of cfgrib work with old index files
I can't judge when a change to the protocol version number is or isn't warranted, but this could all still work by being explicit about the compatibility when performing the check. Rather than checking protocol number equality, add something like compatible_versions = {"3": ["2", "3"]} and then check if getattr(self, "index_protocol_version", None) in compatible_versions[ALLOWED_PROTOCOL_VERSION]. This does assume no changes to the default indexpath between versions though (more code would be required if you want it to try to find version n-1 index files when no version n file is found).
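Concretely, the check could look like this (ALLOWED_PROTOCOL_VERSION mirrors the constant in cfgrib's messages.py; the mapping and helper function are the proposed, hypothetical addition):

```python
ALLOWED_PROTOCOL_VERSION = "3"

# For each current protocol version, the stored index versions it can read.
compatible_versions = {"3": ["2", "3"]}


def index_protocol_ok(index):
    stored = getattr(index, "index_protocol_version", None)
    return stored in compatible_versions[ALLOWED_PROTOCOL_VERSION]
```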
Lastly, there was one more common index-file-invalidation case I wanted to explicitly address: GRIB files (and their index files) may move between systems (downloads to local machines from some central storage, for example). Due to the index_mtime >= filestream_mtime check, if the GRIB file gets copied before the index file, the index file would remain valid, but if the index file happens to get copied just a fraction of a second earlier, it is invalidated. Now, with the new always-overwrite approach, this would be 'fixed' on the first read at the cost of recreating the index file; but this is technically an undesirable and unneeded invalidation.
Sadly I don't know of a perfect alternative. I believe (?) the sole purpose of the mtime check is to invalidate the index file in the case of edits to the GRIB file. Checking the GRIB file size might be an OK alternative (especially given that the index file stores byte offsets, filesize changes should probably invalidate index files), except that (depending on any compression) minor edits to the GRIB file might not change the file size, and neither would changing the internal order of GRIB messages.
It is effectively a trade-off: How often do 1) GRIB edits happen, that 2) don't alter the file size, but 3) do invalidate the index (the value of an index key property is changed, or the byte offset is changed), vs: How often do 1) index files get copied to another filesystem, where 2) the index file gets an mtime earlier than that of the accompanying GRIB file. I expect the latter to occur more than the former, but the latter can be auto-fixed with relative ease, whereas the former would not be so straightforward. So I'm not sure what the vision would be on this?
Note that I'm also assuming here that copying between systems does not alter file size, which may or may not hold. If it doesn't, that in itself might make the index file invalid, so perhaps file size should always be added as a check anyway?
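A sketch of such a combined check (stored_grib_size would have to be saved into the index file at indexing time; all names here are illustrative, not cfgrib's actual code):

```python
import os


def index_seems_valid(grib_path, indexpath, stored_grib_size):
    """Treat the index as valid only if the GRIB file size still matches
    what was recorded at indexing time and the index is not older than
    the GRIB file. The size check survives copying between filesystems,
    which mtime comparisons do not."""
    try:
        grib_stat = os.stat(grib_path)
        index_stat = os.stat(indexpath)
    except OSError:
        return False
    if grib_stat.st_size != stored_grib_size:
        return False
    return index_stat.st_mtime >= grib_stat.st_mtime
```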
Thanks for your messages. I encounter this slowness as well when trying to load a large number of GRIB files in xarray.
Maybe a dumb question: why doesn't cfgrib rely on the ecCodes codes_index_create() function? From my limited usage, it seems to perform well, handles heterogeneous files, and the index can be pre-generated using grib_index_build.
I agree
Is your feature request related to a problem? Please describe.
Very frequently, an index file created for a GRIB file is considered not valid by cfgrib when trying to open that GRIB file again in a future call. Users get the message "Ignoring index file {path} incompatible with GRIB file", without it being made clear what exactly caused the index file to be incompatible. In fact, in many if not most cases the index file would actually still be perfectly valid; it just isn't recognized as such. Furthermore, on calls to open_datasets this message can appear many times as multiple calls to dataset.py::open_fileindex occur, and every time it prompts a new in-memory index to be generated, greatly slowing down the process of opening the file.

Describe the solution you'd like
A change to the way the compatibility of the index file is considered, to be more accommodating. Ideally, index files remain valid if a GRIB file is moved, and regardless of whether the path is relative or absolute. The index file should also remain valid as long as the required index_keys are a subset of the index_keys present in the index file, and if the "errors" mode requested is at least as restrictive as the one of the index file.
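The errors-mode part could be expressed with an explicit restrictiveness ordering. The three mode names match cfgrib's errors parameter; the numeric ordering itself is an assumption made for this sketch:

```python
# Assumed ordering: "raise" is the most restrictive errors mode.
ERRORS_RESTRICTIVENESS = {"ignore": 0, "warn": 1, "raise": 2}


def errors_mode_compatible(requested, stored):
    """An index built under an errors mode at least as restrictive as the
    requested one can safely serve the request."""
    return ERRORS_RESTRICTIVENESS[stored] >= ERRORS_RESTRICTIVENESS[requested]
```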
Implementation details:
Currently, there are the following checks to consider the validity of an index file (in messages.py::FileIndex.from_indexpath_or_filestream): messages.py::ALLOWED_PROTOCOL_VERSION
Additionally, unless a specific index path is provided, the index path is created based on the path to the GRIB file plus the short_hash of the index_keys. This means that in the majority of cases, the equality of the index_keys is effectively already verified, and the re-use of index files is also limited to calls with equal index_keys.
Instead of checking equality of the FileStream instances, we should check:
The 'mtime' and index protocol version checks can likely remain as-is. However, if an index file is found at the expected location with an mtime earlier than the mtime of the GRIB file, that would imply that given this check, that index file will never be valid again. This means it would be better to just recreate it and overwrite the old index file, instead of skipping it and not storing a new, valid index file.
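As a sketch, that rebuild-and-overwrite flow might look like this (the two callables stand in for cfgrib's actual load and build logic in from_indexpath_or_filestream; the function name is hypothetical):

```python
import os


def load_or_rebuild_index(grib_path, indexpath, load_index, build_and_save_index):
    """Reuse the index only if it is at least as new as the GRIB file;
    otherwise rebuild it and overwrite the stale file, so subsequent
    calls do not fall back to an in-memory index again and again."""
    if os.path.exists(indexpath) and (
        os.path.getmtime(indexpath) >= os.path.getmtime(grib_path)
    ):
        return load_index(indexpath)
    return build_and_save_index(grib_path, indexpath)
```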
Additionally, and separate from the previously mentioned changes, the equality requirement of the index_keys could be relaxed. However, as this involves a potentially more breaking change, and is (probably?) a far less frequently occurring issue, this part is optional and could be disregarded:
Instead of checking the equality of the list of index keys, it should be checked if the currently required list is a subset of the index keys of the existing index file. More index_keys on the existing index just means it could differentiate more messages than required, but it can fully serve the needs of the current call.
To truly accommodate the use of such subsets of index keys, the default index path format would need to change, as it uses the hash of index_keys. This actually causes the greatest barrier to this change, as there is not clear replacement. It could be debated how often it really occurs that two index files are requested for the same file with differing sets of index_keys. This argument also directly argues against the necessity of a change related to index_keys.
A possible solution could be to drop the hash part of the index filename entirely, and just use {path}.idx as the default instead. Then, when during the check for a match of index_keys it is found that the current call requires a key that is not present on the existing index file, a new index file should be created with the union of the keys of the current call and the existing file. This new index file would then replace the existing file at {path}.idx, effectively removing the need to keep multiple index files for various sets of index_keys. If a user would, for whatever reason, like to keep separate index files based on different sets of index_keys, they can still get this behavior by passing the current default index path {path}.{short_hash}.idx as the index_path parameter, which would result in the same behavior as the current implementation even if the aforementioned relaxation of the index_keys requirements were implemented.

Describe alternatives you've considered
It would be ideal to be able to use a hash of the target GRIB file, instead of things like the path and the mtime. But hashing very large GRIB files would take a prohibitively long amount of time, making this approach infeasible.