Open Metamess opened 1 year ago
I am willing to try my hand at making a PR to implement this change, after some input on the proposed choices!
Hi @Metamess, thank you for your well considered suggestions! I'm in the process of writing a similarly considered response, but meetings etc are getting in the way! Basically I agree with most of what you say, but I'm trying to formulate what I think is the best solution. Should be able to post it soon.
My first thoughts on this, in bullet-point format:
So, reflecting on the above, taking backwards-compatibility into consideration, my current thinking is this:
Edited to add: I apologize for the fact that my replies seem to turn into essays 😅
Thanks for the response @iainrussell ! I think we can make this work. Especially if we loosen the requirements related to the FileStream match, I agree overwriting in case of an incompatible index file should always be fine. Since you proposed to go for a version without changing the way index keys work, I will make that my first attempt; though I still think we could solve it all in one go. Let me know what you think!
Effectively, we could remove the FileStream object from the index file. (...) What do you think, is there a flaw in that logic?
I agree that we don't need to use it for matching at all. The default behavior will find the index file because it is in the same folder with the same basename as the GRIB file that is being opened. Users providing their own indexpath remain responsible for providing a path to a matching index file, just like now. If someone messes up and renames an index file to make it match with the wrong GRIB file, that's on them. (I do assume attempting to use an invalid index file will lead to breaking errors somewhere along the way?).
However, the FileIndex and FieldsetIndex classes do make use of their fieldset attribute (the FileStream instance) to operate, so I don't think we can just remove the FileStream completely. Since (in this new setup) the files may have been moved since the index file was pickled, we would need a new FileStream instance every time we create the FileIndex instance from a file. Luckily, one is already provided in each call that creates a FileIndex instance; simply always setting fieldset to the given FileStream instance would be enough, and I believe that approach would remain compatible with current index files.
If we decide to change the protocol version number (thus breaking backward compatibility anyway), perhaps we could even set the fieldset attribute to None before pickling, saving the need to store (and read in) a FileStream that is never going to be used anyway?
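A minimal sketch of both ideas (the FileIndex here is a simplified stand-in, not cfgrib's actual class; the pickling details are assumptions for illustration):

```python
import pickle


class FileIndex:
    """Simplified stand-in for cfgrib's FileIndex, for illustration only."""

    def __init__(self, fieldset, index_keys):
        self.fieldset = fieldset  # a FileStream in cfgrib
        self.index_keys = index_keys

    def to_indexpath(self, indexpath):
        # Drop the (soon to be stale) FileStream before pickling:
        # it is never reused on load, so storing it is wasted work.
        state = dict(self.__dict__, fieldset=None)
        with open(indexpath, "wb") as f:
            pickle.dump(state, f)

    @classmethod
    def from_indexpath(cls, indexpath, filestream):
        with open(indexpath, "rb") as f:
            state = pickle.load(f)
        self = cls.__new__(cls)
        self.__dict__.update(state)
        # Always rebind to the FileStream of the *current* call, so the
        # index keeps working even if the GRIB file moved since pickling.
        self.fieldset = filestream
        return self
```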
All of this ignores the slightly larger issue of using the index keys as a measure of compatibility. (...) it raises more questions (e.g. when to overwrite the index file if we change the indexing keys)
If I understand correctly, the index keys serve the purpose of enabling us to differentiate between the GRIB messages contained in the same file. The goal there is that the set of index keys is enough to uniquely identify each message (no two messages have the same values for every key). Once that is achieved, any extra keys will not change that original mapping, only add more ways to filter messages. If it is ever not the case that the index keys can uniquely identify messages (like in #344), the index is effectively insufficient, and extra keys would be required anyway for the index to achieve its goal.
This would mean that adding more keys to the index can only 'improve' the index file: if fewer keys were sufficient, more keys will be too, and if they were not sufficient, we should overwrite it with an index using more keys (all the original ones plus any new ones). Hence we wouldn't even need different indexes with different sets of index keys, as we only ever need one index file per GRIB file. As such, we also wouldn't need a hash in the name of the index file, and can just use "{path}.idx" as the default indexpath.
Though if I am wrong here, and there is a scenario where adding more keys leads to different and undesired behavior, we might indeed want to leave the index keys alone for now. (I would be interested to hear about that case though!)
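The subset argument can be stated as two small hypothetical helpers (not existing cfgrib functions):

```python
def index_keys_compatible(required_keys, stored_keys):
    """An existing index is usable if it was built with at least the keys
    the current call needs; extra keys only add ways to filter messages."""
    return set(required_keys) <= set(stored_keys)


def keys_for_rebuild(required_keys, stored_keys):
    """If the stored index is insufficient, rebuild with the union of keys
    so a single index file serves both old and new callers."""
    return sorted(set(required_keys) | set(stored_keys))
```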
Perhaps the protocol version should be included in the hash for the filename (...) it’s a one-off for the greater good
Wouldn't this mean index files can never be backward or forward compatible though? Say protocol version 4 is released, but it can work with version 3 index files; if the version number is part of the hash, it would never manage to match with a compatible version 3 index file (unless it would specifically check for their existence, which I don't think would be the way to go?).
I don't know how common the use case of continued switching between cfgrib versions would be, but to me it sounds pretty rare, and it might move into "supply your own index path" territory. But, technically, any change to the index filename scheme would suffice to prevent overwrites, it needn't be the addition of the protocol version in the keys-hash. Using backwards/forwards compatibility would then always require some form of file search for matches though.
Say we were to do this, perhaps we could add the protocol version to the file extension (my_file.grib.923a8.idx2, assuming protocol version numbers are integers), or add the hash of the protocol version separately (my_file.923a8.af1136.idx) if you want to more neatly allow for arbitrary protocol version values.
FWIW, if we do merely drop the hash (as proposed at the previous point) and call this new approach protocol version 3, we would at least avoid overwriting version 2 index files, even without mentioning the version in the filename.
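For illustration, the two naming schemes could look like this (short_hash and the hash lengths here are made-up stand-ins, not cfgrib's actual hashing scheme):

```python
import hashlib


def short_hash(value, length=5):
    # Stand-in for cfgrib's index-key hashing; the exact scheme differs.
    return hashlib.sha1(repr(value).encode()).hexdigest()[:length]


def indexpath_version_in_extension(path, index_keys, protocol_version):
    # e.g. my_file.grib.923a8.idx2 -- assumes integer version numbers
    return f"{path}.{short_hash(index_keys)}.idx{protocol_version}"


def indexpath_version_hashed(path, index_keys, protocol_version):
    # e.g. my_file.grib.923a8.af1136.idx -- allows arbitrary version values
    return f"{path}.{short_hash(index_keys)}.{short_hash(protocol_version, 6)}.idx"
```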
if we don’t increase the protocol version, we should still be able to make the new version of cfgrib work with old index files
I can't judge when a change to the protocol version number is or isn't warranted, but this could all still work by being explicit about the compatibility when performing the check. Rather than checking protocol number equality, add something like compatible_versions = {"3": ["2", "3"]} and then check if getattr(self, "index_protocol_version", None) in compatible_versions[ALLOWED_PROTOCOL_VERSION]. This does assume no changes to the default indexpath between versions though (more code would be required if you want it to try to find version n-1 index files when no version n file is found).
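Concretely, the check could look like this (ALLOWED_PROTOCOL_VERSION mirrors the constant in cfgrib's messages.py; the mapping and helper function are the proposed, hypothetical addition):

```python
ALLOWED_PROTOCOL_VERSION = "3"

# For each current protocol version, the stored index versions it can read.
compatible_versions = {"3": ["2", "3"]}


def index_protocol_ok(index):
    stored = getattr(index, "index_protocol_version", None)
    return stored in compatible_versions[ALLOWED_PROTOCOL_VERSION]
```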
Lastly, there was one more common index-file-invalidation case I wanted to explicitly address: GRIB files (and their index files) may move between systems (downloads to local machines from some central storage, for example). Due to the index_mtime >= filestream_mtime check, if the GRIB file gets copied before the index file, the index file would remain valid, but if the index file happens to get copied just a fraction of a second earlier, it is invalidated. Now, with the new always-overwrite approach, this would be 'fixed' on the first read at the cost of recreating the index file; but this is technically an undesirable and unneeded invalidation.
Sadly I don't know of a perfect alternative. I believe (?) the sole purpose of the mtime check is to invalidate the index file in the case of edits to the GRIB file. Checking the GRIB file size might be an OK alternative (especially given that the index file stores byte offsets, filesize changes should probably invalidate index files), except that (depending on any compression) minor edits to the GRIB file might not change the file size, and neither would changing the internal order of GRIB messages.
It is effectively a trade-off: How often do 1) GRIB edits happen, that 2) don't alter the file size, but 3) do invalidate the index (the value of an index key property is changed, or the byte offset is changed), vs: How often do 1) index files get copied to another filesystem, where 2) the index file gets an mtime earlier than that of the accompanying GRIB file. I expect the latter to occur more than the former, but the latter can be auto-fixed with relative ease, whereas the former would not be so straightforward. So I'm not sure what the vision would be on this?
Note that I'm also assuming here that copying between systems does not alter file size, which may or may not hold. If it doesn't, that in itself might make the index file invalid, so perhaps file size should always be added as a check anyway?
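A sketch of such a combined check (stored_grib_size would have to be saved into the index file at indexing time; all names here are illustrative, not cfgrib's actual code):

```python
import os


def index_seems_valid(grib_path, indexpath, stored_grib_size):
    """Treat the index as valid only if the GRIB file size still matches
    what was recorded at indexing time and the index is not older than
    the GRIB file. The size check survives copying between filesystems,
    which mtime comparisons do not."""
    try:
        grib_stat = os.stat(grib_path)
        index_stat = os.stat(indexpath)
    except OSError:
        return False
    if grib_stat.st_size != stored_grib_size:
        return False
    return index_stat.st_mtime >= grib_stat.st_mtime
```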
Thanks for your messages. I encounter this slowness as well when trying to load a large number of GRIB files in xarray.
Maybe a dumb question: why doesn't cfgrib rely on the ecCodes codes_index_create() function? From my limited usage, it seems to perform well, handles heterogeneous files, and the index can be pre-generated using grib_index_build.
I agree
Is your feature request related to a problem? Please describe.
Very frequently, an index file created for a GRIB file is considered not valid by cfgrib when trying to open that GRIB file again in a future call. Users get the message "Ignoring index file {path} incompatible with GRIB file", without it being made clear what exactly caused the index file to be incompatible. In fact, in many if not most cases the index file would actually still be perfectly valid; it just isn't recognized as such. Furthermore, on calls to open_datasets this message can appear many times as multiple calls to dataset.py::open_fileindex occur, and every time it prompts a new in-memory index to be generated, greatly slowing down the process of opening the file.

Describe the solution you'd like
A change to the way the compatibility of the index file is considered, to be more accommodating. Ideally, index files remain valid if a GRIB file is moved, and regardless of whether the path is relative or absolute. The index file should also remain valid as long as the required index_keys are a subset of the index_keys present in the index file, and if the "errors" mode requested is at least as restrictive as the one of the index file.
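The errors-mode part could be expressed with an explicit restrictiveness ordering. The three mode names match cfgrib's errors parameter; the numeric ordering itself is an assumption made for this sketch:

```python
# Assumed ordering: "raise" is the most restrictive errors mode.
ERRORS_RESTRICTIVENESS = {"ignore": 0, "warn": 1, "raise": 2}


def errors_mode_compatible(requested, stored):
    """An index built under an errors mode at least as restrictive as the
    requested one can safely serve the request."""
    return ERRORS_RESTRICTIVENESS[stored] >= ERRORS_RESTRICTIVENESS[requested]
```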
Implementation details:
Currently, there are the following checks to consider the validity of an index file (in messages.py::FileIndex.from_indexpath_or_filestream): messages.py::ALLOWED_PROTOCOL_VERSION
Additionally, unless a specific index path is provided, the index path is created based on the path to the GRIB file plus the short_hash of the index_keys. This means that in the majority of cases, the equality of the index_keys is effectively already verified, and the re-use of index files is also limited to calls with equal index_keys.
Instead of checking equality of the FileStream instances, we should check:
The 'mtime' and index protocol version checks can likely remain as-is. However, if an index file is found at the expected location with an mtime earlier than the mtime of the GRIB file, that would imply that given this check, that index file will never be valid again. This means it would be better to just recreate it and overwrite the old index file, instead of skipping it and not storing a new, valid index file.
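As a sketch, that rebuild-and-overwrite flow might look like this (the two callables stand in for cfgrib's actual load and build logic in from_indexpath_or_filestream; the function name is hypothetical):

```python
import os


def load_or_rebuild_index(grib_path, indexpath, load_index, build_and_save_index):
    """Reuse the index only if it is at least as new as the GRIB file;
    otherwise rebuild it and overwrite the stale file, so subsequent
    calls do not fall back to an in-memory index again and again."""
    if os.path.exists(indexpath) and (
        os.path.getmtime(indexpath) >= os.path.getmtime(grib_path)
    ):
        return load_index(indexpath)
    return build_and_save_index(grib_path, indexpath)
```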
Additionally, and separate from the previously mentioned changes, the equality requirement of the index_keys could be relaxed. However, as this involves a potentially more breaking change, and is (probably?) a far less frequently occurring issue, this part is optional and could be disregarded:
Instead of checking the equality of the list of index keys, it should be checked if the currently required list is a subset of the index keys of the existing index file. More index_keys on the existing index just means it could differentiate more messages than required, but it can fully serve the needs of the current call.
To truly accommodate the use of such subsets of index keys, the default index path format would need to change, as it uses the hash of index_keys. This actually causes the greatest barrier to this change, as there is not clear replacement. It could be debated how often it really occurs that two index files are requested for the same file with differing sets of index_keys. This argument also directly argues against the necessity of a change related to index_keys.
A possible solution could be to drop the hash part of the index filename entirely, and just use {path}.idx as the default instead. Then, when during the check for a match of index_keys it is found that the current call requires a key that is not present on the existing index file, a new index file should be created with the union of the keys of the current call and the existing file. This new index file would then replace the existing file at {path}.idx, effectively removing the need to keep multiple index files for various sets of index_keys. If a user would, for whatever reason, like to keep separate index files based on different sets of index_keys, they can still get this behavior by passing the current default index path {path}.{short_hash}.idx as the index_path parameter, which would result in the same behavior as the current implementation even if the aforementioned relaxation of the index_keys requirements were implemented.

Describe alternatives you've considered
It would be ideal to be able to use a hash of the target GRIB file, instead of things like the path and the mtime. But hashing very large GRIB files would take a prohibitively long amount of time, making this approach infeasible.