Closed by unkcpz 7 months ago
The policy with caching has always been to err on the safe side. This is why the plugin version is included. It is unlikely that this will ever have an effect in practice (since all attributes and repo contents are included anyway), but we decided to keep it in.
But I think I have to agree that really all we care about for a `UpfData` node is the hash of the file. So we could even think of overriding `_get_objects_to_hash` and just returning the md5. I am thinking hard whether there are edge cases where this could go wrong, but I am coming up blank. @mbercx do you see a potential problem with this?
> I am thinking hard whether there are edge cases where this could go wrong, but I am coming up blank. @mbercx do you see a potential problem with this?
I'm coming up blank as well. And having the plugin version used for the hash is kind of tedious, breaking caching when it shouldn't. So fine for me to change this!
I was thinking of adding it to `PseudoPotentialData`, is that too aggressive?
For safety, just add it to `UpfData`? If we agree the plugin version is too tedious here, maybe it can be added to `PseudoPotentialData` and the filename ignored for `UpfData`.
I can make a PR for it.
I think the argument extends to all subclasses of `PseudoPotentialData`. There is a reason that it subclasses `SinglefileData`: in principle, the file content should uniquely determine the pseudo. The metadata we add in the attributes are just useful shortcuts to facilitate querying and to proxy some common properties. They all come from the file, so their content is redundant.
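To illustrate the redundancy argument: any attribute parsed from the file is a pure function of its bytes, so the md5 of the content already determines it. A minimal sketch with a made-up UPF-like snippet (the content and parsing here are illustrative, not the real UPF format or the aiida-pseudo parser):

```python
import hashlib

# Hypothetical UPF-like file content; the element is metadata parsed from it.
upf_content = b'<UPF version="2.0.1">\n  element="Si"\n</UPF>'

# The md5 of the raw file bytes.
md5 = hashlib.md5(upf_content).hexdigest()

# The element attribute is derived entirely from the content, so two files
# with the same md5 necessarily carry the same element: hashing the parsed
# attributes on top of the file hash adds no information.
element = upf_content.decode().split('element="')[1].split('"')[0]
```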
I think that if we change this, we had better change it for all of them to be consistent, and so put the change on `PseudoPotentialData`. And as mentioned before, instead of overriding `_hash_ignored_attributes`, I think we should simply override `_get_objects_to_hash` straight away and just have it return the hash of the file.
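The proposed override could look roughly like the sketch below. The method name `_get_objects_to_hash` comes from this discussion; the base class here is a mock stand-in, not the actual aiida-core implementation, so treat this as a shape of the change rather than working plugin code:

```python
import hashlib

class MockSinglefileData:
    """Mock stand-in for the aiida-core base class."""

    def __init__(self, content: bytes, attributes: dict):
        self._content = content
        self.attributes = attributes

    def _get_objects_to_hash(self):
        # Default behaviour: attributes (including versions) feed the hash.
        return [self.attributes, self._content]

class MockPseudoPotentialData(MockSinglefileData):
    def _get_objects_to_hash(self):
        # Proposed override: the file content uniquely determines the
        # pseudo, so return only the md5 of the file.
        return [hashlib.md5(self._content).hexdigest()]

# Same file stored under two plugin versions: identical objects to hash,
# so caching keeps matching across plugin upgrades.
a = MockPseudoPotentialData(b'pseudo bytes', {'version': {'plugin': '1.0.0'}})
b = MockPseudoPotentialData(b'pseudo bytes', {'version': {'plugin': '2.0.0'}})
```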
Besides the version of the plugin, I found a more worrying issue: the calcjob node hash changes whenever aiida-core is updated. I think the version of the calcjob node should be ignored as well.
Here is the output of `_get_objects_to_hash` from a pw calculation. The hash is based on the aiida-core version that ran the calculation, which means that every time aiida-core is updated a rehash is needed. I assume this is not expected, looking at the operations required in https://github.com/aiidateam/aiida-core/blob/main/CHANGELOG.md#v240---2023-06-22
```python
['2.4.0',
 {'version': {'core': '2.4.0.post0', 'plugin': '4.4.0'},
  'withmpi': True,
  'resources': {'num_machines': 1, 'num_mpiprocs_per_machine': 36},
  'append_text': '',
  'parser_name': 'quantumespresso.pw',
  'prepend_text': '',
  'input_filename': 'aiida.in',
  ...
```
> Besides the version of the plugin, I found a more worrying issue: the calcjob node hash changes whenever aiida-core is updated. I think the version of the calcjob node should be ignored as well.
You are right, the first element is the version of the package that defines the module of the node type, e.g., `CalcJobNode` and `WorkChainNode`. Since these cannot be changed through plugins, this is always the version of `aiida-core`. It is a bit redundant actually, because it is included in the `version` attribute anyway. The reason these versions are included goes back to what I mentioned before: when we implemented caching, we implemented it to be as conservative as possible. It looks like instead of monkeypatching things here in `aiida-pseudo`, we should open a discussion with the `aiida-core` team to see if we can remove the aiida-core and plugin versions from the hashes.
> The hash is based on the aiida-core version that ran the calculation, which means that every time aiida-core is updated a rehash is needed. I assume this is not expected, looking at the operations required in https://github.com/aiidateam/aiida-core/blob/main/CHANGELOG.md#v240---2023-06-22
Us forcing a database migration to drop the hashes in that case was different, I would argue. There was a real bug in the calculation of the hash, which we fixed, but the fix therefore invalidated the existing hashes. The reason was not so much to make sure identical nodes are still matched, but rather that different nodes are not accidentally matched because the original hashes were incorrect. I guess that probability is really low.
A typical `UpfData` node has the following objects used to generate the hash:
I think the filename and the version of the plugin are not relevant to the calculation result and should thus be added to `_hash_ignored_attributes`. Your opinion @sphuber?