aiidateam / aiida-pseudo

MIT License
5 stars 8 forks source link

Ignore version and filename when compute hash for the pseudo node #167

Closed unkcpz closed 7 months ago

unkcpz commented 8 months ago

A typical UpfData node has the following objects used to generate the hash:

['1.4.0',                                                                                                                                    
 {'md5': 'e96aa7e3e4a958db16958554ea960e2a',                                                                                                     
  'element': 'H',                                                                                                                                
  'filename': 'H.nc.z_1.oncvpsp3.dojo.v0.5.0-std.upf',                                                                                       
  'z_valence': 1},                                                      
 '2ad1f9704b3b12e280562806303ce715cbd9f4735c24b047199cbbcd47e168e8',    
 None] 

I think the filename and the version of the plugin are relevant to the calculation result thus need to add to _hash_ignored_attributes. Your opinion @sphuber ?

sphuber commented 8 months ago

The policy with caching has always been to err on the safe side. This is why the plugin version is included. It is unlikely that in general this will ever have an effect (since anyway all attributes and repo contents are included) but we decided to keep it in.

But I think I have to agree, that really all we care about for a UpfData node is the hash of the file. So really we could even think of overriding the _get_objects_to_hash and just return the md5. I am thinking hard whether there are edge cases where this could go wrong, but I am coming up blank. @mbercx do you see a potential problem with this?

mbercx commented 8 months ago

I am thinking hard whether there are edge cases where this could go wrong, but I am coming up blank. @mbercx do you see a potential problem with this?

I'm coming up blank as well. And having the plugin version used for the hash is kind of tedious, breaking caching when it shouldn't. So fine for me to change this!

unkcpz commented 8 months ago

I was thinking of adding to PseudoPotentialData, is that too aggressive? For safety, just add it to UpfData? If we agree the plugin version is too tedious here, maybe it can added to PseudoPotentialData and ignore the filename for UpfData. I can make a PR for it.

sphuber commented 8 months ago

I think the argument extends to all subclasses of PseudoPotentialData. There is a reason that it subclasses SinglefileData because (in principle) the file content should uniquely determine the pseudo. The metadata that we add in the attributes are just useful shortcuts to facilitate querying and proxy some common properties. They all come from the file, so their content is redundant.

I think that if we change this, we better change it for all to be consistent, and so put the change on PseudoPotentialData. And as mentioned before, instead of overriding _hash_ignored_attributes, I think we should simply override _get_objects_to_hash straight away and just have it return the hash of the file.

unkcpz commented 8 months ago

Besides the version of the plugin, I find a more worried issue that the calcjob node hash is changed if the aiida-core is updated. I think the version of the calcjob node should be ignored.

Here is a output of _get_objects_to_hash from a pw calculation. The hash is based on the aiida-core version that runs the calculation, which means everytime the aiida-core updated the rehash is needed, which is not expected I assume, if look at the operation needed in https://github.com/aiidateam/aiida-core/blob/main/CHANGELOG.md#v240---2023-06-22

['2.4.0',
 {'version': {'core': '2.4.0.post0', 'plugin': '4.4.0'},
  'withmpi': True,
  'resources': {'num_machines': 1, 'num_mpiprocs_per_machine': 36},
  'append_text': '',
  'parser_name': 'quantumespresso.pw',
  'prepend_text': '',
  'input_filename': 'aiida.in',
...
sphuber commented 8 months ago

Besides the version of the plugin, I find a more worried issue that the calcjob node hash is changed if the aiida-core is updated. I think the version of the calcjob node should be ignored.

You are right, the first element is the version of the package that defines the module of the node type, e.g., CalcJobNode and WorkChainNode. Since these can not be changed through plugins, this is always the version of aiida-core. It is a bit redundant actually, because it is anyway included in the version attribute. The reason these versions are included goes again back to what I mentioned before: when we implmented caching, we implemented it to be as conservative as possible. It looks like instead of monkeypatching things here in aiida-pseudo we should open a discussion with the aiida-core team to see if we can remove aiida-core and plugins versions from the hashes.

The hash is based on the aiida-core version that runs the calculation, which means everytime the aiida-core updated the rehash is needed, which is not expected I assume, if look at the operation needed in https://github.com/aiidateam/aiida-core/blob/main/CHANGELOG.md#v240---2023-06-22

Us forcing a database migration to drop the hashes in that case was different I would argue. There was a real bug in the calculation of the hash, which we fixed, but this would therefore invalidate the hash. The reason is not so much to make sure identical nodes are still matched, but rather that different nodes are not accidentally matched because the original hashes are incorrect. I guess that probability is really low.