Closed: sphuber closed this 2 months ago
@danielhollas @unkcpz I tried to implement something that would allow plugin developers to still control cache invalidation if their plugins change significantly, after we remove all version information from the hash calculation. However, it is not as easy as it seemed.
Essentially, we need to allow a plugin to define some class attribute that enters into the hash computation. This has to be stored on the node somehow, though, and cannot just be taken from the class when the node gets stored, because otherwise, when the hash of the node is recalculated in the future, it would take the current cache version of the plugin class, which may have been updated. Since this is important information, I didn't want to store this "cache version" in the extras, where it can get lost since they are mutable. So the attributes are the only remaining place. However, this has all sorts of knock-on consequences: up till now, AiiDA never automatically stored "internal" values in the attributes column. This assumption is now broken, which breaks various other parts of the code. For example, `Dict.get_dict()` will now include the automatically set `_aiida_cache_version` attribute. I can update the code to explicitly ignore this, but that is really a workaround and, as we can see, other tests are still failing due to this change. It is not clear to me yet whether this is acceptable, but it is a pretty significant departure from the status quo, so I am really hesitant. Not sure if there is another solution that I am missing for us to permanently store this cache version.
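The `Dict` problem can be illustrated with a minimal sketch in plain Python (this is not actual aiida-core code; `MockDict` and its methods are hypothetical stand-ins): an automatically stored internal attribute leaks into `Dict.get_dict()` unless the code is updated to explicitly ignore it.

```python
class MockDict:
    """Mimics a `Dict` data node whose content lives entirely in its attributes."""

    INTERNAL_PREFIX = '_aiida_'

    def __init__(self, value):
        self.attributes = dict(value)

    def store(self, cache_version=None):
        # If aiida-core automatically wrote the cache version on store, it
        # would "pollute" the attribute namespace owned by the plugin.
        if cache_version is not None:
            self.attributes['_aiida_cache_version'] = cache_version
        return self

    def get_dict_naive(self):
        # Status quo: the attributes *are* the content.
        return dict(self.attributes)

    def get_dict_filtered(self):
        # The workaround: explicitly ignore internal keys.
        return {key: value for key, value in self.attributes.items()
                if not key.startswith(self.INTERNAL_PREFIX)}


node = MockDict({'x': 1}).store(cache_version='1.0')
print(node.get_dict_naive())     # {'x': 1, '_aiida_cache_version': '1.0'}
print(node.get_dict_filtered())  # {'x': 1}
```

The filtered variant is exactly the kind of special-casing described above: every consumer of the attributes has to know about the internal key.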
We have had talks in the past about having "internal" attributes that are set only by `aiida-core`'s internals and never by plugins. This would require a significant change to the node model and the ORM. I'm not really keen on having to do that at this point, but maybe that is the only proper way of solving this issue. Thoughts?
Yeah, I realized late last night the hash recalculation will make things tricky :-( Just thinking, doesn't this also apply to the current design that stores aiida and plugin versions in the hash?
> Just thinking, doesn't this also apply to the current design that stores aiida and plugin versions in the hash?
It does. Because currently the `Node` class automatically adds the module version:

https://github.com/aiidateam/aiida-core/blob/1059b5f2d365e0f2ea4dea4d3c9343ce77829cfe/src/aiida/orm/nodes/caching.py#L67

which would be removed by #6215.
Note that for calcjobs, the version of both core and the plugin is also stored in the attributes. But since those are immutable, that doesn't suffer from the same problem when rehashing. But #6215 also proposes to remove this from the hash calculation.
> Note that for calcjobs the version of both core and plugin is also stored in the attributes. But since that is fixed, that doesn't suffer from the same problem when rehashing.
Sorry, I don't follow, can you expand on this part? If those versions are stored (under which key?) why is adding an extra field problematic?
> Sorry, I don't follow, can you expand on this part? If those versions are stored (under which key?) why is adding an extra field problematic?
The important part here is that the behavior is different for data and process nodes. The `Process` class will automatically store version information in the attributes of its `ProcessNode`, see:

https://github.com/aiidateam/aiida-core/blob/69389e0387369d8437e1219487b88430b7b2e679/src/aiida/engine/processes/process.py#L713

This is fine for process nodes, since those cannot be subclassed by plugins and so they cannot control what is stored in the attributes anyway. This is different for data plugins, though, where the attributes are essentially the one thing that plugins can control. As we have seen, the `Dict` node for example uses the attributes to store its content. If `aiida-core` now starts "polluting" this with internal attributes, that no longer gives full control to the plugin.
I see, that is subtle indeed. So if I understand correctly, the approach in this PR could work for Processes, because you could modify aiida-core to add it to ProcessNode attributes? Or does this have unintended side effects as well?
> I see, that is subtle indeed. So if I understand correctly, the approach in this PR could work for Processes, because you could modify aiida-core to add it to ProcessNode attributes? Or does this have unintended side effects as well?
Hmm, perhaps we could indeed only add it to `ProcessNode`s, because that is already more or less what we are doing. The problem is really only with `Data` nodes. I was focused on adding it there as well, because the `Data` implementation could also change. But I now remember that I already argued a few weeks ago that arguably the implementation for `Data` plugins is less important, and there ultimately the hash really is defined just by the data contents. So maybe adding this cache version just to process nodes is sufficient for our use case, and we avoid all the problems of polluting the `Data` attributes namespace.
The use case is clear to me now after a first review pass. Just one comment.
> I didn't want to store this "cache version" to the extras, since it can get lost, since they are mutable.
I think this can be a very straightforward implementation: the hash is already stored in the extras, so I don't see why it would be a problem to store the cache version in the extras as well.
> But I now remember that I already argued a few weeks ago that arguably the implementation for Data plugins is less important and there ultimately the hash really is defined just by the data contents.
Indeed, the direct purpose is to avoid the false positives that changes to `Process` input/output interfaces would bring.
> Yeah, I realized late last night the hash recalculation will make things tricky :-(
@danielhollas If I understand correctly, there will be no hash recalculation: if the plugin updates the cache version, new `ProcessNode`s will simply have a different hash and the old `ProcessNode`s become unreachable by new calculations. If the plugin version is rolled back, the old `ProcessNode`s become available again.
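This bump-and-rollback behavior can be sketched with a hypothetical helper (not the aiida-core implementation): because the cache version enters the hash, bumping it makes old nodes unreachable, and rolling it back makes them reachable again.

```python
import hashlib
import json

def compute_node_hash(inputs: dict, cache_version: int) -> str:
    # Serialize the hashed objects deterministically before hashing.
    payload = json.dumps({'inputs': inputs, 'cache_version': cache_version}, sort_keys=True)
    return hashlib.blake2b(payload.encode()).hexdigest()

inputs = {'x': 1}
hash_v1 = compute_node_hash(inputs, cache_version=1)        # hash of existing nodes
hash_v2 = compute_node_hash(inputs, cache_version=2)        # after the plugin bumps the version
hash_v1_again = compute_node_hash(inputs, cache_version=1)  # after rolling back

print(hash_v1 != hash_v2)        # True: old nodes no longer match
print(hash_v1 == hash_v1_again)  # True: rollback restores cache hits
```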
Yeah, that makes a lot of sense! I guess I don't even understand how caching works for `Data` nodes, or what the purpose of that would even be? I thought of caching as always tied to processes?
> By default this is set to `None` but a `Data` plugin can change it to some string value.
Any specific reason for this to be a string? An `int` seems simpler and more straightforward to think about; evicting the cache would simply be bumping it by +1.
> I think this can be a very straightforward implementation, the hash is already stored in the extra, I didn't see why it is a problem to not store the cache version to extra as well.
Well, there's a difference in consequences when these extras are dropped. If the hash is dropped, the node should simply not be available for caching (right?), so you'll get a false negative for a cache hit. But if the cache version somehow got dropped, then you could get false positive cache hits (but I guess only after the hash would also be recalculated?). These considerations do seem a bit academic. At the same time, adding it to `Process` attributes does not seem any more difficult in terms of implementation?
> @danielhollas If I understand correctly, there will be no hash recalculation
Sure, but it's possible that due to a DB migration introduced by `aiida-core`, we would ask users to recalculate their hashes manually (as we did last time, although whether that was necessary is up for discussion, I guess).
> Yeah, that makes a lot of sense! I guess I don't even understand how caching works for Data node, or what even would be a purpose of that? I thought of caching as always tied to Processes?
Data nodes are not really cached in that sense; they just have a hash, which is used in the computation of the process's hash.
> Any specific reason for this to be a string? An int seems simpler and more straightforward to think about, evicting the cache would simply be bumping it by +1.
No particular reason. Could even accept either. For now I haven't really thought about limitations/validation of this class attribute. First wanted to see if the generic concept would be possible.
> If I understand correctly, there will be no hash recalculation: if the plugin updates the cache version, new `ProcessNode`s will simply have a different hash and the old `ProcessNode`s become unreachable by new calculations. If the plugin version is rolled back, the old `ProcessNode`s become available again.
@unkcpz The problem here, as Daniel has pointed out, is that the public API (and the CLI as well) have methods to clear the hash from the extras. On top of that, the hash is just stored in the extras, which can be (accidentally) deleted. If at that point you have to recalculate the hash, you better make sure you have all the data that went into the original hash. So we need to permanently store this specific cache version (and cannot rely on the extras, as they can be accidentally removed).
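The rehashing pitfall can be made concrete with a plain-Python sketch (all names here are hypothetical): if the cache version is read from the currently installed plugin class when a hash is recomputed, a plugin upgrade silently changes the hash of already stored nodes, whereas a value frozen on the node at store time keeps the rehash stable.

```python
import hashlib
import json

def compute_hash(content: dict, cache_version: str) -> str:
    payload = json.dumps([content, cache_version], sort_keys=True)
    return hashlib.blake2b(payload.encode()).hexdigest()

class PluginV1:
    CACHE_VERSION = '1'

class PluginV2:  # the "same" plugin after an upgrade
    CACHE_VERSION = '2'

content = {'x': 1}

# At store time plugin v1 is installed; the version is frozen on the node.
frozen_version = PluginV1.CACHE_VERSION
original_hash = compute_hash(content, frozen_version)

# Later, plugin v2 is installed and the hash has to be recomputed.
rehash_from_class = compute_hash(content, PluginV2.CACHE_VERSION)  # wrong: differs
rehash_from_node = compute_hash(content, frozen_version)           # correct: stable
```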
I have updated the implementation to just change the behavior for `Process` plugins. They can set the `CACHE_VERSION` class attribute, which will get set on the `ProcessNode` in the `version.cache` attribute. For now I put it there as it integrated nicely with the other version information of core and the plugin. This would just force @unkcpz's implementation to change slightly, as he can now no longer drop the `version` attribute in its entirety from the objects that go into the hash, but the `version.cache` subkey needs to be kept. An alternative would be to not nest it in the `version` attribute, but have it top-level in `version_cache` or something like that.
I briefly went through the implementation and it mostly makes sense to me.
Now it got me thinking: should aiida-core use the same mechanism for invalidating caches, instead of forcing a DB migration? E.g. instead of putting the aiida-core version itself in the hash, have a CACHE_VERSION defined somewhere (effectively globally). A DB migration is a chore (I suspect), but the biggest issue is that it is not reversible.
> Now it got me thinking, should aiida-core use the same mechanism for invalidating caches, instead of forcing a DB migration? E.g. instead of putting a aiida version itself in the cache, have CACHE_VERSION defined somewhere (effectively globally). DB migration is a chore (I suspect), but the biggest issue is that it is not reversible.
Very interesting idea. I think this might actually work and would indeed get rid of having to add a database migration to reset the cache in case the implementation changes. I don't see any immediate problems with it, but as always, the devil is in the details with caching. I think the best thing would be to simply try it, so I will give it a shot and see what comes up.
Thought the idea over a bit. Although it should properly "invalidate" existing caches, since the hash of future processes with identical inputs will now be different due to the changed `CACHE_VERSION` attribute, the stored hash of existing nodes will remain and be incorrect. It is therefore feasible, albeit very unlikely, that a future process produces a hash that is identical to an existing one despite having different inputs. So I guess this approach might be pragmatically functional, but theoretically it is incorrect and leaves margin for false positives.
We might say that this theoretical problem is so unlikely that it doesn't really justify the inconvenience that migrations add. However, if we were to accept this solution, we would be back to the previous problem where we would have to store this global cache version somewhere on a node when it gets stored, so that its hash can later be recomputed. But with the current database layout, that could only be stored in the attributes, and that gives problems for the `Dict` class, and potentially other `Data` plugins. So we are back to square one.

I am therefore leaning towards just keeping database migrations to invalidate hashes if `aiida-core` significantly updates the hashing algorithm, and in this PR we add the capability for plugin developers to invalidate the hashes of process plugins (i.e. effectively `CalcJob`s).
pinging @danielhollas
Sorry, I'm at a conference this week. I need to think this through a bit; I'll try to reply on Wednesday if I am able, by Friday at the latest.
> Although it should properly "invalidate" existing caches since the hash of future processes with identical inputs will now be different, due to the changed CACHE_VERSION attribute, the stored hash of existing nodes will remain and be incorrect.
Hm, I am not sure I fully follow this scenario. What exactly do you mean by "incorrect" in this case? It seems to me that the hash will be consistent with the CACHE_VERSION that it was created with, no?
> It is therefore feasible, albeit very unlikely, that a future process can produce a hash that is identical to an existing one despite having different inputs.
This seems to describe a hash collision? Those should be vanishingly unlikely right? I mean you can get a collision regardless of this scheme by pure bad luck? But I might be misunderstanding what you're trying to say.
> However, if we were to accept this solution, we would go back to the previous problem where we would have to store this global cache version somewhere on a node when it gets stored so later its hash can be recomputed. But with the current database layout, that could only be stored in the attributes, and that gives problems for the Dict class, and potentially other Data plugins. So we are back to square one.
I thought that above we decided we don't really have to worry about the `Data` nodes, since their hash is by construction defined just by their contents? I guess you're right that in the case of aiida-core it might not be the same situation, if the core hashing algorithm is somehow changed.
> I am therefore leaning to just keeping database migrations to invalidate hashes if aiida-core significantly updates the hashing algorithm, and in this PR we add the capability for plugin developers to invalidate the hashes of process plugins (i.e. effectively CalcJobs).
I think for this PR let's go with that. We can always decide to rethink/improve the aiida-core side later, I guess.
I agree that hash collisions are always a risk, but since they are inherent to hashing algorithms and extremely unlikely, we shouldn't worry about them. However, what I was trying to sketch out is that hash collisions are to be expected and fine as long as your hashing algorithm doesn't change. But here, we were talking about the scenario where the hashing implementation in `aiida-core` changes. It is a bit more complicated in `aiida-core`, because our algorithm is composed of the actual hashing algorithm (currently blake2b) and the selection of data of the node that actually goes into the hash. So far we have only ever changed the latter part, but we could imagine changing the actual hashing algorithm as well.

What I was thinking in my comment above is that hash collisions are acceptable as long as you stay within the same algorithm. However, when we change the objects to hash, we are changing the algorithm, and then collision risks no longer follow the rules of the hashing algorithm that lies underneath, right?
Anyway, there is a more practical objection, I think. Let's go back to why we want to introduce some mechanism to invalidate hashes. The reason we have been removing hash extras through database migrations is not to prevent false positives, but rather to make sure that caching keeps working (we just didn't think that including the versions would fully undo this work 😅). If the hashing algorithm changes, the computed hash for a data node with attributes `{x}` would be different. A calculation that has that node as input will therefore not be cached, even though it should be, because the data is exactly the same. This is why we were resetting hashes: to make sure that old data nodes are properly matched with the new hashing algorithm. So the reset is not to invalidate the existing data node hashes, but to make sure they are up to date.
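The false-negative scenario can be sketched with hypothetical helpers: when the selection of objects that go into the hash changes, an unchanged data node's *stored* hash no longer matches the hash an identical new node would get, so caching silently stops working until the stored hashes are recomputed.

```python
import hashlib
import json

def hash_v1(attributes: dict) -> str:
    # Old scheme: every attribute goes into the hash.
    return hashlib.blake2b(json.dumps(attributes, sort_keys=True).encode()).hexdigest()

def hash_v2(attributes: dict) -> str:
    # New scheme: same data, but internal keys are now excluded from the hash.
    selected = {k: v for k, v in attributes.items() if not k.startswith('_')}
    return hashlib.blake2b(json.dumps(selected, sort_keys=True).encode()).hexdigest()

attributes = {'x': 1, '_internal': 'excluded under the new scheme'}
stored_hash = hash_v1(attributes)   # computed and stored before the upgrade
current_hash = hash_v2(attributes)  # what an identical new node hashes to now

print(stored_hash == current_hash)  # False: a cache miss despite identical data
```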
This is exactly the opposite motivation of the feature we are discussing adding here, which is to give control to calcjob plugins to invalidate their hash if their implementation changes such that the outputs are expected to change even for identical inputs.
Conclusion: I think we should keep using database migrations if `aiida-core` changes the hashing algorithm, to ensure that existing data nodes remain valid cache sources, but also provide a `CACHE_VERSION` attribute for `CalcJob` plugin writers to invalidate existing nodes.
> This is exactly the opposite motivation for the feature we are discussing adding here, which is to give control to calcjob plugins to invalidate their hash if their implementation changes such that the outputs are expected to change even for identical inputs.
That's a very good point. :+1: Agreed.
I am not sure what the state of the current PR is; feel free to ping me for review.
> I am not sure what is the state of the current PR, feel free to ping me for review.
I think the current summary of the situation is as follows:

- The hashes of existing nodes should be invalidated when `CalcJob` plugin implementations change significantly.
- A `CACHE_VERSION` class attribute is added to the `CalcJob` class that is added to the objects included in the hash. This attribute can be changed by plugin implementations to effectively reset the cache. (This PR)
- If the hashing algorithm in `aiida-core` is changed, a database migration is necessary to reset all hashes and recompute them. Without it, the existing cache would essentially be unusable.

This is currently implemented by PR #6215 and this PR. So I think this is ready for review.
I just thought of one more potential complication. Adding a cache version for the `CalcJob` class may not be sufficient. The outputs of a calcjob are really the generation of input files from input nodes through the `CalcJob`, as well as the creation of the outputs through the `Parser`. So far we never included the parser version separately, because the argument was that typically it would be part of the same plugin package as the `CalcJob`, and the package version was included in the hash calculation. This is now getting removed, but we are only putting the version of the `CalcJob` back in. I think we should add a similar attribute for the `Parser`.
> I think we should add a similar attribute for the Parser.
Yeah, unfortunate but reasonable.
At some point I was also thinking about whether we could somehow detect `Parser`/`CalcJob` changes automatically via some code introspection. But even if we were able to do that, it would kind of defeat the purpose, since even non-functional changes would affect the caching behaviour.
@danielhollas @unkcpz I have now implemented the final state of the discussion. It essentially adds the `CACHE_VERSION` class attribute to the `Process` and `Parser` classes. Plugin developers can now change this in their `CalcJob` and `Parser` plugins to reset the cache. If these versions are set, they will be written to the `CalcJobNode` attributes together with the version of `aiida-core` and the plugin package, i.e. `node.base.attributes.get('version')` would return something like:
```python
{
    'core': '2.6.0',
    'plugin': '3.0.0',
    'cache': 'some-value',    # Value of `CalcJob.CACHE_VERSION`
    'parser': 'other-value',  # Value of `Parser.CACHE_VERSION`
}
```
Some open questions/action points:

- `CACHE_VERSION` is currently added to the `Process` base class, but would really probably only be used by the `CalcJob` class. Technically, calcfunctions are also considered for caching, since they are calculation processes and not workflows, but there the process class `FunctionProcess` is created dynamically, so the developer cannot set the `CACHE_VERSION`. But then again they wouldn't have to, because the function source code is included in the hash calculation I think, so changing it would already invalidate the cache anyway. So maybe it would make more sense to move `CACHE_VERSION` to the `CalcJob`? Less "correct" in the generic sense, but certainly makes a whole lot more sense practically speaking.
- The attribute key `cache` for the `CalcJob.CACHE_VERSION` also makes little sense. If we agree to move the attribute from the `Process` to the `CalcJob` class, then just naming this `calc_job_cache_version` and changing `parser` to `parser_cache_version` would be clearest, right?
- Currently there is no type-checking on the `CACHE_VERSION`. I have type checked it as `str | None`. I don't see a reason why we should be restrictive in principle. Unless this unrestrictiveness actually makes it more difficult for the user to adopt a sensible versioning strategy? I am not sure what the best approach would be here.
- #6215 currently drops the entire `version` attribute dictionary from the hash calculation, but it would have to keep the `cache` and `parser` keys. That is easy enough to do though once we sign off on the concept.

> So maybe it would make more sense to move CACHE_VERSION to the CalcJob?
+1 If we realize we need it to be more generic, we can always lift it up the class chain.
> then just naming this calc_job_cache_version and changing the parser to parser_cache_version would be clearest, right?
Seems ok to me.
> Currently there is no type-checking on the CACHE_VERSION. I have type checked it as str | None. I don't see a reason why we should be restrictive in principle. Unless this unrestrictiveness actually makes it more difficult for the user to adopt a sensible versioning strategy? I am not sure what the best approach would be here.
I would vote to make it an `int` to make it clear that any versioning strategy other than "bump this integer by one" doesn't make sense, since the purpose of this version is essentially only to invalidate the cache (e.g. semver doesn't make sense). I don't have a strong opinion about whether we should type-check this at runtime.
> Currently that drops the entire version attribute dictionary from the hash calculation, but it would have to keep the cache and parser keys. That is easy enough to do though once we sign off on the concept.
I am wondering if it would be clearer to have a separate attribute `cache_version` or something like that. That would make it easier to see that we do not hash anything inside `version`, and the current code in #6215 would work as is and wouldn't need to get more complicated.
> I would vote to make it an int to make it clear that any versioning strategy other than "bump this integer by one" doesn't make sense since the purpose of this version is essentially only to invalidate the cache (e.g. semver doesn't make sense). I don't have a strong opinion about whether we should type-check this at runtime.
Done. For now there is no validation though. I am not sure we need it.
> I am wondering if it would be clearer to have a separate attribute "cache_version" or something like that. That would make it easier to see that we do not hash anything inside version, and the current code in https://github.com/aiidateam/aiida-core/pull/6215 would work as is and wouldn't need to get more complicated.
Very good point, I like it. I have implemented it.
@danielhollas this should be ready for final review. I have added the docs as well.
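The final design can be sketched with a minimal mock in plain Python (these are not the aiida-core classes; the `version_attributes` method and the `Mock*`/`My*` names are hypothetical): `CalcJob` and `Parser` expose an integer `CACHE_VERSION` that, when set, ends up in the node's `cache_version` attribute under the `calc_job` and `parser` keys.

```python
from typing import Optional

class MockParser:
    CACHE_VERSION: Optional[int] = None

class MockCalcJob:
    CACHE_VERSION: Optional[int] = None
    parser_class = MockParser

    def version_attributes(self) -> dict:
        """Return the cache-version attributes that would be stored on the node."""
        cache_version = {}
        if self.CACHE_VERSION is not None:
            cache_version['calc_job'] = self.CACHE_VERSION
        if self.parser_class.CACHE_VERSION is not None:
            cache_version['parser'] = self.parser_class.CACHE_VERSION
        return {'cache_version': cache_version} if cache_version else {}

# A plugin developer invalidates the cache simply by bumping the integers:
class MyParser(MockParser):
    CACHE_VERSION = 1

class MyCalcJob(MockCalcJob):
    CACHE_VERSION = 2
    parser_class = MyParser

print(MyCalcJob().version_attributes())    # {'cache_version': {'calc_job': 2, 'parser': 1}}
print(MockCalcJob().version_attributes())  # {}
```

Since `cache_version` is included in the hash, bumping either integer changes the hash of all future nodes produced with that plugin, which is exactly the reset behavior discussed above.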
Recently, the hashing mechanism was changed to remove all version information, both of `aiida-core` and of the plugin if applicable, from the hash computation. The reason was that with that information, each version change in core or a plugin would essentially reset the cache. By removing the version information, updating the code no longer invalidates the entire cache.

The downside of this change, however, is that when the implementation of a plugin, most notably `CalcJob` and `Parser` plugins, changes significantly, such that its outputs would change even for identical attributes/inputs, this is no longer reflected in the hash, which remains identical. This could lead to false positives when users update certain plugins.

To give some control to plugin developers in cases where the hash has to be effectively reset, the `CalcJob` and `Parser` classes now define the `CACHE_VERSION` class attribute. By default this is set to `None`, but it can be set to an integer by a plugin developer. At that point, it is stored in the `cache_version` attribute under the `calc_job` or `parser` key, respectively. Since this attribute is included in the hash calculation, changing the value effectively resets the cache for nodes generated with one of the plugins.

The concept could have been added to the `Node` base class to stay generic; however, this had complications for `Data` nodes. Since the version data would have to be stored in the attributes, in order to prevent the risk of it being mutated or lost, it would interfere with the actual data of the node. For example, the `Dict` node is entirely defined by its attributes, so AiiDA cannot store any data in that namespace. Luckily, not having this cache version information on nodes other than `CalcJobNode`s is not problematic, as the problem of inducing false positives by changing plugin code is really only relevant to `CalcJob` and `Parser` plugins.

~~Recently, the hashing mechanism was changed to remove all version information, both of `aiida-core` and of the plugin if applicable, from the hash computation. The reason was that with that information, each version change in core or a plugin would essentially reset the cache. By removing the version information, updating the code no longer invalidates the entire cache.~~

~~The downside of this change, however, is that when the implementation of a plugin, most notably a `CalcJob` plugin, changes significantly, such that its outputs would change even for identical attributes/inputs, this is no longer reflected in the hash, which remains identical. This could lead to false positives when users update certain plugins.~~

~~To give some control to plugin developers in cases where the hash has to be effectively reset, the `Node` class now defines the `CACHE_VERSION` class attribute. By default this is set to `None`, but a `Data` plugin can change it to some string value. To facilitate the same API for process plugins, such as `CalcJob`s, the `CACHE_VERSION` attribute is also added to the `Process` base class. This value is automatically added to the node's attributes under the key `_aiida_cache_version`, such that it is included in the computed hash. This way, a plugin developer can change this attribute in their plugin to force the hash to change.~~

~~The version has to be stored in the attributes of the node and cannot just be taken from the plugin's class attribute directly when computing the hash, because in that case, when the hash is recalculated in the future, a potentially different value of `CACHE_VERSION` could be used (whichever version of the plugin is installed at that time).~~
could be used (whichever version the plugin has that is installed at that time).~~