aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

Caching: Add `CACHE_VERSION` attribute to `CalcJob` and `Parser` #6328

Closed: sphuber closed this 2 months ago

sphuber commented 3 months ago

Recently, the hashing mechanism was changed to remove all version information, both of aiida-core and of the plugin if applicable, from the hash computation. The reason was that with that information included, each version change in core or in a plugin would essentially reset the cache. By removing the version information, updating the code no longer invalidates the entire cache.

The downside of this change, however, is that when the implementation of a plugin, most notably a CalcJob or Parser plugin, changes significantly enough that identical attributes/inputs would produce different content, this is no longer reflected in the hash, which remains identical. This could lead to false positives when users update certain plugins.

To give plugin developers some control in cases where the hash effectively has to be reset, the CalcJob and Parser classes now define the CACHE_VERSION class attribute. By default this is set to None, but it can be set to an integer by a plugin developer. If set, it is stored in the cache_version attribute under the calc_job or parser key, respectively. Since this attribute is included in the hash calculation, changing the value effectively resets the cache for nodes generated with one of the plugins.
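For illustration, a minimal sketch of how a plugin developer could use this; the plugin classes are hypothetical:

```python
from aiida.engine import CalcJob
from aiida.parsers import Parser


class MyCalcJob(CalcJob):
    """Hypothetical calculation job plugin."""

    # Bump this integer whenever the implementation changes such that
    # identical inputs would produce different outputs.
    CACHE_VERSION = 1


class MyParser(Parser):
    """Hypothetical parser plugin."""

    CACHE_VERSION = 1
```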

The concept could have been added to the Node base class to stay generic; however, this had complications for Data nodes. Since the version data would have to be stored in the attributes, to prevent the risk of it being mutated or lost, it would interfere with the actual data of the node. For example, the Dict node is entirely defined by its attributes, so AiiDA cannot store any data in that namespace. Luckily, not having this cache version information in nodes other than CalcJobNodes is not problematic, as the problem of inducing false positives by changing plugin code is really only relevant to CalcJob and Parser plugins.

~~Recently, the hashing mechanism was changed to remove all version information, both aiida-core and the plugin if applicable, from the hash computation. The reason was that with that information, each version change in core and plugin would essentially reset the cache. By removing the version information, updating the code would no longer invalidate all the cache.~~

~~The downside of this change, however, is that when the implementation of a plugin, most notably a CalcJob plugin, changes significantly such that a change in its attributes/inputs would really change its content, this would no longer be reflected in the hash, which would remain identical. This could lead to false positives when users update certain plugins.~~

~~To give some control to plugin developers in case of changes where the hash would have to be effectively reset, the Node class now defines the CACHE_VERSION class attribute. By default this is set to None but a Data plugin can change it to some string value. To facilitate the same API for process plugins, such as CalcJob's, the CACHE_VERSION attribute is also added to the Process base class. This value is automatically added to the node's attributes, under the key _aiida_cache_version such that it is included in the computed hash. This way, a plugin developer can change this attribute in their plugin to force the hash to change.~~

~~The version has to be stored in the attributes of the node and cannot just be taken from the plugin's class attribute directly when computing the hash, because in that case when the hash is recalculated in the future, a potentially different value of CACHE_VERSION could be used (whichever version the plugin has that is installed at that time).~~

sphuber commented 3 months ago

@danielhollas @unkcpz I tried to implement something that would allow plugin developers to still control cache invalidation if their plugins change significantly, after we remove all version information from hash calculation. However, it is not as easy as it seemed.

Essentially, we need to allow a plugin to define some class attribute that enters into the hash computation. This has to be stored on the node somehow, though, and cannot simply be read from the plugin class whenever the hash is computed, because then, when the hash of the node is recalculated in the future, it would use the plugin class's current cache version, which may have been updated since the node was stored. Since it is important information, I didn't want to store this "cache version" in the extras, as those are mutable and can get lost. So the attributes are the only remaining place.

However, this has all sorts of knock-on consequences: up till now, AiiDA never automatically stored "internal" values in the attributes column. This assumption is now broken, which breaks various other parts of the code. For example, Dict.get_dict() will now include the automatically set _aiida_cache_version attribute. I can update the code to explicitly ignore this, but that is really a workaround, and as we can see, other tests are still failing due to this change. It is not clear to me yet whether this is acceptable, but it is a pretty significant departure from the status quo, so I am really hesitant. I am not sure if there is another solution that I am missing that would let us permanently store this cache version.
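To make the Dict pollution concrete, a small illustration (assumes a configured AiiDA profile; the _aiida_cache_version key is the one from this PR's first iteration):

```python
from aiida import orm

node = orm.Dict(dict={'energy': -1.5})

# The content of a Dict node lives directly in its attributes:
assert node.get_dict() == {'energy': -1.5}

# If aiida-core also wrote an internal '_aiida_cache_version' attribute into
# that same namespace, get_dict() would have to strip it explicitly in order
# to keep returning only the user's data.
```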

We have had talks in the past about having "internal" attributes that are set only by aiida-core's internals and never by plugins. This would require a significant change to the node model and the ORM. I am not really keen on having to do that at this point, but maybe that is the only proper way of solving this issue. Thoughts?

danielhollas commented 3 months ago

Yeah, I realized late last night the hash recalculation will make things tricky :-( Just thinking, doesn't this also apply to the current design that stores aiida and plugin versions in the hash?

sphuber commented 3 months ago

Just thinking, doesn't this also apply to the current design that stores aiida and plugin versions in the hash?

It does. Because currently the Node class automatically adds the module version: https://github.com/aiidateam/aiida-core/blob/1059b5f2d365e0f2ea4dea4d3c9343ce77829cfe/src/aiida/orm/nodes/caching.py#L67

which would be removed by #6215
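Paraphrasing what that linked code does (names and details approximated, not a verbatim copy): the version of the top-level package that defines the node class is included in the objects that enter the hash, so every release of core or of a plugin changes every hash:

```python
import importlib


def _get_objects_to_hash(node):
    # Top-level package of the node class, e.g. 'aiida' or a plugin package.
    top_level_module = type(node).__module__.split('.', 1)[0]
    version = importlib.import_module(top_level_module).__version__
    # The version is hashed together with the node's actual content, so a
    # package release invalidates all existing hashes.
    return [version, node.base.attributes.all]
```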

Note that for calcjobs the versions of both core and the plugin are also stored in the attributes. But since attributes are immutable, that doesn't suffer from the same problem when rehashing. However, #6215 also proposes to remove this from the hash calculation.

danielhollas commented 3 months ago

Note that for calcjobs the versions of both core and the plugin are also stored in the attributes. But since attributes are immutable, that doesn't suffer from the same problem when rehashing.

Sorry, I don't follow, can you expand on this part? If those versions are stored (under which key?) why is adding an extra field problematic?

sphuber commented 3 months ago

Sorry, I don't follow, can you expand on this part? If those versions are stored (under which key?) why is adding an extra field problematic?

The important part here is that the behavior is different for data and process nodes. The Process class will automatically store version information in the attributes of its ProcessNode, see: https://github.com/aiidateam/aiida-core/blob/69389e0387369d8437e1219487b88430b7b2e679/src/aiida/engine/processes/process.py#L713

This is fine for process nodes, since those cannot be subclassed by plugins, so plugins cannot control what is stored in their attributes anyway. This is different for data plugins, though, where the attributes are essentially the one thing that plugins do control. As we have seen, the Dict node, for example, uses the attributes to store its content. If aiida-core now starts "polluting" this namespace with internal attributes, that no longer gives full control to the plugin.
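For concreteness, the stored version info of a process node can be inspected as follows (the pk and the values are illustrative):

```python
from aiida import orm

node = orm.load_node(1234)  # a CalcJobNode; the pk is hypothetical
print(node.base.attributes.get('version'))
# e.g. {'core': '2.6.0', 'plugin': '3.0.0'}
```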

danielhollas commented 3 months ago

I see, that is subtle indeed. So if I understand correctly, the approach in this PR could work for Processes, because you could modify aiida-core to add it to ProcessNode attributes? Or does this have unintended side effects as well?

sphuber commented 3 months ago

I see, that is subtle indeed. So if I understand correctly, the approach in this PR could work for Processes, because you could modify aiida-core to add it to ProcessNode attributes? Or does this have unintended side effects as well?

Hmm, perhaps we could indeed only add it to ProcessNodes, because that is already more or less what we are doing. The problem is really only with Data nodes. I was focused on adding it there as well, because the Data implementation could also change. But I now remember that I already argued a few weeks ago that the implementation for Data plugins is arguably less important, and that there the hash ultimately really is defined just by the data contents. So maybe adding this cache version just to process nodes is sufficient for our use case, and we avoid all the problems of polluting the Data attributes namespace.

unkcpz commented 3 months ago

The use case is clear to me now after a first pass of review. Just one comment.

I didn't want to store this "cache version" to the extras, since it can get lost, since they are mutable.

I think this can be a very straightforward implementation: the hash is already stored in the extras, so I don't see why it would be a problem to store the cache version in the extras as well.

But I now remember that I already argumented a few weeks ago that arguably the implementation for Data plugins is less important and there ultimately the hash really is defined just by the data contents.

Indeed, the direct purpose is to avoid the false positives that changes to a Process's input/output interface would bring.

Yeah, I realized late last night the hash recalculation will make things tricky :-(

@danielhollas If I understand correctly, there will be no hash recalculation: if the plugin updates the cache version, new ProcessNodes will simply have a different hash and the old ProcessNodes become unreachable by new calculations. If the plugin version is rolled back, the old ProcessNodes become available again.

danielhollas commented 3 months ago

Yeah, that makes a lot of sense! I guess I don't even understand how caching works for Data nodes, or what the purpose of that would even be? I thought of caching as always tied to Processes?

By default this is set to None but a Data plugin can change it to some string value.

Any specific reason for this to be a string? An int seems simpler and more straightforward to think about, evicting the cache would simply be bumping it by +1.

danielhollas commented 3 months ago

I think this can be a very straightforward implementation: the hash is already stored in the extras, so I don't see why it would be a problem to store the cache version in the extras as well.

Well, there's a difference in consequences when these extras are dropped. If the hash is dropped, the node should simply not be available for caching (right?), so you'll get a false negative for a cache hit. But if the cache version somehow got dropped, then you could get false positive cache hits (but I guess only after the hash would also be recalculated?). These considerations do seem a bit academic. At the same time, adding it to the Process attributes does not seem any more difficult in terms of implementation?

@danielhollas If I understand correctly, there will be no hash recalculation

Sure, but it's possible that, due to a DB migration induced by aiida-core, we would ask users to recalculate their hashes manually (as we did last time, although whether that was necessary is up for discussion, I guess).

sphuber commented 3 months ago

Yeah, that makes a lot of sense! I guess I don't even understand how caching works for Data nodes, or what the purpose of that would even be? I thought of caching as always tied to Processes?

Data nodes are not really cached in that sense, they just have a hash which is used in the computation of the process' hash.

Any specific reason for this to be a string? An int seems simpler and more straightforward to think about, evicting the cache would simply be bumping it by +1.

No particular reason. We could even accept either. For now I haven't really thought about limitations/validation of this class attribute; first I wanted to see if the generic concept would be possible.

If I understand correctly, there will be no hash recalculation: if the plugin updates the cache version, new ProcessNodes will simply have a different hash and the old ProcessNodes become unreachable by new calculations. If the plugin version is rolled back, the old ProcessNodes become available again.

@unkcpz The problem here, as Daniel has pointed out, is that the public API (and the CLI as well) has methods to clear the hash from the extras. And on top of that, the hash is just stored in the extras, which can be (accidentally) deleted. If at that point you have to recalculate the hash, you had better make sure you have all the data that went into the original hash. So we need to permanently store this specific cache version (and cannot rely on the extras, as they can be accidentally removed).
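To illustrate the fragility (the pk is hypothetical; _aiida_hash is the extras key that aiida-core uses to store the hash):

```python
from aiida import orm

node = orm.load_node(1234)  # pk is hypothetical

# The hash is just a regular, mutable extra and can simply be deleted:
node.base.extras.delete('_aiida_hash')

# Recomputing it later only reproduces the original value if everything that
# went into the original hash, including the cache version, is still available.
```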

sphuber commented 3 months ago

I have updated the implementation to only change the behavior for Process plugins. They can set the CACHE_VERSION class attribute, which will get set on the ProcessNode in the version.cache attribute. For now I put it there, as it integrates nicely with the other version information of core and the plugin. This just forces @unkcpz's implementation to change slightly: he can now no longer drop the version attribute in its entirety from the objects that go into the hash, but needs to keep the version.cache subkey. An alternative would be to not nest it in the version attribute, but have it top-level in version_cache or something like that.

danielhollas commented 3 months ago

I briefly went through the implementation and it mostly makes sense to me.

Now it got me thinking: should aiida-core use the same mechanism for invalidating caches, instead of forcing a DB migration? E.g. instead of putting the aiida-core version itself in the hash, have a CACHE_VERSION defined somewhere (effectively globally). A DB migration is a chore (I suspect), but the biggest issue is that it is not reversible.

sphuber commented 3 months ago

Now it got me thinking: should aiida-core use the same mechanism for invalidating caches, instead of forcing a DB migration? E.g. instead of putting the aiida-core version itself in the hash, have a CACHE_VERSION defined somewhere (effectively globally). A DB migration is a chore (I suspect), but the biggest issue is that it is not reversible.

Very interesting idea. I think this might actually work and would indeed get rid of having to add a database migration to reset the cache in case the implementation changes. I don't see any immediate problems with it, but as always, the devil is in the details with caching. I think the best thing would be to simply try it, so I will give it a shot and see what comes up.

sphuber commented 3 months ago

Thought the idea over a bit. Although it should properly "invalidate" existing caches since the hash of future processes with identical inputs will now be different, due to the changed CACHE_VERSION attribute, the stored hash of existing nodes will remain and be incorrect. It is therefore feasible, albeit very unlikely, that a future process can produce a hash that is identical to an existing one despite having different inputs. I guess then that this approach might be pragmatically functional but theoretically it is incorrect and leaves margin for false positives.

We might say that this theoretical problem is so unlikely that it doesn't really justify the inconvenience that migrations add. However, if we were to accept this solution, we would go back to the previous problem where we would have to store this global cache version somewhere on a node when it gets stored so later its hash can be recomputed. But with the current database layout, that could only be stored in the attributes, and that gives problems for the Dict class, and potentially other Data plugins. So we are back to square one.

I am therefore leaning to just keeping database migrations to invalidate hashes if aiida-core significantly updates the hashing algorithm, and in this PR we add the capability for plugin developers to invalidate the hashes of process plugins (i.e. effectively CalcJobs).

pinging @danielhollas

danielhollas commented 3 months ago

Sorry, I'm at a conference this week. I need to think this through a bit; I'll try to reply on Wednesday if I am able, by Friday at the latest.

danielhollas commented 3 months ago

Although it should properly "invalidate" existing caches since the hash of future processes with identical inputs will now be different, due to the changed CACHE_VERSION attribute, the stored hash of existing nodes will remain and be incorrect.

Hm, I am not sure I fully follow this scenario. What exactly do you mean by "incorrect" in this case? It seems to me that the hash will be consistent with the CACHE_VERSION that it was created with, no?

It is therefore feasible, albeit very unlikely, that a future process can produce a hash that is identical to an existing one despite having different inputs.

This seems to describe a hash collision? Those should be vanishingly unlikely right? I mean you can get a collision regardless of this scheme by pure bad luck? But I might be misunderstanding what you're trying to say.

However, if we were to accept this solution, we would go back to the previous problem where we would have to store this global cache version somewhere on a node when it gets stored so later its hash can be recomputed. But with the current database layout, that could only be stored in the attributes, and that gives problems for the Dict class, and potentially other Data plugins. So we are back to square one.

I thought that above we decided we don't really have to worry about the Data nodes, since their hash is by construction defined by their contents? I guess you're right that in the case of aiida-core it might not be the same situation, if the core hashing algorithm is somehow changed.

I am therefore leaning to just keeping database migrations to invalidate hashes if aiida-core significantly updates the hashing algorithm, and in this PR we add the capability for plugin developers to invalidate the hashes of process plugins (i.e. effectively CalcJobs).

I think for this PR let's go with that. We can always decide to rethink/improve on the aiida-core later I guess.

sphuber commented 3 months ago

I agree that hash collisions are always a risk, but since they are inherent to hashing algorithms and extremely unlikely, we shouldn't worry about them. However, what I was trying to sketch out is that hash collisions are to be expected and fine as long as your hashing algorithm doesn't change. But here we were talking about the scenario where the hashing implementation in aiida-core changes. It is a bit more complicated in aiida-core, because our algorithm is composed of the actual hashing algorithm (currently blake2b) and the selection of the node's data that actually goes into the hash. So far we have mostly (in fact only ever) changed the latter part, but we could imagine changing the actual hashing algorithm as well.

What I was thinking in my comment above is that hash collisions are acceptable as long as you stay within the same algorithm. However, when we change the objects to hash, we are changing the algorithm, and then the collision risks no longer follow the rules of the underlying hashing algorithm, right?
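As a toy sketch (plain Python, not the actual aiida-core implementation) of how a cache version included in the hashed objects guarantees a different hash for otherwise identical inputs:

```python
import hashlib
import json


def compute_hash(objects) -> str:
    # aiida-core also uses blake2b, but over its own canonical serialization;
    # json is used here purely for illustration.
    return hashlib.blake2b(json.dumps(objects, sort_keys=True).encode()).hexdigest()


inputs = {'x': 1, 'code': 'pw.x'}
old = compute_hash([inputs, {'cache_version': 1}])
new = compute_hash([inputs, {'cache_version': 2}])
assert old != new  # bumping the cache version resets cache matching
```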

Anyway, there is a more practical objection, I think. Let's go back to why we want to introduce some mechanism to invalidate hashes. The reason we have been removing hash extras through database migrations is not to prevent false positives, but rather to make sure that caching keeps working (we just didn't think that including the versions would fully undo this work 😅). If the hashing algorithm changes, the computed hash for a data node with attributes {x} would be different. A calculation that has that node as input would therefore not be cached, even though it should be, because the data is exactly the same. This is why we were resetting hashes: to make sure that old data nodes are properly matched under the new hashing algorithm. So the reset is not to invalidate the existing data node hashes, but to make sure they are up to date.

This is exactly the opposite motivation from the feature we are discussing adding here, which is to give calcjob plugins control to invalidate their hash if their implementation changes such that the outputs are expected to change even for identical inputs.

Conclusion: I think we should keep using database migrations if aiida-core changes the hashing algorithm to ensure that existing data nodes remain valid cache sources, but also provide a CACHE_VERSION attribute for CalcJob plugin writers to invalidate existing nodes.

danielhollas commented 3 months ago

This is exactly the opposite motivation from the feature we are discussing adding here, which is to give calcjob plugins control to invalidate their hash if their implementation changes such that the outputs are expected to change even for identical inputs.

That's a very good point. :+1: Agreed.

I am not sure what the state of the current PR is; feel free to ping me for review.

sphuber commented 3 months ago

I am not sure what the state of the current PR is; feel free to ping me for review.

I think the current summary of the situation is as follows:

  1. Including the version (core and plugin) in the hash computation makes caching effectively useless, so we remove it (PR #6215)
  2. Removing the version, however, runs the risk of false positives when CalcJob plugin implementations change significantly.
  3. Therefore an attribute is added to the CalcJob class that is added to objects included in the hash. This attribute can be changed by plugin implementations to effectively reset the cache. (This PR)
  4. Whenever the hashing algorithm in aiida-core is changed, a database migration is necessary to reset all hashes and recompute them. Without it, the existing cache would essentially be unusable.

This is currently implemented by PR #6215 and this PR. So I think this is ready for review.

sphuber commented 3 months ago

I just thought of one more potential complication. Adding a cache version to the CalcJob class may not be sufficient. The outputs of a calcjob are really the product of both the generation of input files from the input nodes by the CalcJob and the creation of the outputs by the Parser. So far we never included the parser version separately, because the argument was that it would typically be part of the same plugin package as the CalcJob, and the package version was included in the hash calculation. That is now being removed, but we are only putting the version of the CalcJob back in. I think we should add a similar attribute for the Parser.

danielhollas commented 3 months ago

I think we should add a similar attribute for the Parser.

Yeah, unfortunate but reasonable.

At some point I was also thinking about whether we could somehow detect Parser/CalcJob changes automatically via some code introspection. But even if we were able to do that, it would kind of defeat the purpose, since even non-functional changes would affect the caching behaviour.

sphuber commented 2 months ago

@danielhollas @unkcpz I have now implemented the final state of the discussion. It essentially adds the CACHE_VERSION class attribute to the Process and Parser classes. Plugin developers can now change this in their CalcJob and Parser plugins to reset the cache. If these versions are set, they will be written to the CalcJobNode attributes together with the versions of aiida-core and the plugin package, i.e. node.base.attributes.get('version') would return something like:

```python
{
    'core': '2.6.0',
    'plugin': '3.0.0',
    'cache': 'some-value',    # value of `CalcJob.CACHE_VERSION`
    'parser': 'other-value',  # value of `Parser.CACHE_VERSION`
}
```

Some open questions/action points:

  1. The CACHE_VERSION is currently defined on the Process class, even though it is really only relevant for calculation jobs. So maybe it would make more sense to move CACHE_VERSION to the CalcJob? If so, then just naming this calc_job_cache_version and changing the parser to parser_cache_version would be clearest, right?
  2. Currently there is no type-checking on the CACHE_VERSION. I have type checked it as str | None. I don't see a reason why we should be restrictive in principle. Unless this unrestrictiveness actually makes it more difficult for the user to adopt a sensible versioning strategy? I am not sure what the best approach would be here.
  3. Then there is #6215. Currently that drops the entire version attribute dictionary from the hash calculation, but it would have to keep the cache and parser keys. That is easy enough to do though once we sign off on the concept.

danielhollas commented 2 months ago

So maybe it would make more sense to move CACHE_VERSION to the CalcJob?

+1 If we realize we need it to be more generic, we can always lift it up the class chain.

then just naming this calc_job_cache_version and changing the parser to parser_cache_version would be clearest, right?

Seems ok to me.

Currently there is no type-checking on the CACHE_VERSION. I have type checked it as str | None. I don't see a reason why we should be restrictive in principle. Unless this unrestrictiveness actually makes it more difficult for the user to adopt a sensible versioning strategy? I am not sure what the best approach would be here.

I would vote to make it an int, to make it clear that any versioning strategy other than "bump this integer by one" doesn't make sense, since the purpose of this version is essentially only to invalidate the cache (e.g. semver doesn't make sense). I don't have a strong opinion about whether we should type-check this at runtime.

Currently that drops the entire version attribute dictionary from the hash calculation, but it would have to keep the cache and parser keys. That is easy enough to do though once we sign off on the concept.

I am wondering if it would be clearer to have a separate attribute "cache_version" or something like that. That would make it easier to see that we do not hash anything inside version, and the current code in #6215 would work as is and wouldn't need to get more complicated.

sphuber commented 2 months ago

I would vote to make it an int, to make it clear that any versioning strategy other than "bump this integer by one" doesn't make sense, since the purpose of this version is essentially only to invalidate the cache (e.g. semver doesn't make sense). I don't have a strong opinion about whether we should type-check this at runtime.

Done. For now there is no validation though. I am not sure we need it.

I am wondering if it would be clearer to have a separate attribute "cache_version" or something like that. That would make it easier to see that we do not hash anything inside version, and the current code in https://github.com/aiidateam/aiida-core/pull/6215 would work as is and wouldn't need to get more complicated.

Very good point, I like it. I have implemented it.
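The resulting node attributes would then look something like this (pk and values illustrative):

```python
from aiida import orm

node = orm.load_node(1234)  # a CalcJobNode; the pk is hypothetical
node.base.attributes.get('version')        # e.g. {'core': '2.6.0', 'plugin': '3.0.0'}
node.base.attributes.get('cache_version')  # e.g. {'calc_job': 1, 'parser': 1}
```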

sphuber commented 2 months ago

@danielhollas this should be ready for final review. I have added the docs as well.