aiidateam / aiida-core

The official repository for the AiiDA code
https://aiida-core.readthedocs.io

Caching: ph process failed if previous pw.x process finished_ok and cached but remote folder cleaned. #5178

Open unkcpz opened 2 years ago

unkcpz commented 2 years ago

Describe the bug

This problem happens when chaining the PwBaseWorkChain and PhBaseWorkChain in a workflow, using the QE plugin with caching enabled for aiida.calculations:quantumespresso.pw. The quantumespresso.ph CalcJob requires the remote folder of a quantumespresso.pw run as input, but when that pw process finishes from the cache and its remote folder has already been cleaned, the quantumespresso.ph calculation keeps failing since it cannot find the saved files it needs for the calculation.
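For reference, a minimal sketch of one way to enable caching for just the pw entry point from a Python script; this assumes the aiida.manage.caching context manager, which only affects processes run in the current interpreter (daemon-submitted processes need the profile-wide caching configuration instead):

from aiida.manage.caching import enable_caching

# Sketch: enable caching only for the quantumespresso.pw entry point while
# running processes from this interpreter.
with enable_caching(identifier='aiida.calculations:quantumespresso.pw'):
    pass  # e.g. launch the workchain described below here with aiida.engine.run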

Steps to reproduce

Steps to reproduce the behavior:

This is my workflow code to calculate the phonon frequencies, which runs an SCF calculation and sets scf_remote_folder for the next ph process step.

class PhononFrequenciesWorkChain(WorkChain):
    """WorkChain to calculate phonon frequencies of the input structure."""

    @classmethod
    def define(cls, spec):
        ...

    ...

    def inspect_scf(self):
        """Inspect the result of the scf calculation; if it failed, do not continue."""
        workchain = self.ctx.workchain_scf

        if not workchain.is_finished_ok:
            self.report(
                f'PwBaseWorkChain failed with exit status {workchain.exit_status}'
            )
            return self.exit_codes.ERROR_SUB_PROCESS_FAILED_SCF

        try:
            self.ctx.scf_remote_folder = workchain.outputs.remote_folder
        except NotExistentAttributeError:
            return self.exit_codes.ERROR_NO_REMOTE_FOLDER

    def run_ph(self):
        """Set the inputs and submit the ph calculation to get the quantities for the phonon evaluation."""
        inputs = {
            'metadata': {
                'call_link_label': 'PH'
            },
            'ph': {
                'code': self.inputs.ph_code,
                'qpoints': self.ctx.qpoints,
                'parameters': orm.Dict(dict=self.ctx.ph_parameters),
                'parent_folder': self.ctx.scf_remote_folder,
                'metadata': {
                    'options': self.ctx.options,
                },
                # CMDLINE setting for parallelization
            },
        }

        running = self.submit(PhBaseWorkflow, **inputs)
        self.report(f'Running ph calculation pk={running.pk}')
        return ToContext(workchain_ph=running)

Expected behavior

The caching mechanism should also check the remote folder: if it finds that the required remote folder has been cleaned, it should not use the cached process but rerun the previous calculation.

Your environment

Other relevant software versions, e.g. Postgres & RabbitMQ

Additional context

sphuber commented 2 years ago

The caching mechanism will also detect the remote folder, if it finds the required remote folder is cleaned

This check may not be so trivial to actually implement. Even if there are still some files, they may no longer be complete or may have been altered. So you could do a simple check for whether the folder is empty, but you would still miss cases where only some files have been deleted. Still, this may already catch many cases and prevent the problem you describe, so this is certainly something we can consider. The only downside I see is that this check requires opening a transport, which brings the usual problem of making sure it is properly pooled (which is not trivial) and will of course incur a significant overhead, slowing everything down. We should analyze whether this is worth baking in, or whether we put the onus of all of this on the user. If they know that a calculation is no longer a valid cache because its remote has been cleaned, they can either mark it as non-cacheable or temporarily disable caching while relaunching the workchain.

do not use the caching process but rerun the previous calculation.

This is also problematic. I think it is impossible to write generic code in aiida-core that, for any workchain running a CalcJob that relies on inputs which are outputs from another CalcJob, reruns that preceding CalcJob if its remote_folder is "dirty". This would require dynamically inserting new steps into the outline logic. If we were to add support in the caching mechanism for a notion of a remote_folder being dirty, then I think we should put that on the level of the calculation being cached. So in this example, when the PwCalculation is being submitted, the caching mechanism should already check at that point whether the remote_folder is still intact, and if not, conclude that the node is no longer a valid cache source.

unkcpz commented 2 years ago

Thanks @sphuber !

The problem comes from the fact that the caching mechanism doesn't know about updates to the remote folder. So there would be two main ways to work around this. First, check the remote_folder before using the cache. Second, for this case, maybe I can change the hash of the process when I do the clean(). I can put it in my workflow, and if this makes sense, maybe we can implement this in the _clean() method and always do it after cleaning the remote folder.

sphuber commented 2 years ago

First, check the remote_folder before calling caching.

Sure, but as I said, what are you going to check for? Just checking whether it is empty is not sufficient

Second, for this case, maybe I can change the hash of the process when I do the clean(). I can put it in my workflow, and if this makes sense, maybe we can implement this to _clean() method and always do this after cleaning the remote folder.

Also here, this will "fix" the case where we clean the remote from AiiDA, but typically, since these remote folders live on a scratch space, they will be automatically cleaned outside of AiiDA as well, and so then we cannot rely on the remote_folder being tagged somehow.

What I am saying is that in specific use cases there are workarounds to fix it, but I don't see an easy way forward for a generic fix.

unkcpz commented 2 years ago

First, check the remote_folder before calling caching.

Sure, but as I said, what are you going to check for? Just checking whether it is empty is not sufficient

Yes, I totally agree with you. That's why I propose the following workaround.

Also here, this will "fix" the case where we clean the remote from AiiDA, but typically, since these remote folders live on a scratch space, they will be automatically cleaned outside of AiiDA as well, and so then we cannot rely on the remote_folder being tagged somehow.

Sure, there are other ways outside AiiDA in which the remote folder can be cleaned. But remote_folder._clean() will always change the remote folder, so it is safe to say that it changes the process. My point is that this can be a generic fix for the _clean() method: every time we call _clean() we change the input remote_folder hash so the process is not used for caching. The same code can be included in other operations that definitely modify the remote folder.

sphuber commented 2 years ago

Every time we call clean() we change the input remote_folder hash so the process is not cached. The same code can be included in other operations which definitely modified the remote folder.

I think what you mean is that if the remote_folder output node of a CalcJobNode is cleaned, then it should change (probably simply remove, as that is currently the unofficial way of invalidating a process node as a valid cache node) the hash of the CalcJobNode and not of the RemoteData node. When considering whether a calculation should be cached from, only its inputs are taken into account, not its outputs. Or am I misunderstanding you?

unkcpz commented 2 years ago

I think what you mean is that if the remote_folder output node of a CalcJobNode is cleaned, then it should change (probably simply remove, as that is currently the unofficial way of invalidating a process node as a valid cache node) the hash of the CalcJobNode and not of the RemoteData node.

Yes, that is what I mean. We should have a way to mark a CalcJobNode as 'unable to be used as a caching node' if its RemoteData node is cleaned.

Every time we call clean() we change the input remote_folder hash so the process is not cached

I was wrong about this, remote_folder is a RemoteData output node. Thanks for clarifying this.

giovannipizzi commented 2 years ago

As I mentioned earlier in the meeting, I think the best approach is that, if you are concerned by this, you should have some periodic check that goes into the relevant folders and invalidates the cache for the parent CalcJob. I think this amounts to removing the _aiida_hash extra from the node. The thing we are missing on the AiiDA side, and which we can add, is a Node API to do this (so that one does not go fiddling with the _aiida_* extras, whose behaviour might change in the future). So this would be good to add.

Regarding the cleaning, indeed maybe when one cleans a RemoteData node one could also call the cache invalidation of the parent; I don't know if it's better to do this at the AiiDA level (maybe an optional parameter invalidate_parent_cache=True in the clean() method, that will do nothing if there is no CalcJob caller for the RemoteData), and/or to suggest doing this in the various plugins that implement cleaning logic.
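A minimal sketch of that invalidation workaround, assuming the _aiida_hash extra mentioned above and the plain extras API (pending an official method):

from aiida import orm

remote_folder = orm.load_node(1234)  # RemoteData of the calculation to clean (pk is illustrative)
remote_folder._clean()

# Drop the hash extra of the creating CalcJob so it is no longer picked up as a
# cache source; this is the fiddling with _aiida_* extras referred to above.
creator = remote_folder.creator
if creator is not None and creator.get_extra('_aiida_hash', None) is not None:
    creator.delete_extra('_aiida_hash')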

sphuber commented 2 years ago

See #5189 for adding official API to disable a node for caching

unkcpz commented 2 years ago

The other problem occurs when I rerun the workchain with caching off and do not do the clean. After turning caching back on, what I expect is that the new node will be used as the cache source, but the caching mechanism still uses the old invalid cached node rather than the latest one.

So another 'good to have' feature for this case would be that when there is more than one cached node (all with the same inputs) that can be used, the latest one is always used. I think this would be a more generic feature.

giovannipizzi commented 2 years ago

I see. Question to discuss: is using the latest node always the 'right' thing to do? (Maybe yes, but happy to get feedback, e.g. from @greschd). Maybe this should be configurable? (I mean in 'verdi config', per profile, where maybe the default is 'pick the latest' but you have options to e.g. pick the oldest, if we think there's a usecase for that).

Otherwise, one other thing to do is to define a function to search for all nodes that have the same hash and remove the hash key for those that match certain rules (e.g.: remove the hash from all duplicates except the newest?) Or this could create problems (maybe we should do this only for CalcJob nodes and not for Data nodes?).
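A rough sketch of such a helper, restricted to CalcJobNode as suggested; the _aiida_hash extra name and the "keep only the newest" rule are taken from the discussion above, everything else is illustrative:

from aiida import orm

def keep_only_newest_as_cache(hash_value: str):
    """For all CalcJobNodes sharing the given hash, drop the hash extra from all
    but the most recent one, so only the latest remains a valid cache source."""
    query = orm.QueryBuilder()
    query.append(
        orm.CalcJobNode,
        filters={'extras._aiida_hash': hash_value},
        project='*',
        tag='calc',
    )
    query.order_by({'calc': {'ctime': 'desc'}})
    nodes = [row[0] for row in query.all()]
    for node in nodes[1:]:  # nodes[0] is the newest and keeps its hash
        node.delete_extra('_aiida_hash')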

greschd commented 2 years ago

I think the "logical" place to control this would be the is_valid_cache property, see https://aiida.readthedocs.io/projects/aiida-core/en/latest/internals/engine.html#controlling-caching

The main question is: should all calculations which have a "disappeared" RemoteData be considered invalid w.r.t. caching, or does the plugin need some control over that?

The main reason this currently can't be controlled by the plugin is because all calculations have the same node type[1] -- but it might make sense to add a hook with which CalcJob classes can modify this behavior. Either specific to this problem (analogous to the exit codes "invalidates cache" functionality), or by calling out to a method implemented on the plugin CalcJob.
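For reference, the exit-code mechanism mentioned above already exists; in a plugin it looks roughly like this (the class name and exit-code details are illustrative):

from aiida.engine import CalcJob

class SomePluginCalculation(CalcJob):

    @classmethod
    def define(cls, spec):
        super().define(spec)
        # An exit code flagged with invalidates_cache=True means that a node
        # failing with it is no longer considered a valid cache source.
        spec.exit_code(
            310,
            'ERROR_OUTPUT_FILE_MISSING',
            invalidates_cache=True,
            message='The expected output file was not found.',
        )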

ramirezfranciscof commented 2 years ago

I remember running into this problem and spending a lot of time thinking about it without being able to find any sort of silver bullet solution. I guess the main issue can be summarized as follows: relying on a RemoteData node (which is one of the few data node types in AiiDA whose integrity is not guaranteed) for chaining calcjobs within a workflow makes the concept of "cache-validity" relative to conditions outside of the cached calcjob. A few unfinished ideas that came to mind and may be relevant:

(0) Besides implementing possible working solutions, it would already be very helpful if there was a way to include warnings and error messages that would allow the user to realize what the problem is. I remember part of what drove me crazy was that I spent a lot of time and manual manipulation until I realized AiiDA wasn't finding my files because of this (this wasn't even on my mind since I had enabled caching a long time before running into that problem). Maybe something that, whenever you get an error in the copy step, checks if the input has some RemoteData that comes from a cached node or something like this, and mentions this in the error log.

(1) I think we need some method(s) to check the integrity of RemoteData nodes. Something that could work maybe is to make these sealable and, at the time of sealing (which would be when the calcjob gets sealed), a list of files within the folder + hashes for the content of each file are stored within the node.

(2) It is only at the moment of submitting the new processes that one knows which calcjobs are going to need RemoteData nodes. Thus it would be nice, instead of having to say "this old node is invalid", to be able to say "this new run I want to do can't be cached" (either at all, or if the integrity check of the RemoteData fails). Does something like this currently exist? (without using global cache settings, I mean)

(3) If we have a workflow that goes CJ1 -> RemoteData -> CJ2 and both CalcJobs CJ1 and CJ2 already exist, then the "whole process" should be considered cached. But within the workflow, if we have the previous feature 2 and we selected CJ1 as a must-redo, then we are running at least that CalcJob unnecessarily. Moreover, I'm not sure what would happen to CJ2: would it also run again, or would it be taken from the cached node (i.e. does a different ParentFolder input imply a different hash?). For this, perhaps it would also be good if the workflow designer could use a get_cached_node that would allow them to do something like:

def get_cached_calls():

    CJ1_builder = ...

    CJ1_oldnode = get_cached_node(CJ1_builder)
    if CJ1_oldnode is None:
        return None

    CJ2_builder = ...  # set inputs from CJ1_oldnode

    CJ2_oldnode = get_cached_node(CJ2_builder)
    if CJ2_oldnode is None:
        return None

    return (CJ1_oldnode, CJ2_oldnode)


if get_cached_calls() is not None:
    # run workchain without passing the feature 2 option to force re-runs and all will be cached
    pass
else:
    # run workchain passing the feature 2 option to force re-runs on all
    pass

(4) The previous point 3 is basically running the workflow caching everything until the point where you can't cache one of the calls, in which case you go back and start re-running everything. In principle this could be automated by AiiDA without the user needing to write the method themselves since it only "imitates" the rest of the workchain. Moreover, using the sub-process call structure and some sort of recursive logic, it could even be possible to identify and separate branches of subprocesses that are ok to be re-used (since no RemoteData is needed from them) and branches that need to be re-run.

All of these points I say very lightly, but I realize each of them is actually very complex (especially 4). But it is a complex problem itself, and an important one: otherwise, I think the usefulness of caching will be reduced dramatically as workflows get bigger if there is no way to somehow be more "context-aware" when deciding if something should be re-used or re-run.

unkcpz commented 2 years ago

Thanks @ramirezfranciscof @greschd @giovannipizzi for all the inspirational ideas. All of them deserve looking into.

I just opened a minor PR https://github.com/aiidateam/aiida-core/pull/5197 to add the feature of caching from the latest CalcJobNode. I think that is kind of a different issue (a workaround for this issue) and better discussed separately. I would be very glad if you could have a look at it.

greschd commented 2 years ago

Great points @ramirezfranciscof!

(1) yes, absolutely we need a way to check for RemoteData integrity

(2) there is a context manager to temporarily switch caching off (a sketch follows after these points), but that is a very limited solution

(3)+(4) I think that's really the crux of the problem: if we want "perfect" caching behavior with RemoteData, it's a problem that cannot be decided at the time when a calculation is (or isn't) cached. It's a function of the downstream processes whether the RemoteData will be used again. Some way of "backtracking" seems the only way to properly address this.
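A sketch of the context manager referred to in point (2), assuming the aiida.manage.caching API and the pw entry point from this issue; note that it only affects processes run in the current interpreter, not daemon workers:

from aiida.manage.caching import disable_caching

# Temporarily switch caching off for the pw entry point only, e.g. while
# relaunching a workchain whose cache source has a cleaned remote folder.
with disable_caching(identifier='aiida.calculations:quantumespresso.pw'):
    pass  # launch the workchain here with aiida.engine.run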

otherwise, I think the usefulness of caching will be reduced dramatically as workflows get bigger if there is no way to somehow be more "context-aware" when deciding if something should be re-used or re-run.

I think one way to somewhat mitigate this with the current code is to design workflows such that there aren't long "chains of RemoteData". Self-contained parts which pass data along as RemoteData can still be cached as a whole, as long as all that the downstream processes consume is the persisted data. To me, this makes intuitive sense: if the data is going to be used by a far-away process, it should be persisted. If we imagine solution (4) being implemented, it would still be a good design decision: it limits how many calculations would need to be re-done if a RemoteData that is being used has vanished.

does a different ParentFolder input imply a different hash

yes (haven't checked now, but from memory that's what it should do)

sphuber commented 2 years ago

Just wanted to say that any sort of code that is going to try and look ahead in the call stack to see which sub processes will be cacheable is essentially not possible. Forgetting the complexity of having to parse arbitrary Python code within the workchain outline logic to see exactly what is being submitted with what inputs, it will probably be impossible to even know exactly what inputs will be used (therefore making it impossible to compute the hash), as the inputs are often created during the workflow, either as a result of a previous process or created inline. I don't want to be pessimistic, but I really doubt this will ever be feasible. The less ambitious plan of running the workflow with caching and backtracking when needed is a lot more feasible, but still extremely challenging.

Instead, I would focus more on the simpler solutions suggested:

greschd commented 2 years ago

This can then be used to see if the hierarchy is intact and on top of that check that the hashes of the files are still the same. My main worry is whether the overhead of this checking would be too big for it to be running by default and so maybe it should be something to be turned on explicitly for those who need it.

In general, checking the hierarchy should be much cheaper than computing a hash over what are often very large files. So maybe hierarchy checking can be turned on by default -- although you still have a round-trip to the remote computer in there. Since this happens when deciding if an entire calculation should be launched, it seems ok to me.
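A rough sketch of what such a hierarchy check could look like, using the existing RemoteData.listdir() call and a previously stored flat file list (which aiida-core does not record today, so expected_files is an assumption):

from aiida import orm

def remote_hierarchy_intact(remote_data: orm.RemoteData, expected_files) -> bool:
    """Return False if the remote folder is gone or previously recorded files are missing.

    Listing the directory opens a transport, so this still costs a round trip to
    the remote computer, but it avoids hashing potentially very large files.
    """
    try:
        present = set(remote_data.listdir())
    except (IOError, OSError):
        return False  # the folder itself has disappeared
    return set(expected_files).issubset(present)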

Maybe we can add an input to the metadata of processes that can also specify these caching configuration overrides.

I like that idea, but wonder if we then might need to add different kinds of configuration to really make it useful. For example excluding a specific calculation, or turning on/off the aforementioned RemoteData check. That would also allow setting defaults for this behavior in workflows. A user that doesn't want to deal with these issues (and is willing to pay some compute time for it) could globally turn on the RemoteData check.

chrisjsewell commented 2 years ago

I like the idea of storing a representation of the contents of a RemoteData after the creating CalcJob has finished. With the new repository implementation, we could simply use the repository_metadata column to store the virtual hierarchy.

I don't think we should override the default behaviour of RemoteData, I feel this is feature creep for a mainly "power-user" use-case:

  1. It is perfectly feasible for there to be many 1000s of files generated in a remote folder. If we now have to store all this path information for every computation, it is going to add a lot of unnecessary bloat to the database for most use cases.
  2. Definitely do not use repository_metadata; that is intrinsically linked to the repository and treated as such by maintenance and export/import functions.

I could certainly envisage a CachedRemoteData class, which can be explicitly used for this purpose by workchains that know they need this feature, although I haven't given any thought yet to how it would be generated and the interplay between calculation and workchain. I could also envisage some way "up-front" to specify which files you actually want to cache, rather than blanket caching every file/folder.

sphuber commented 2 years ago

It is perfectly feasible for there to be many 1000s of files generated in a remote folder. If we now have to store all this path information for every computation, it is going to add a lot of unnecessary bloat to the database for most use cases.

This is the concern I also expressed, that for the problem of caching I can see this as currently the best way forward, but I am not sure if we should make this the default behavior because of the probably significant overhead, both in terms of processing power and storage.

Definitely do not use repository_metadata; that is intrinsically linked to the repository and treated as such by maintenance and export/import functions.

Maybe indeed not the best idea. What I was mostly thinking was to reuse the structure of the repository_metadata, as in the dictionary form of virtual hierarchy with (optionally) hashes of file content. We could decide to store it as an attribute.

chrisjsewell commented 2 years ago

yep, just wanted to highlight the storage overhead (on top of the processing one) 👍

ramirezfranciscof commented 2 years ago

To address any performance or storage issue we could just have a default behavior that stores a hash of the file hierarchy instead of the hierarchy itself. We may later want to introduce more detailed options (such as creating a hash with the file structure + file content, or having different hashes for each file), but I think these are not critical for what we need here at this point. Bah, except maybe we would want the option to enable saving the whole hierarchy just for doing sanity checks when debugging something, since I can already see silly problems being almost impossible to detect if you only have a hash to compare...

Just wanted to say that any sort of code that is going to try and look ahead in the call stack to see which sub processes will be cacheable is essentially not possible. Forgetting the complexity of having to parse arbitrary Python code within the workchain outline logic to see exactly what is being submitted with what inputs, it will probably be impossible to even know exactly what inputs will be used (therefore making it impossible to compute the hash), as the inputs are often created during the workflow, either as a result of a previous process or created inline. I don't want to be pessimistic, but I really doubt this will ever be feasible. The less ambitious plan of running the workflow with caching and backtracking when needed is a lot more feasible, but still extremely challenging.

Yeah, I agree with this, my item (3) was just an example of how someone could manually try to fix this in their own workchains (+ good practices in "workchain modularization", like @greschd mentioned) if we had some base tools (a way to check RemoteData integrity, a way to know if a process can be stashed, a way to force re-running in specific submits). But probably not a model for how to design a more automated solution (for which I concur that we could leverage some of the already existing infrastructure).

giovannipizzi commented 2 years ago

Thanks for the very interesting discussion.

Personally, because of the concerns expressed above, I would try to avoid making this very complex.

And I would avoid storing which files are in the remote data unless we're sure this can really help (BTW I agree we shouldn't store them in the repository_metadata, at worst with the same structure as attributes [even if we need to define updatable attributes for data nodes then, as these might change over time] or just as extras [might be the best approach]).

But I would still avoid implementing this at this stage:

In the end, the decision should be plugin and user specific, and I wouldn't make it at runtime. My suggestion is to discuss first what would be the workflow for a user, and implement easy things that can cover 90% of the use cases but can be implemented easily, rather than making the perfect solution that it's too complex to implement.

For instance, we should already think about whether removing the hash when folders are cleaned can help in most cases.

Or creating helper functions and methods that e.g.:

  1. given some inputs, don't submit, but just return all possible calculations that are equivalent, and maybe some info on which remote data nodes would need to be checked
  2. this can be paired with some functions provided by the plugins, that can validate if a remotedata is still valid (e.g. check if certain files are there)
  3. the pairing can make sure that it checks all remotedata folders in a single connection, and cleans the caching when it discovers they are not valid anymore.
  4. also, the first point of this itemised list could work in batch mode (e.g. returning a list of remote data to check, but instead of giving the inputs of a calculation, you give a Group, and the function will check the whole provenance of its nodes and invalidate all RemoteData that are no longer good).

This last one would actually be something convenient e.g. to run on a weekly (or even daily) basis. I imagine this usecase: I'm a developer, I'm using caching, and I just want to be sure I don't reuse nodes that are not valid. If cleaned remote dirs invalidate the corresponding calcs, that's already good; in addition, I can run the script at point 4 above before I submit new calcs - this will take some minutes maybe (and it can skip those that have already been marked as "checked and invalid"), but after that I can just run and caching will do the right thing?

Finally, regarding the suggestion not to use "too much" remote data - I'm not sure. We should probably explain the consequences to developers, but I feel that there are cases where this cannot (and maybe should not) be avoided, to avoid having too much data in the AiiDA repo just for checkpoint files. E.g. for long-running (month-long) ab-initio MD runs, it should be easy for a user to store a checkpoint every, say, week, e.g. via stashing, but I wouldn't like the plugin to store all wave functions at every step by default. I am not sure of the perfect balance, or if this can be made automatic (maybe it should be the role of the workflow on top to decide what to store and/or stash, and not the calculation, in this case?)
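A simplified sketch of the batch check from point 4 of the list above: it only looks at CalcJobs directly in the group rather than their whole provenance, and only detects fully cleaned folders via RemoteData.is_empty; removing the _aiida_hash extra is the unofficial invalidation discussed earlier:

from aiida import orm

def invalidate_cleaned_calcs(group_label: str):
    """Drop the cache hash of every CalcJob in the group whose remote folder is empty or gone."""
    query = orm.QueryBuilder()
    query.append(orm.Group, filters={'label': group_label}, tag='group')
    query.append(orm.CalcJobNode, with_group='group', tag='calc', project='*')
    query.append(orm.RemoteData, with_incoming='calc', project='*')

    for calc, remote in query.all():
        # is_empty opens a transport; a real implementation would reuse a single
        # connection for the whole batch, as suggested in point 3 above.
        if remote.is_empty and calc.get_extra('_aiida_hash', None) is not None:
            calc.delete_extra('_aiida_hash')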

unkcpz commented 2 years ago

I just encountered another problem with caching that is a little different from the initial scenario I reported here in the first place. Now, my workchain which runs a PwCalculation and then a PhCalculation finishes ok. I assumed another run of the same workchain would use the cached nodes throughout and finish quickly without any problem. But the catch is that even though CJ1 -> RemoteData -> CJ2 is finished, I have another calculation CJ1' which is identical to CJ1 but was run standalone without the subsequent calculation CJ2. The second run of the workchain tries to use CJ1' as the caching node, which makes the workchain fail since I have already cleaned the remote folder of CJ1'.

I think the "backtracking" may be the only way to solve this problem. But maybe we can have another level of 'process' just for caching purpose for the calculation like PhCalculation which always required a remote_folder as the input parameter. We hash this node also with the information of its previous step process identification like the inputs or just the hash of the previous PwCalculation?

@sphuber This is the issue I mentioned in the aiida meeting. I looked into it and found it is not the problem that presubmit runs even when the node was cached. Sorry about the confusion.

In this case, it is confirmed that under the current implementation using the latest node for caching is not a good idea.

sphuber commented 2 years ago

Thanks for double-checking @unkcpz . Good to know that it isn't the case, because that would have been a bad bug.

Before looking into backtracking, I think the more direct and simple approach is still to actively mark a node that should no longer be used for caching using the is_valid_cache property, which I added in 7cea5bebd15737cb7ae0f4f38dc4f8ca61935ae8. I know that this is not a perfect solution, because typically one will run into at least one failed calculation before figuring out that a cache source should be marked as invalid, but I think the backtracking will be particularly complex to implement.
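A minimal sketch of that approach, assuming the is_valid_cache setter from the referenced commit is available:

from aiida import orm

# Mark the calculation whose remote folder was cleaned so that it is never
# picked as a cache source again (the pk is illustrative).
stale_calc = orm.load_node(1234)
stale_calc.is_valid_cache = False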

unkcpz commented 2 years ago

Thanks @sphuber. Do you think it is possible to implement a way to customize the hash of a CalcJobNode? In this case, it would be for the PhCalculation, since the parent_folder is always read from the remote_folder of a PwCalculation. Could I hash the PhCalculation node without using the remote_folder node, and instead define the hash my own way using the hash of the PwCalculation node?

sphuber commented 2 years ago

I am not sure I follow what you are trying to say @unkcpz . Even if it was possible to change how the hash of the PhCalculation is computed, that won't solve your problem. The problem you have is that in a workchain that runs a PwCalculation and then a PhCalculation, when the PwCalculation is executed, it is taken from the cache, but the source is not a PwCalculation that was run in that work chain before, but some other one (whose remote folder happens to have been cleaned). The decision of AiiDA to take a particular caching source for this PwCalculation is based on the hash of the PwCalculation and is completely independent of the upcoming PhCalculation. So changing the hash of the PhCalculation won't change anything. If anything you need to change the hash of the PwCalculation

unkcpz commented 2 years ago

Let me try to clarify my idea; it is really a bit tricky I think, but I am pretty sure that in logic this will work.

Suppose the workchain contains two calculations. In the first run, there is a PwCalculation and then a PhCalculation which uses the remote_folder of the PwCalculation as the input parent_folder. So we have: PwCJ_0:hash_0 -> remote_folder_0 -> PhCJ_0:hash_0. In principle, if we turn caching on and run the workchain again, the correct cached nodes are used. Then we have: PwCJ_1:hash_0[from using cache] -> remote_folder_0 -> PhCJ_1:hash_0[from using cache]

But in my case, there is another standalone PwCalculation which has the same hash value but a different remote_folder output. That is: PwCJ_1':hash_0[from scratch] -> remote_folder_1. The problem then happens when I run the workchain a second time with the expectation that the PhCalculation will use the cache: since the remote_folder, one of the inputs of the PhCalculation, has changed, the cache is not used for running this PhCalculation. So in order to let the PhCalculation find the desired node, my idea is that when creating the hash for the PhCalculation we don't use the parent_folder but use the hash of the previous PwCalculation step as an object to generate the hash for the PhCalculation. With this modification, the second run of the PhCalculation will find the cached node to use.

If it is not clearly described, maybe we can set up a quick meeting to discuss it? @sphuber

greschd commented 2 years ago

use the hash of the previous PwCalculation step

In this case, aren't the two hashes of PwCJ_1 and PwCJ_1' the same? The "safe" option discussed previously of ignoring any PwCalculation whose RemoteFolder output no longer exists in caching should still work here -- but this obviously means that we will re-do the work even if the PhCalculation could also be cached.

unkcpz commented 2 years ago

Thanks for the comment! @greschd

In this case, aren't the two hashes of PwCJ_1 and PwCJ_1' the same?

Yes, those are the same; this is why I think we can use the hash of the PwCalculation to make the hash of the PhCalculation, rather than using the remote_folder, which will differ.

but this obviously means that we will re-do the work even if the PhCalculation could also be cached.

I do not quite understand what you mean here.

sphuber commented 2 years ago

I see the problem now. It seems indeed that the hash of the remote folder of the PwCalculation is different in both runs and so the PhCalculation is not hit in the cache. The reason lies in _get_objects_to_hash:

    def _get_objects_to_hash(self) -> List[Any]:
        """Return a list of objects which should be included in the hash."""
        assert self._repository is not None, 'repository not initialised'
        top_level_module = self.__module__.split('.', 1)[0]
        try:
            version = importlib.import_module(top_level_module).__version__  # type: ignore[attr-defined]
        except (ImportError, AttributeError) as exc:
            raise exceptions.HashingError("The node's package version could not be determined") from exc
        objects = [
            version,
            {
                key: val
                for key, val in self.attributes_items()
                if key not in self._hash_ignored_attributes and key not in self._updatable_attributes  # pylint: disable=unsupported-membership-test
            },
            self._repository.hash(),
            self.computer.uuid if self.computer is not None else None
        ]
        return objects

For a RemoteData node, this returns something like:

['2.0.0a1',
 {'remote_path': '/home/sph/.scratch/f0/83/18fe-de81-449f-9e50-f98287b5f695'},
 '076c4a75b0e7f3de89d0f21d3c4fad9c09fb83054f8e6c467b875095d93fd154',
 '6a3e7df5-79a9-45f3-ad4b-a25e08af92b2']

The last hash is that of the computer and should be the same if the calculation is run on the same machine (that makes sense). The one just before it is the hash of the repository contents, but that corresponds to the local repository content, which is empty by definition for a RemoteData. Then there is the remote_path attribute, and this is where the problem lies. This is going to be different for remote_folder_0 and remote_folder_1. Maybe we should simply remove this from the hash for RemoteData nodes?

I am trying to see if there are cases where we should actually include it because it should mean that the hash of the associated calcjob node is different if it was run in a different folder.

sphuber commented 2 years ago

If we decide that it should not be included, the fix is easy. We simply add

    @classproperty
    def _hash_ignored_attributes(cls) -> Tuple[str, ...]:
        return super()._hash_ignored_attributes + ('remote_path',)

to the RemoteData class.

greschd commented 2 years ago

I don't think we can change this globally. In essence, it would make all RemoteData hash to the same value, which produces false positives in the caching.

The behavior you're seeing here isn't completely unique to RemoteData, either: If there are two calcjobs with identical hash, which one is picked is essentially random. Of course, the outputs of these two calcjobs (if they weren't cache clones of each other) will differ, however slightly -- and hash to different values.

Because "regular" outputs cannot become invalid, this is less of a problem. But one could still imagine a scenario where one "branch" of calculations has a lot of downstream calculations already executed, while another has yet to perform these same calculations. Of course, it's more serious in the RemoteData case because the calculation will fail, instead of just having to re-do the work.

In either case, this situation will not arise if caching is always turned on.

sphuber commented 2 years ago

I don't think we can change this globally. In essence, it would make all RemoteData hash to the same value, which produces false positives in the caching.

Well, only those that were run on the same computer. But sure, I take your point. I guess that shows indeed that there is a good reason that we include the remote path. I now see @unkcpz 's point, that looking at the RemoteData the caching misses the hit, even though the associated PwCalculation would have had identical inputs and so would have matched. Still I am not sure if we can solve this by simply including the hash of the creator of a RemoteData that is used as input, because I think this merely kicks the can down the road. Because that creator may itself have a RemoteData as input and so its hash needs to consider the hash of its creator and so forth. There may very well be an end to this recursion, but I am not sure how feasible it is to implement and how reliable it would be. Also, there could be the possibility that a calculation has more than one RemoteData as inputs (currently probably not very common, but nothing stopping that from happening in theory). This would become quite complicated very quickly.

Of course, the outputs of these two calcjobs (if they weren't cache clones of each other) will differ, however slightly -- and hash to different values.

Is this a guarantee, or just something that is likely? That is to say, we cannot assume that the hashes of the outputs will be identical if the hashes of the calculations themselves are identical. What exactly in the outputs would be guaranteed to be different, even if "ever so slightly"?

In either case, this situation will not arise if caching is always turned on.

This I don't follow. In @unkcpz examples caching is turned on, but it clearly shows the problem. If the "wrong" node is taken as the source for the cache hit, the next calculation may not exist in the cache, whereas if another node would have been selected as source (randomly) it may well have existed.

greschd commented 2 years ago

Is this a guarantee, or just something that is likely? That is to say, we cannot assume that the hashes of the outputs will be identical if the hashes of the calculations themselves are identical. What exactly in the outputs would be guaranteed to be different, even if "ever so slightly"?

You're right, that's not a guarantee -- but very likely unless one is careful that the calculation is bit-wise reproducible.

In @unkcpz examples caching is turned on, but it clearly shows the problem.

Ah, you're right. I was thinking if caching is turned on there can't be two nodes with the same hash, otherwise they would be cache clones. But if they were run simultaneously, that can still happen.

sphuber commented 2 years ago

Ah, you're right. I was thinking if caching is turned on there can't be two nodes with the same hash, otherwise they would be cache clones. But if they were run simultaneously, that can still happen.

Ah, I see now what you mean. It might indeed very well be that for the individual and isolated PwCalculation (the one that wasn't run as part of the work chain) he turned off caching, or simply didn't enable it. Still, this is a very likely scenario, so it poses a real problem.

greschd commented 2 years ago

Right, but in the grand scheme of things I am more worried about the fact that the subsequent calculation fails, and less about the fact that it might have to re-run if we implement any of the strategies discussed previously.

unkcpz commented 2 years ago

Thanks for the discussion @sphuber @greschd !! I am so happy you got my point and looks we are on the same page now.

But if they were run simultaneously, that can still happen.

That's true, I turned caching on from the first place, and the case happened because two calculations with the same inputs ran simultaneously.

I don't think we can change this globally. In essence, it would make all RemoteData hash to the same value, which produces false positives in the caching.

I agree with it. Hashing RemoteData with only the computer seems like overdoing it and is problematic. It makes no sense that two RemoteData objects would be considered the same just because they are produced on the same computer.

Still I am not sure if we can solve this by simply including the hash of the creator of a RemoteData that is used as input, because I think this merely kicks the can down the road. Because that creator may itself have a RemoteData as input and so its hash needs to consider the hash of its creator and so forth. There may very well be an end to this recursion, but I am not sure how feasible it is to implement and how reliable it would be. Also, there could be the possibility that a calculation has more than one RemoteData as inputs (currently probably not very common, but nothing stopping that from happening in theory). This would become quite complicated very quickly.

Let me try to kick the can farther away. We found that the problem is caused by the hash of the RemoteData. So instead of including the hash of the creator of a RemoteData in the hash of the PhCalculation, can we hash the RemoteData with the hash of its creator? It makes sense that a RemoteData is always created by some CalcJob, so the creator of the RemoteData can be a safe identifier bound to the RemoteData and used to make its hash. Then even if there is more than one RemoteData as input, we can still handle the situation.

greschd commented 2 years ago

can we hash the RemoteData with the hash of its creator

I like the idea, but haven't thought deeply about it yet. Since @sphuber brought up the issue of multiple RemoteData from the same calculation, maybe it should also include the link name?

sphuber commented 2 years ago

We found that the problem is caused by the hash of the RemoteData. So instead of including the hash of the creator of a RemoteData in the hash of the PhCalculation, can we hash the RemoteData with the hash of its creator? It makes sense that a RemoteData is always created by some CalcJob, so the creator of the RemoteData can be a safe identifier bound to the RemoteData and used to make its hash. Then even if there is more than one RemoteData as input, we can still handle the situation.

Ok, I am trying to figure out whether this would solve your problem. To make discussion and thinking easier, I have created a sketch of the situation as I understand it (attached image: caching_problem).

We have W1 that launches a PwCalculation (Pw1) which creates a RemoteData (R1), which is used as input for a PhCalculation (Ph1). Another PwCalculation (Pw2) is run outside of a workchain with the same input D1. The hash of Pw1 and Pw2 are identical, but the hashes of their RemoteData, R1 and R2 are different. Now the user launches a new workchain W1' which uses the exact same inputs as W1. The PwCalculation can now be cached from both Pw1 and Pw2 since their hashes are identical. Let's say that Pw2 is chosen (by chance). This produces RemoteData (R2') which has the same hash as R2 since it is a clone. Now the workchain moves on to running the PhCalculation, but it won't find a cache source, because no PhCalculation has been run yet with R2 as an input.

If this description of the situation and problem that @unkcpz describes is correct (please correct me if it is not), then I don't see how including the hash of the creator in the RemoteData will solve this issue. As long as you include the hash of the RemoteData itself, which is different for R1 and R2, then Ph2 will never be cached from Ph1, even if its predecessor Pw2' has an analogue (Pw1) with the same hash.

unkcpz commented 2 years ago

This is exactly what the situation is, thanks for the figure. But I am not talking about including the hash of the creator; I mean replacing the path which is currently used in hashing the RemoteData.

sphuber commented 2 years ago

I see now. Yes I think that would solve this particular case, but it is difficult to foresee if it might touch other pathways in a negative way. At least for now try adding the following to your aiida-core locally and see if it then fixes your problem:

class RemoteData(Data):

    def _get_objects_to_hash(self) -> List[Any]:
        from importlib import import_module

        creator = self.creator

        if creator is not None:
            return [
                creator.get_hash(),
                import_module(self.__module__.split('.', 1)[0]).__version__,
                self.computer.uuid if self.computer is not None else None,
            ]

        return super()._get_objects_to_hash()

ramirezfranciscof commented 2 years ago

One thing I was thinking about, for a more generic way to deal with these cases, could maybe be to allow users to insert their own selection function for the cached calcjobs. So I'm thinking we could add a calculation_selector property to the CalculationNode which would take functions with the following signature, and that would default to the following:

def default_selector(iterable_of_candidates: Iterable[CalculationNode]):
    return next(iter(iterable_of_candidates), None)

Then the caching mechanism on the engine would (1) get the iterable of equivalent nodes, (2) run it through the calculation_selector (which can be set by the user to something different or default to the option above), and then (3) use the result of that if it is not None or proceed with running the calculation if it is None.

Not sure if passing an Iterable would be enough for keeping the current performance for the default behavior, but if it is not we can be more specific and pass a querybuilder instance directly.

def default_selector(candidates_query: QueryBuilder):
    return candidates_query.first()
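As a usage example of this hypothetical hook, a selector implementing the "always use the latest" idea discussed earlier might look like this (the query object and how it was built are assumptions):

from aiida import orm

def newest_selector(candidates_query: orm.QueryBuilder):
    # Prefer the most recently created of the equivalent candidates; assumes the
    # query appended the CalculationNode class, as in the default above.
    candidates_query.order_by({orm.CalculationNode: {'ctime': 'desc'}})
    return candidates_query.first()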

unkcpz commented 2 years ago

I'll keep adding comments here; this one is for the newly found caching issue mentioned in this morning's aiida meeting.

When two processes run subsequently in a row, for example CalcJob_A and then CalcJob_B, where B uses the RemoteData of A as input: assume A finished ok and is cached, while B is not cached and will rerun. During the job submit phase, aiida will copy the remote folder of A to a remote folder for B. However, the destination path (for B) has the same remote folder name as the previously run, but not finished or cached, B, so it raises an exception that the 'directory exists and can't be overridden'.

I am sure this issue happened before, but I only noticed it recently, once the other two caching problems (in this issue report) were worked around and suppressed.

Hi @sphuber, what tool did you use to draw the flow sketch above? I think it would be clearer if I could summarize all three caching issues with flow sketch diagrams.

sphuber commented 2 years ago

it raises exception that 'directory exist and can't be override'.

Can you please post the stacktrace

Hi @sphuber, what tool you used for drawing the flow sketch above? I think it will be more explainable if I can summarize all three caching issues with flow sketch diagrams respectively.

It is just an svg that I made with Inkscape and then converted to png. There is no automated tooling

unkcpz commented 2 years ago

Hi @sphuber, here is the stacktrace of the issue. To reproduce this you can:

04/06/2022 12:44:40 PM <629816> aiida.execmanager: [WARNING] tried to create path /scratch/jyu/aiida_run/6c/64/9293-9c94-47c4-b862-0c0316653e9a but it already exists, moving the entire folder to /scratch/jyu/aiida_run/lost+found/6c649293-9c94-47c4-b862-0c0316653e9a
04/06/2022 12:44:40 PM <629816> aiida.execmanager: [WARNING] [submission of calculation 71057] Unable to copy remote resource from /scratch/jyu/aiida_run/00/05/6851-175f-40fe-a53b-c88a643d5ae9/./out/ to ./out/! Stopping.
04/06/2022 12:44:40 PM <629816> aiida.engine.transports: [ERROR] Exception whilst using transport:
Traceback (most recent call last):
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/site-packages/aiida/engine/transports.py", line 110, in request_transport
    yield transport_request.future
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 89, in do_upload
    execmanager.upload_calculation(node, transport, calc_info, folder)
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/site-packages/aiida/engine/daemon/execmanager.py", line 225, in upload_calculation
    transport.copy(remote_abs_path, dest_rel_path)
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/site-packages/aiida/transports/plugins/local.py", line 553, in copy
    raise OSError('Source not found')
OSError: Source not found                                 

04/06/2022 12:44:40 PM <629816> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [ERROR] iteration 3 of do_upload excepted, retrying after 80 seconds
Traceback (most recent call last):
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/site-packages/aiida/engine/utils.py", line 188, in exponential_backoff_retry
    result = await coro()                                 
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 89, in do_upload
    execmanager.upload_calculation(node, transport, calc_info, folder)
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/site-packages/aiida/engine/daemon/execmanager.py", line 225, in upload_calculation
    transport.copy(remote_abs_path, dest_rel_path)
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/site-packages/aiida/transports/plugins/local.py", line 553, in copy
    raise OSError('Source not found')
OSError: Source not found                                 
04/06/2022 12:46:00 PM <629816> aiida.execmanager: [WARNING] tried to create path /scratch/jyu/aiida_run/6c/64/9293-9c94-47c4-b862-0c0316653e9a but it already exists, moving the entire folder to /scratch/jyu/aiida_run/lost+found/6c649293-9c94-47c4-b862-0c0316653e9a
04/06/2022 12:46:00 PM <629816> aiida.engine.transports: [ERROR] Exception whilst using transport:                                                                                                                                         
Traceback (most recent call last):
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/site-packages/aiida/engine/daemon/execmanager.py", line 144, in upload_calculation
    transport.mkdir(calc_info.uuid[4:])
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/site-packages/aiida/transports/plugins/local.py", line 192, in mkdir
    os.mkdir(os.path.join(self.curdir, path))
FileExistsError: [Errno 17] File exists: '/scratch/jyu/aiida_run/6c/64/9293-9c94-47c4-b862-0c0316653e9a'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/site-packages/aiida/engine/transports.py", line 110, in request_transport
    yield transport_request.future
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 89, in do_upload
    execmanager.upload_calculation(node, transport, calc_info, folder)
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/site-packages/aiida/engine/daemon/execmanager.py", line 156, in upload_calculation
    transport.copytree(path_existing, path_target)
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/site-packages/aiida/transports/plugins/local.py", line 646, in copytree
    shutil.copytree(the_source, the_destination, symlinks=not dereference)
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/shutil.py", line 557, in copytree
    return _copytree(entries=entries, src=src, dst=dst, symlinks=symlinks,
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/shutil.py", line 458, in _copytree
    os.makedirs(dst, exist_ok=dirs_exist_ok)
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/os.py", line 225, in makedirs
    mkdir(name, mode)                                     
FileExistsError: [Errno 17] File exists: '/scratch/jyu/aiida_run/lost+found/6c649293-9c94-47c4-b862-0c0316653e9a/9293-9c94-47c4-b862-0c0316653e9a'
04/06/2022 12:46:00 PM <629816> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [ERROR] iteration 4 of do_upload excepted, retrying after 160 seconds
Traceback (most recent call last):
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/site-packages/aiida/engine/daemon/execmanager.py", line 144, in upload_calculation
    transport.mkdir(calc_info.uuid[4:])
  File "/home/jyu/miniconda3/envs/aiida-sssp-dev/lib/python3.9/site-packages/aiida/transports/plugins/local.py", line 192, in mkdir
    os.mkdir(os.path.join(self.curdir, path))
FileExistsError: [Errno 17] File exists: '/scratch/jyu/aiida_run/6c/64/9293-9c94-47c4-b862-0c0316653e9a'

During handling of the above exception, another exception occurred:

....
....

FileExistsError: [Errno 17] File exists: '/scratch/jyu/aiida_run/lost+found/6c649293-9c94-47c4-b862-0c0316653e9a/9293-9c94-47c4-b862-0c0316653e9a'                                                                                         
04/06/2022 12:48:40 PM <629816> aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [WARNING] maximum attempts 5 of calling do_upload, exceeded                                                                                       
04/06/2022 12:48:40 PM <629816> aiida.engine.processes.calcjobs.tasks: [WARNING] uploading CalcJob<71057> failed   

It is just an svg that I made with Inkscape and then converted to png. There is no automated tooling

Thanks!

sphuber commented 2 years ago

The message "tried to create path /scratch/jyu/aiida_run/6c/64/9293-9c94-47c4-b862-0c0316653e9a but it already exists" is just a warning and is not actually causing the exception. What you are hitting there is the safety check I added in upload_calculation that checks whether the run directory on the scratch folder of the computer already exists. As you can see, this folder, /scratch/jyu/aiida_run/6c/64/9293-9c94-47c4-b862-0c0316653e9a in this case, contains the UUID of the CalcJob. This should in principle guarantee that it is unique. However, there are some edge cases where it can already exist: for example, when upload_calculation is called but then interrupted without reaching the end (daemon shut down); when the daemon restarts, it re-executes upload_calculation, which will find that the folder already exists. For safety it doesn't override it but moves it to the lost+found directory, as stated in the warning message.

The exception is actually caused later at line 225 (by the way, what version are you running, because that is not the same in develop) and the problem is that the source files being copied apparently no longer exist. I can't say what could've happened here without more information, but please check that the remote_folder that is passed to the calculation, and from which the content is copied, is still there and is not empty.

unkcpz commented 2 years ago

I am in version v1.6.5.

The exception is actually caused later at line 225

You are right. I checked it, and this is the initial issue of this thread: I forgot to invalidate the caching of the pw CalcJob whose remote folder had been cleaned, and it was used for the Ph calculation.