How should we handle errors and warnings ejected from VASP

aiida-vasp / aiida-vasp

A plugin to AiiDA for running simulations with VASP

https://aiida-vasp-plugin.readthedocs.org/en/latest

Other

48 stars 45 forks source link

How should we handle errors and warnings ejected from VASP #346

Open espenfl opened 4 years ago

espenfl commented 4 years ago

This issue handles the discussions and decisions that are taken in order to offer the functionality to recover from critical errors, adjust a workchain course based on less severe warnings etc.

A few preliminary points that we should discuss in more detail:

How to parse the stdout/err of VASP (which usually ends up in the scheduler output file).
How to parse any salvageable output from VASP as we might use the results directly, or the results to enrich the output from the stdout/err.
Need a flexible way to act based on the data obtained above.
How to document the steps and their interdependency in a clear and understandable manner? Is it possible or just too complex?
How to log/report the steps/procedure taken?
Can we always reproduce a calculation where these steps are taken?
Should we make a workflow that handles the flow of steps taken?

espenfl commented 4 years ago

* How to parse the stdout/err of VASP (which usually ends up in the scheduler output file).

See #345. We should probably let parsevasp take care of the parsing itself in case we or others have a use for it outside aiida-vasp.

espenfl commented 4 years ago

* How to document the steps and their interdependency in a clear and understandable manner? Is it possible or just too complex?

Should we use Graphviz? Other tools that makes it easy to produce a diagram in Sphinx that can be versioned?

espenfl commented 4 years ago

* Need a flexible way to act based on the data obtained above.

To what extent should we use the error handler concept?

espenfl commented 4 years ago

* Should we make a workflow that handles the flow of steps taken?

aiida_core have the concept of process handlers, but it is currently unclear when we should use these and when to inline actions that need to be taken due to some raised error, for instance as a dedicated workchain.

It would maybe make sense to create a separate modular workchain/flow to each problem so that the action become a part of the provenance in an intrinsic manner. This is also what was originally the though of the VerifyWorkChain, a workchain that would house all kinds of calls to workchains that solve all kinds of issues in a modular way, finally calling the VaspWorkChain again when you have fixed your inputs, options, metadata etc. As such VaspWorkChain is mostly a slave that only prepares for the execution of the VaspCalculation which calls VASP. If we follow this path, this raises another question, where do we then draw the limit between workflow and code specific workchains? Right now and with the introduction of the ParametersMassage concept it made it natural that we only have the VASP specific code in VaspWorkChain. But if we introduce code dependent measures at the VerifyWorkChain level, that concept needs to be lifted one level. But that is probably okey? Also, one could consider to segrate and keep non-code specific action in a separate set of workchains so that they are in principle code agnostic.

zhubonan commented 4 years ago

Should we make a workflow that handles the flow of steps taken?

I think it is probably best to leave the part of making non-code specific workchains later - one can always write adaptor to workchains to achieve that later on, and we should focus on code-specific part for now. At a very low level, each code can work quite differently. For example, in VASP you need to do two calculations for variable cell relaxation or bandstructure or phonons, but in CASTEP they can be done in a single calculation without a problem.

It would maybe make sense to create a separate modular workchain/flow to each problem so that the action become a part of the provenance in an intrinsic manner.

In general, I think it is best if we don't use too many levels of workchains, as the graph can start to get really complicated and difficult to understand/query/view. Testing can also be a problem and the code base can become difficult to maintain.

Need a flexible way to act based on the data obtained above.

My suggestion would be to add the error handler at VaspWorkChain level - which reads the parsed warnings or the exit codes (only for the critical error) and acts accordingly. Things like NELM being exceeded can be handled there, as least this is what I do for my CASTEP plugin. This way, VaspWorkChain has an actual meaning - it is the base workflow that guarantees (or try to guarantee) the calculation to finish as intended.

How to log/report the steps/procedure taken?

In terms of making error handling transparent, I think making verbose logging using self.report (as the default) is sufficient. Add another level(s) of WorkChain seems a bit unnecessary.

Can we always reproduce a calculation where these steps are taken?

Even we do the error handling, each calculation (with erros) are still reproducible I think. Of course, if we actually rerun, some error may not pop up, but it is still the same correct result that is obtained. s

espenfl commented 4 years ago

Should we make a workflow that handles the flow of steps taken?

I think it is probably best to leave the part of making non-code specific workchains later - one can always write adaptor to workchains to achieve that later on, and we should focus on code-specific part for now. At a very low level, each code can work quite differently. For example, in VASP you need to do two calculations for variable cell relaxation or bandstructure or phonons, but in CASTEP they can be done in a single calculation without a problem.

I was thinking more of if we should use the handler concept or more inline coding in a workchain/flow for addressing error/warnings etc. A lot of them, we already have strategies that usually mitigate the problems. Initially I favoured the handler concept, but at least for more code specific stuff, I am not convinced and have started to lean more towards having it in a workflow/chain. What do you think?

espenfl commented 4 years ago

It would maybe make sense to create a separate modular workchain/flow to each problem so that the action become a part of the provenance in an intrinsic manner.

In general, I think it is best if we don't use too many levels of workchains, as the graph can start to get really complicated and difficult to understand/query/view. Testing can also be a problem and the code base can become difficult to maintain.

I agree, but we also would have to realize that the problem indeed is quite complicated and getting a simple graph for a complicated problem is never going to be easy. So some level of complexity have to be expected on the graph. Maybe in the future we could make a lot of this a bit easier with better tooling for graph exploration, for instance functionality to turn on and off error/warning correction steps taken during visualization etc.

Testing more complicated workflows is going to be a serious problem. However, with the recent ideas of using the hash functionality to do this, I am quite positive about this. It remains however to see how this will work in practice.

espenfl commented 4 years ago

Need a flexible way to act based on the data obtained above.

My suggestion would be to add the error handler at VaspWorkChain level - which reads the parsed warnings or the exit codes (only for the critical error) and acts accordingly. Things like NELM being exceeded can be handled there, as least this is what I do for my CASTEP plugin. This way, VaspWorkChain has an actual meaning - it is the base workflow that guarantees (or try to guarantee) the calculation to finish as intended.

This is what I initially was thinking, but when inspecting some of the workflows we have already for a few of the errors, it is quickly 5-10 different VASP runs. Some of these operations might make sense to put into its own workchain as they are general and used as a component for several other mitigation strategies. A very simple case is node distribution which is used to mitigate several problems.

I agree, that for the more simple problems, which is more of a one off situation, we might as well just put it in as a handler at the VaspWorkChain. But for the stuff that needs to tests different things (and there is not a clear path for each problem) I think a WorkChain might be more suitable. But it is not yet clear to me how we should call those and if that should be done above VaspWorkChain, say in VerifyWorkChain or from VaspWorkChain. Maybe it makes sense to put the VerifyWorkChain below the VaspWorkChain as this is sort of the main entry and exit point. Originally, the VerifyWorkChain was going to check non-code specific things, like symmetries, some physical principles that should be obeyed etc. (and that is why it is now above the VaspWorkChain) but that is going to take more time and the fundamental error correction stuff is much more pressing. What do you think? @atztogo what about you?

espenfl commented 4 years ago

How to log/report the steps/procedure taken?

In terms of making error handling transparent, I think making verbose logging using self.report (as the default) is sufficient.

I agree. And we will have to raise an exit code (or at least for some of the issues) which we can enrich a bit.

espenfl commented 4 years ago

Can we always reproduce a calculation where these steps are taken?

Even we do the error handling, each calculation (with erros) are still reproducible I think. Of course, if we actually rerun, some error may not pop up, but it is still the same correct result that is obtained. s

This is maybe not always true as it depends on how you run the calculation on the cluster. If you also make sure the node distribution is the same (and that the system does not change e.g. cpu/node ratios etc.) and you are on the same computer I think this is mostly true. I guess this is also a more general question; should we expect reproducibility on different computer/setups (meaning e.g. cpu/node distributions)? I think the answer is yes, as long as the operations are not in principle dependent on the computer/setups, but in practice they are, especially for parallel executions. This is also something we should discuss with aiida_core in a broader context. @atztogo do you have some thoughts on this as well?

But, as long as the calculation finishes, the result should be the same within the numerical boundaries set by the code, libraries and architecture (which today is mostly the same...we have seen some funny things on ARM platforms, but that is known).