We had an offline discussion on whether the current version of the `BaseRestartWorkChain` present in the aiida-qe plugin is general enough or whether it still needs to be modified.
Participants of the discussion: @sphuber, @greschd, @chrisjsewell, @felixboelle, and myself.
The latest version of the `BaseRestartWorkChain` implements two mechanisms for handling problems, sanity checks and error handlers, which are applied according to the following logic:
- If the calculation finished without problems, apply the sanity checks:
  - if a sanity check found a problem and fixed it: restart the calculation with updated settings/parameters;
  - if a sanity check found a problem and did NOT fix it: stop the work chain with an error;
  - if no sanity check found a problem: finish the work chain successfully.
- If the calculation excepted or was killed: stop the work chain.
- If the calculation finished but has a non-zero exit status, try to apply an error handler:
  - if an error handler could fix the problem: restart the calculation with updated settings/parameters;
  - if no error handler could fix the problem: stop the work chain with an error.
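For readers who prefer code, here is a minimal sketch of that logic; the helper methods, the `fixed` report attribute, and the exit code names are illustrative, not the actual implementation:

```python
def inspect_calculation(self, calculation):
    """Illustrative sketch of the restart logic described above."""
    if calculation.is_finished_ok:
        report = self.apply_sanity_checks(calculation)  # hypothetical helper
        if report is None:
            self.ctx.is_finished = True  # no problem found: finish successfully
        elif report.fixed:
            self.ctx.is_finished = False  # problem fixed: restart with updated inputs
        else:
            return self.exit_codes.ERROR_SANITY_CHECK_FAILED  # stop with an error
    elif calculation.is_excepted or calculation.is_killed:
        return self.exit_codes.ERROR_CALCULATION_FAILED  # stop the work chain
    else:  # finished, but with a non-zero exit status
        report = self.apply_error_handlers(calculation)  # hypothetical helper
        if report is not None and report.fixed:
            self.ctx.is_finished = False  # problem fixed: restart with updated inputs
        else:
            return self.exit_codes.ERROR_UNHANDLED_FAILURE  # stop with an error
```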
Topics for the discussion were:

- the `priority` of the register decorators;
- a keyword-only `exit_codes` argument to `register_process_handler`, to allow filtering for a set of specific exit codes. If the exit code of the current child process is not in `exit_codes`, the handler is skipped entirely.

The rough logic you sketched is missing the part where, if a failed calculation is not handled, it will still be restarted, and a second consecutive unhandled failure will cause the work chain to abort. In addition, I moved the check for `excepted` and `killed` to the top of the checks.
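For concreteness, a registration with these keywords might look something like the following sketch; the handler body, the exit code number, and the report arguments are assumptions, not the final interface:

```python
@register_process_handler(priority=500, exit_codes=[410])
def handle_out_of_walltime(self, calculation):
    """Skipped entirely unless the child process finished with exit code 410."""
    # Hypothetical fix: double the requested walltime and restart.
    self.ctx.inputs.metadata.options['max_wallclock_seconds'] *= 2
    return ErrorHandlerReport(do_break=True, exit_code=None)
```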
I currently have a version that satisfies two of the four checkboxes. I will open a PR right away so that people can start testing with their respective plugins. The other two checkboxes are features that will be added in separate commits.
The `ErrorHandlerReport` named tuple has changed. Originally the first element was a boolean named `is_handled`. After discussion I removed it, because we decided that returning an `ErrorHandlerReport` in and of itself should signal that the error was handled. However, when I read my original docstring, I explicitly mentioned that if the conditional of the handler was matched, a report instance should always be returned, even if nothing was handled. This way it would still engage the two-consecutive-failures mechanism, while also recording the fact that the condition of the handler was triggered. Note that the registration decorators will only add a reference to the extras of the node if an instance of `ErrorHandlerReport` is returned. Do we need this level of control? I.e., should I add the explicit `is_handled` boolean again? @greschd @yakutovicha @chrisjsewell
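To make the two variants under discussion concrete, they could be sketched as follows; the field names other than `is_handled` are assumptions based on this thread, not the actual definition:

```python
from collections import namedtuple

# Original version: the explicit flag is the first element.
ErrorHandlerReport = namedtuple('ErrorHandlerReport', ['is_handled', 'do_break', 'exit_code'])

# Current version: returning a report at all signals that the error was handled.
ErrorHandlerReport = namedtuple('ErrorHandlerReport', ['do_break', 'exit_code'])
```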
I've given some thought to whether or not to merge sanity checks and error handlers and want to reiterate the discussion. Just to state things clearly: I am pro merging them. In this post, I share my opinion on why this should be done.
The main argument in favour of keeping them is "...the fundamental difference between error handlers and sanity checks. The error handlers should be applied when the process has returned a non-zero exit status, while the sanity checks must be applied otherwise." I really don't understand what advantage the existence of such a fundamental difference brings to us and to the user. @sphuber, can you maybe provide an example where this pays off?
Another argument I have is: sanity checks alone can do the job if one always applies them. If I want to apply a particular check only if everything is alright, I can add a simple if-statement within the check. Or one can add a decorator of sorts, such as `@apply_when_finished_ok` or `@apply_when_nonzero_exit`.
Another question: if I want to apply certain sanity checks even if the job has a non-zero exit status, I can't do that, right? If sanity checks and error handlers were combined, this problem would not exist.
In addition to that, I see a lot of code duplication in the current implementation of sanity checks and error handlers: the sanity check and handler callers, the sanity check and handler reports, (possibly) separate entry points, and the decorators.
I also think the logic of applying sanity checks and error handlers can be unified. Something like:

```python
action_was_taken, exit_code = apply_checks_and_handlers()

if action_was_taken and exit_code == 0:
    # do restart
elif action_was_taken and exit_code != 0:
    # report "could not fix the problem" and stop
elif process.exit_status != 0:
    # declare an unhandled error
```
**Summarizing**: I still don't see the need for two separate concepts, and I don't see the value added by such a separation. The same can be done with sanity checks only. This approach would have the following benefits: (a) more straightforward logic, (b) less code, (c) more flexibility (see point 3).
Mentioning @giovannipizzi in case he has an opinion on this.
Think about it like this: if a calculation is successful, why would you consider any error handlers? Conversely, if a calculation has not failed, why would you need to handle any errors? I am convinced that the distinction is important to force developers to think about the structure of their code and not lump everything under a single denominator.
If you really have a situation where a piece of code makes sense both as an error handler and as a sanity check (which I still doubt), you could still use it as both. With the input namespaces that I will add, you can simply pass it to both.
Yes, of course sanity checks could do everything, but my point is that they shouldn't, because then they would not be sanity checks. And your `@apply_when_finished_ok` and `@apply_when_nonzero_exit` are exactly the `@register_sanity_check` and `@register_error_handler`.
You can pass the same function both as an error handler as well as a sanity check, there is no limitation.
I keep them separate for now, because they are not fully identical and might need to diverge even further. The report namedtuples may potentially be merged.
Unless there is an urgent need to merge this, I feel this topic (separation vs. merging) merits a discussion involving more people (unless it was already discussed at the coding week). Since this has been used by other plugins, what do the other developers who used it think?

Also, I guess that in the pseudocode above, if the calculation is killed then it's not restarted once more?
> Also, I guess that in the pseudocode above, if the calculation is killed then it's not restarted once more?
Yes, in the latest implementation, if the calculation is killed or excepted, the work chain stops.
The distinction between error handlers and sanity checks has been there for a while (a year, I guess, but not in the first version) and most plugins use this version. The version I am adding here generalizes the concept a bit more, such that the sanity checks can now also be registered through decorators. Currently there is a single method that should be overridden. The merits of the separation were discussed in Fiesch with @greschd and @chrisjsewell, and they agreed that the difference was useful. Correct me if I remember wrong here, guys.
To avoid misunderstanding on terminology, I will call the joined concept of *error handlers* and *sanity checks* *fixers*.
> The merits of the separation were discussed in Fiesch

Indeed, I also agreed there that such a distinction is useful. But when I looked at the code again, my doubts came back. I am very sorry to bring this discussion back, but I really feel the need for the final decision to be the result of deep consideration. Moreover, if one later has doubts about the design decision made, one can always come back to this issue and understand the reasoning behind it.
> Think about it like this: if a calculation is successful, why would you consider any error handlers?

No question about this: I don't want to run error handlers if there are no errors. But in the joined *fixers* concept one could bypass this by adding an `if` statement that lets a particular fixer run only if there is a non-zero exit code. If one wants to be more advanced, one could even add a decorator that does this (I mentioned this in my previous comment).
Consider also the following. Let's take a very simple example: convergence. Convergence is a very well-defined concept, but for certain tasks there is no rigorous way to check it. This means that for one type of code one could detect an "unconverged" error, while for others (like Monte Carlo) the determination of convergence becomes very much system dependent. In the first case one would call the error handlers; in the other case one would call the sanity checks. To me this essentially means that there is no fundamental distinction between sanity checks and error handlers, as the same type of problem is solved by either of them.
The place where such a distinction is created is at the parser level. And, again, the logic becomes very much like this: if my parser is very advanced and can parse many things, I will have a lot of error handlers and very few sanity checks. If my parser is not that advanced, the situation will be reversed.
> The place where such a distinction is created is at the parser level. And, again, the logic becomes very much like this: if my parser is very advanced and can parse many things, I will have a lot of error handlers and very few sanity checks.
You hit the nail on the head there for me. I think the calculation and parser are important and should do as much as is reasonably possible to give a complete and useful picture, through the exit code, of what happened. I want to prevent the `BaseWorkChain` concept from getting "abused" as a single massive point where everything is dealt with. This would make the calculation job nodes information-less.
> If my parser is not that advanced, the situation will be reversed.

You are right, but the conclusion then should not be "we get rid of any distinction and deal with all things in one amorphous concept in the wrapping work chain", but rather "I will improve my parser and make it deal with the things that it should be responsible for". I am not going to just cram everything into the base work chain. I am convinced that if one does not take the time to think properly about who is responsible for what and what the intended goals of the various components are, one will end up writing monolithic, ill-defined, and non-reusable code. So while the separation of the concepts may seem arbitrary, I think it is exactly what is needed to force people to think about their design and what it is they are trying to do.
I think the difference between error handlers and sanity checks can be expressed in terms of what they do to the exit codes:

- error handlers can turn a non-zero exit code into a zero exit code
- sanity checks can turn a zero exit code into a non-zero one

Accordingly, I'd suggest the control flow to be (in that order):

1. if the exit code is non-zero, apply the error handlers
2. if the exit code is zero, apply the sanity checks

Of course one could think of an error handler that "handles" the error a sanity check raises, but I think that would be a code smell.

Without the distinction, one could imagine a very convoluted back and forth between zero and non-zero exit codes, depending on the priority of the individual fixers.
There is a slight problem with your reasoning, I think:

> error handlers can turn a non-zero exit code into a zero exit code

Yes, that is true, but only indirectly. In this context, "handling the error" is merely adapting the inputs and telling the inner loop to launch another calculation. So I don't think it makes sense to change the logic, because after the error handlers have been called, the exit code of the last calculation will never have been changed.
Maybe I'm misremembering this, but didn't we discuss a case where an error handler decides "it wasn't an error after all", and sets the exit code to zero?
One could argue that it's the parser's responsibility to assign the correct exit codes, but the problem is that this is not configurable. If I were to decide that a certain kind of error is benign in my application, I could however add an error handler that just swallows it (through the input or entry points).
> Maybe I'm misremembering this, but didn't we discuss a case where an error handler decides "it wasn't an error after all", and sets the exit code to zero?
It cannot and should not change the exit code of the calculation job itself, but for the work chain itself, yes, that is true and that is possible: the error handler could simply set `self.ctx.is_finished` to `True` and the work chain will go to the `results` step. I thought about formalizing this a bit more and doing it automatically if the returned `ErrorHandlerReport` has `ExitCode.status == 0`, but then the question arises whether that is also the exit code that should be set on the work chain itself. I have a use case, though, where I would still like to give a non-zero exit status to the work chain which, just like that of the calculation job, represents a benign "error", and this would not be possible if I were forced to pass zero in order for the cycle to stop and go to the results.
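A sketch of that use case, reusing the hypothetical registration decorator and report tuple from above; the exit code number is also made up:

```python
@register_process_handler(priority=100, exit_codes=[350])  # 350: hypothetical benign error
def handle_benign_failure(self, calculation):
    """Decide that this failure is acceptable: stop cycling and go to the results step."""
    self.ctx.is_finished = True  # the work chain proceeds to `results`
    # A non-zero work chain exit status could still be recorded here if desired.
    return ErrorHandlerReport(do_break=True, exit_code=None)
```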
In this case, shouldn't the sanity check still run? There could be other issues with the calculation than the one ignored by the error handler, no?
> I think the calculation and parser are important and should do as much as is reasonably possible to give a complete and useful picture, through the exit code, of what happened.
This is again an open question: what exactly is the parser responsible for? The problem with such a definition is that it has no clear borders: "do as much as reasonably possible" and "complete and useful picture" do not say anything to me. To me, the task of the parser has always been a single one: parse what is output. This essentially means that if, for instance, CP2K produced output in JSON format, I would happily replace the whole parser with a single-line JSON loader. I would really be careful with introducing logic in the parser, because it can easily become a monster.
For the work chain, the situation is different for me, though. I think a work chain derived directly from the `BaseRestartWorkChain` must be devoted to achieving a specific goal and must have all the necessary logic to do so.
> You are right, but the conclusion then should not be "we get rid of any distinction and deal with all things in one amorphous concept in the wrapping work chain", but rather "I will improve my parser and make it deal with the things that it should be responsible for".
This again hits the same issue: what, in your opinion, is the responsibility of a parser? To me: I would happily parse a warning produced in the code output, but I would never dare to start implementing convergence checks in the parser, even if it were straightforward. I would rather put them at the level of some ConvergeWorkChain. The reason for this is very simple: I want to keep the parser as general as possible and make the work chains as specific as possible.
You keep making my point for me, though. Your idea of and approach to the parser seems to be that it should be as "dumb" as possible, merely transforming some data into nodes:

> This essentially means that if, for instance, CP2K produced output in JSON format, I would happily replace the whole parser with a single-line JSON loader. I would really be careful with introducing logic in the parser, because it can easily become a monster.
This is exactly the approach that I take issue with, because I think it reduces the usefulness of the `CalcJob`. If you do this, all your calculation jobs will have exit status `0` and appear as though successful, even those that clearly failed with an error. This destroys the queryability of your data. This approach should be prevented; however, by removing the distinction between sanity checks and error handlers we are directly encouraging it, as you are showing in your reasoning.
Taking the specific use case you keep bringing up: an MC simulation that cannot determine whether the requested convergence has been achieved. If that really is not possible in the general case, then I agree the parser should indeed just return `0` (if there are no other obvious problems that could be detected, that is). The next step is then also very clear: this is the perfect example of when to use a sanity check. The parser said that the calculation ran fine, but the `WorkChain` implementer knows that it is limited and needs to check whether the calculation really is converged. This is literally the definition of a sanity check.
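A minimal sketch of such a sanity check, assuming the `@register_sanity_check` decorator mentioned earlier in this thread, the report tuple sketched above, and a hypothetical `is_converged` helper:

```python
@register_sanity_check(priority=100)
def check_convergence(self, calculation):
    """Runs only after the parser reported success (exit status 0)."""
    if is_converged(calculation):  # hypothetical, code-specific convergence test
        return None  # nothing to do: the work chain can finish

    # Not converged: adapt the inputs and ask for another iteration.
    self.ctx.inputs.parameters['max_steps'] *= 2  # hypothetical knob
    return ErrorHandlerReport(do_break=False, exit_code=None)
```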
> This is exactly the approach that I take issue with, because I think it reduces the usefulness of the `CalcJob`. If you do this, all your calculation jobs will have exit status `0` and appear as though successful, even those that clearly failed with an error. This destroys the queryability of your data.
That is not true. Read my clarification:
> To me: I would happily parse a warning produced in the code output, but I would never dare to start implementing convergence checks in the parser, even if it were straightforward.
I am fine with parsing whatever the code thinks is not good and returning a non-zero exit status for that. I just don't want to introduce additional checks in the parser.
> I am fine with parsing whatever the code thinks is not good and returning a non-zero exit status for that. I just don't want to introduce additional checks in the parser.
And you are not forced to. You know the parser is limited in certain respects and will not always give the desired output, even if it says there were no problems. This is exactly when you implement a sanity check.
I think you have two different points of view, and I don't think either of them is wrong; it's just a choice of where to put the logic. I don't think further discussion on this aspect here will be useful, as you have both made your points clear. If anybody else has feedback or a point of view, please give your opinion. I just note that crystallizing this into AiiDA means pushing for one model over the other, so it would be good to hear from as many people as possible who have used them. Ideally, to help with the discussion, a simple example implemented in both ways (even as pseudocode) would be useful to see the difference in practice. Sasha, would you be able to make such an example?
Since development is an iterative process, why don't we adopt a compromise solution? I propose to implement all three things in the `inspect_process` function: *error handlers* that handle non-zero exit codes, *sanity checks* that run if no problem was detected, and *fixers* that always run. Then we let time decide which approach is better.
Alternatively, we can have two `inspect_process` functions (with slightly different names), where one runs *fixers* and the other runs *sanity checks* and *error handlers*.

I would be happy with either of those approaches. I am also happy to contribute the fixer code.
> Since development is an iterative process
Sure thing, but `aiida-core` is not the place for that. That is why I put it in `aiida-quantumespresso` to experiment. The majority of plugin developers have been using it contentedly the way it is implemented there. I therefore saw the merits of centralizing it, but it is pointless to put a Frankenstein's-monster mish-mash of approaches in `aiida-core`, because there we will be held to strict backward-compatibility requirements with respect to any changes. If we cannot agree, apparently it is not ready to be put in `aiida-core`, and we just keep using what we have so far. Luckily, the power of AiiDA is that you can customize to your heart's content.
Regardless of the disagreement over the specific error handler/fixer interface, I would argue that the benefits of including it in `aiida-core` outweigh these concerns:

- plugin developers can use the `BaseRestartWorkChain` directly from `aiida-core` rather than having to copy it from some other plugin

I don't have a particularly strong opinion either way on the "error handler + sanity check" (EH+SC) vs. "fixers" (F) debate. As far as I see it, EH+SC is a more "restricted" scheme, trading absolute freedom for something that is easier to reason about. @yakutovicha, do you have an example use case that cannot be achieved with EH+SC but can be done with F? That could help us think about the issue more clearly, and maybe come up with a satisfactory scheme that accommodates it.
> Regardless of the disagreement over the specific error handler/fixer interface, I would argue that the benefits of including it in `aiida-core` outweigh these concerns
I agree with that. If I didn't manage to convince you guys that fixers are better than error handlers/sanity checks, please go ahead and finalize the PR with the current approach.
> As far as I see it, EH+SC is a more "restricted" scheme, trading absolute freedom for something that is easier to reason about. @yakutovicha, do you have an example use case that cannot be achieved with EH+SC but can be done with F?
I think we agree that with both approaches everything can be done. The key difference here is how simple they are to use, which ultimately defines the user's satisfaction and happiness :). To me, fixers give the user ultimate control over the work chain and are quite clear in terms of what they do: they run after the process is finished, check for problems, and try to fix them if possible.
@sphuber's implementation essentially divides them into two blocks: one that runs if everything is alright, and one that runs if there was a non-zero exit code. I wanted to challenge this separation, since (a) it is only one possible way to split things in two, and (b) it introduces essentially two mutually exclusive ways to run fixers, while most of the code is just the same, which puzzles me. Also, by this separation (according to @sphuber) we force the plugin developers to make better parsers. This I also doubt, because I don't think that a "clever parser" is necessarily the best approach (see my reasoning above). I would prefer to have the ultimate freedom to shape things how I want and according to my philosophy (AiiDA is also about freedom, right?). And fixers give me such freedom.
Now, since simplicity is one of the most essential characteristics of good code, I want to share with you the implementation of fixers within the `inspect_process` function:
```python
import importlib  # at the module level

# Inside `inspect_process`:
self.ctx.is_finished = True  # assume we are done unless a fixer triggers a restart

for fxr in sorted(self.inputs.fixers.keys()):
    module_path, fixer_name, *args = self.inputs.fixers[fxr]
    module = importlib.import_module(module_path)
    fixer = getattr(module, fixer_name)

    # A fixer returns None if it took no action, 0 if it successfully fixed
    # a problem, and a non-zero exit status if it could not fix it.
    fixer_exit_status = fixer(self, calculation, *args)

    if fixer_exit_status == 0:
        self.ctx.is_finished = False  # do restart
    elif fixer_exit_status is not None:
        self.report('fixer {} found a problem it could not fix'.format(fxr))
        return fixer_exit_status
```
One could add entry points here as well, so that `module_path` is first checked in the entry points and, if the entry point does not exist, in the Python modules. Now look at the implementation in #3654 and compare. I think it speaks volumes.
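That lookup could be sketched as follows; the entry-point group name is made up for illustration:

```python
import importlib
import pkg_resources

def load_fixer(module_path, fixer_name):
    """Resolve a fixer by trying a (hypothetical) entry-point group first,
    falling back to importing it from a regular Python module."""
    for entry_point in pkg_resources.iter_entry_points('aiida.fixers', fixer_name):
        return entry_point.load()
    module = importlib.import_module(module_path)
    return getattr(module, fixer_name)
```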
Obviously, sometimes we want to restrict the execution of fixers to particular cases. For this, we can come up with a nice set of useful decorators for fixers:

```python
# --no-decorator--                 # sanity check and error handler
@run_when_is_finished_ok           # sanity check
@run_when_non_zero_exit_code       # error handler
@run_when_exit_code(exit_code)     # error handler that runs only for a given exit code
# etc...
```

This way everything will be covered, and we won't restrict the fixers to being split into the two categories named sanity checks and error handlers.
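One of these decorators could be implemented roughly like this (only `run_when_exit_code` is shown; the others would follow the same pattern):

```python
import functools

def run_when_exit_code(status):
    """Make a fixer run only when the calculation finished with exit status `status`."""
    def decorator(fixer):
        @functools.wraps(fixer)
        def wrapper(workchain, calculation, *args):
            if calculation.exit_status != status:
                return None  # no action taken: this fixer does not apply
            return fixer(workchain, calculation, *args)
        return wrapper
    return decorator
```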
Thanks for the examples, Sasha. I have updated the PR with a commit that unifies the error handler and sanity check concepts. I still have to check whether this covers all the use cases I have and whether it gives the user enough freedom to determine the flow of the work chain. Let me know what everyone thinks of this. Note that the docstrings still need to be updated, but I will do this once we decide on this.
To come back to some of your comments:
> I think we agree that with both approaches everything can be done.
Then we agree that it is merely a different way of doing things. You claimed in the beginning, however, that my approach didn't allow you to do certain things, and if I am not mistaken you still have not given a concrete example of this.
> The key difference here is how simple they are to use, which ultimately defines the user's satisfaction and happiness :).
Simplicity is certainly good, but not at the cost of everything else. Providing users with tools that automatically steer them in the right direction is also important. There are already quite a few examples of this approach in AiiDA.
> Also, by this separation (according to @sphuber) we force the plugin developers to make better parsers. This I also doubt, because I don't think that a "clever parser" is necessarily the best approach (see my reasoning above).
Note that the base restart concept is not just about wrapping calculation jobs but any process. I think the exit code of a process should mean something, and by throwing all codes on the same heap you are essentially saying: I don't care what the process told me. Now, sometimes you can have a good reason for this (sanity checks), but I think in practice it should be the exception. This is why I advocate that parsers provide as much information as is reasonably possible. All of this works towards a system where the responsibilities of the different processes are well defined, which, I am convinced, leads to more robust and modular workflows.
> I would prefer to have the ultimate freedom to shape things how I want and according to my philosophy (AiiDA is also about freedom, right?). And fixers give me such freedom.
Even with the fixers there would still be limitations, just different ones. But even then, one could still override the `inspect_process` method and do whatever one wants. This really is not a question of restricting freedom just for the sake of restricting it.
> Now, since simplicity is one of the most essential characteristics of good code, I want to share with you the implementation of fixers within the `inspect_process` function: [...] One could add entry points here as well, so that `module_path` is first checked in the entry points and, if the entry point does not exist, in the Python modules. Now look at the implementation in #3654 and compare. I think it speaks volumes.
Yes, code simplicity is in principle a good thing to strive for, but, as said before, not at the cost of everything. Reducing the complexity of code is useless if you do not keep the same logic that it encodes. Your code is indeed simpler, but in large part just because it oversimplifies the logic.
Let's see if with the updated PR we can all do what we would like to do and go from there.
> I have updated the PR with a commit that unifies the error handler and sanity check concepts.
Thanks, @sphuber, for merging them. It looks very nice, simple, and clear to me!
> ... but you claimed in the beginning that my approach didn't allow you to do certain things ...

My only comment in this direction was that I wasn't sure how I could force a function to always be executed (independently of the process exit code). But you answered that:

> You can pass the same function both as an error handler as well as a sanity check, there is no limitation.
I have a few technical questions, but I will ask them in the PR review. Conceptually it now looks perfect to me.
Currently, the `BaseRestartWorkChain` work chain is used in the aiida-qe, aiida-cp2k, aiida-vasp, aiida-crystal17, and aiida-raspa plugins. They all started from the same source (aiida-qe), but then slightly diverged. Since the functionality of the `BaseRestartWorkChain` is useful for different packages, it makes sense to move it to one central place (aiida-core).