BigDFT-group / aiida-bigdft-plugin-legacy

BigDFT plugin for AiiDA
MIT License

`BigDftCalculation`: return exit code if the calculation ran out of walltime #3

Open sphuber opened 3 years ago

sphuber commented 3 years ago

Currently, if the job is killed by the scheduler, the parser doesn't return an error and so the calcjob is marked with exit code 0, indicating success. It would be good to add an exit code for this. I recommend using `ERROR_OUT_OF_WALLTIME` (400), which multiple plugins have now adopted as a standard. This will also make it easy to add a process handler to the base work chain that simply restarts the calculation in this case. But even without that, it will be immediately obvious that the calculation actually failed, and a user is not misled into thinking everything went fine.
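For reference, a minimal sketch of what that could look like on the plugin side, assuming the standard `spec.exit_code` API of aiida-core; the class body shown is illustrative and not the plugin's actual code:

```python
from aiida.engine import CalcJob


class BigDftCalculation(CalcJob):
    """Sketch only: just the part relevant to the new exit code."""

    @classmethod
    def define(cls, spec):
        super().define(spec)
        # ... existing input/output definitions of the plugin go here ...
        spec.exit_code(
            400, 'ERROR_OUT_OF_WALLTIME',
            message='The calculation stopped prematurely because it ran out of walltime.')

# In the parser, once a premature termination is detected, one would then return
#     self.exit_codes.ERROR_OUT_OF_WALLTIME
# so the calcjob is no longer marked as finished with exit code 0.
```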

adegomme commented 3 years ago

How do other plugins detect a timeout? In my understanding it is really scheduler-dependent, and we don't have any specific timeout message in the BigDFT output (the code does not know it has been killed, same as with an OOM kill). The message in the stdout/stderr files depends on the scheduler/machine used, so I wouldn't rely on parsing it, or have each plugin do it.

I see that ERROR_SCHEDULER_OUT_OF_WALLTIME was introduced in aiida-core for SLURM a few months ago, but it does not seem to be triggered in this case, and it is not used by other plugins? As we use SLURM in Quantum Mobile, I would expect it to be reported, so that we can eventually catch it and decide whether we want to restart or not, or just let it display a relevant message.

sphuber commented 3 years ago

You basically have a few options:

sphuber commented 3 years ago

The only thing that surprises me is that I think in any case a log message should be recorded if the scheduler parser detects a problem, so there should be a trace of it in the logs even if it gets overwritten by the parser. I will perform some tests to see what the problem is here.

The reason this didn't get caught is that the scheduler parser output feature is only triggered when the detailed job info is available, which for SLURM is only possible when the accounting option is enabled. This is not the default for an installation, since it requires more setup and databases, and so Quantum Mobile does not have it. The reason it needs that info, and doesn't just parse the stderr, is that the format of the error messages in stderr is not guaranteed to be consistent across versions, so that parsing could be very unstable.
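For context, a hedged sketch of the kind of check that accounting data makes possible; the dictionary layout and field name below are illustrative and not necessarily how aiida-core structures the detailed job info:

```python
def killed_for_walltime(detailed_job_info: dict) -> bool:
    """Heuristic check on the accounting record of a finished SLURM job.

    With accounting enabled, `sacct` reports a job state such as TIMEOUT when
    the walltime limit was hit; the 'State' key here is illustrative.
    """
    return detailed_job_info.get('State', '').startswith('TIMEOUT')
```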

adegomme commented 3 years ago

I will have to go for the generic failure indeed, as BigDFT does not offer a walltime limit to exit gracefully. All we can check is that the final lines are not printed. Sometimes the log is still valid if only the cleanup part was cut; that is the reason I didn't report a crash in the first place, to still provide outputs when possible. But that was a bad idea, indeed. And this leads to the restart issue: OOM kills or other failures are not restartable, while a walltime kill should be, as you said. I may still add an attempt at parsing stderr/stdout for keywords, but this would indeed be better done at the scheduler level, even if naive and imperfect (it's better to catch some cases and allow them to restart than to catch none and prevent restarting, I think).

I did a little digging, and from what I see, the common walltime-exceeded messages for the supported schedulers are:

- SLURM: "CANCELLED AT xx DUE TO TIME LIMIT" https://github.com/SchedMD/slurm/blob/master/src/slurmd/slurmstepd/req.c#L747
- PBS: "PBS: job killed: walltime xx exceeded limit" https://github.com/openpbs/openpbs/blob/master/src/resmom/mom_main.c#L6040
- SGE/UGE (probably): "exceeded hard wallclock time" https://github.com/gridengine/gridengine/blob/master/source/daemons/execd/msg_execd.h#L220
- LSF: "TERM_RUNLIMIT: job killed after reaching LSF run time limit" https://www.ibm.com/support/pages/termrunlimit-job-killed-after-reaching-lsf-run-time-limit-exited-exit-code-140
- Torque: "walltime xx exceeded limit xx" https://github.com/adaptivecomputing/torque/blob/master/src/resmom/mom_main.c#L3864

Maybe storing some of these regexps in each scheduler plugin and checking them would still be better than nothing?
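As a rough illustration of that idea, a naive check along these lines could be run over the scheduler stderr; the patterns mirror the messages listed above, and the helper name is made up:

```python
import re

# Naive patterns mirroring the scheduler messages listed above (illustrative, not exhaustive).
WALLTIME_PATTERNS = (
    r'DUE TO TIME LIMIT',              # SLURM
    r'walltime .* exceeded limit',     # PBS / Torque
    r'exceeded hard wallclock time',   # SGE / UGE
    r'TERM_RUNLIMIT',                  # LSF
)


def looks_like_walltime_kill(scheduler_stderr: str) -> bool:
    """Return True if the scheduler stderr contains a known out-of-walltime message."""
    return any(re.search(pattern, scheduler_stderr) for pattern in WALLTIME_PATTERNS)
```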

sphuber commented 3 years ago

> Sometimes the log is still valid if only the cleanup part was cut; that is the reason I didn't report a crash in the first place, to still provide outputs when possible. But that was a bad idea, indeed.

If you can parse useful information and attach it as output nodes, I would always do this. It doesn't prevent you from returning a non-zero exit code afterwards; this is what I do for aiida-quantumespresso. Note that a non-zero exit code doesn't even necessarily mean that the calculation "failed": I have some codes that merely indicate some special edge case occurred, while the calculation and its results can in many cases still be used just fine. The exit code is just there to easily distinguish these cases and allow anyone who needs or wants to react to it to do so easily.
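A hedged sketch of that pattern, where the helper and the flag below are made up and not the plugin's actual parsing logic:

```python
from aiida.orm import Dict
from aiida.parsers import Parser


class BigDftParser(Parser):
    """Sketch only: attach outputs first, decide on the exit code afterwards."""

    def parse(self, **kwargs):
        results = self._parse_logfile()

        # Attach whatever could be parsed, even if the run was cut short.
        self.out('output_parameters', Dict(dict=results))

        # Only afterwards decide whether to return a non-zero exit code.
        if results.get('terminated_prematurely'):  # hypothetical flag
            return self.exit_codes.ERROR_OUT_OF_WALLTIME

    def _parse_logfile(self):
        """Placeholder for the actual log parsing; returns a plain dict of results."""
        return {'terminated_prematurely': False}
```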

> But this would indeed be better done at the scheduler level, even if naive and imperfect (it's better to catch some cases and allow them to restart than to catch none and prevent restarting, I think).

When I implemented the scheduler parsing in aiida-core, I actually stayed away from parsing the text in the scheduler stdout and stderr. Experience teaches that parsing normal text without formal structure is very fragile and error-prone. For that reason, I decided that for now we only parse the official status returned by sacct, which uses error codes for the various exit modes, such as out-of-memory and out-of-walltime. For Quantum ESPRESSO this was fine: since it has the soft wallclock mechanism, we had very few "hard" out-of-walltime kills by the scheduler. But I see the point for other codes that might not have it. I will open an issue on aiida-core to raise your idea of adding stdout and stderr parsing and see if people agree to add it.
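For what it's worth, the "soft wallclock" idea simply means deriving an in-code time budget from the scheduler walltime, so the code can stop and write its outputs before being killed; a trivial sketch with an illustrative safety factor:

```python
def soft_walltime(max_wallclock_seconds: int, safety_factor: float = 0.95) -> int:
    """Time budget to hand to the code itself so it can exit gracefully
    before the scheduler enforces the hard limit (factor is illustrative)."""
    return int(max_wallclock_seconds * safety_factor)
```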