BigDFT-group / aiida-bigdft-plugin-legacy

BigDFT plugin for AiiDA
MIT License

`BigDftCalculation`: return exit code if the calculation ran out of walltime #3

Open sphuber opened 3 years ago

sphuber commented 3 years ago

Currently, if the job is killed by the scheduler, the parser doesn't return an error and so the calcjob is marked with exit code 0, indicating success. It would be good to add an exit code for this. I recommend using `ERROR_OUT_OF_WALLTIME` (400), which multiple plugins have now adopted as a standard. This will also make it easy to add a process handler to the base work chain that simply restarts the calculation in this case. But even without that, it will be immediately obvious that the calculation actually failed, and a user is not misled into thinking everything went fine.
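For reference, a minimal sketch of what that could look like on the plugin side, assuming the standard `spec.exit_code` API of aiida-core; the class body shown is illustrative and not the plugin's actual code:

```python
from aiida.engine import CalcJob


class BigDftCalculation(CalcJob):
    """Sketch only: just the part relevant to the new exit code."""

    @classmethod
    def define(cls, spec):
        super().define(spec)
        # ... existing input/output definitions of the plugin go here ...
        spec.exit_code(
            400, 'ERROR_OUT_OF_WALLTIME',
            message='The calculation stopped prematurely because it ran out of walltime.')

# In the parser, once a premature termination is detected, one would then return
#     self.exit_codes.ERROR_OUT_OF_WALLTIME
# so the calcjob is no longer marked as finished with exit code 0.
```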

adegomme commented 3 years ago

How do other plugins detect a timeout? In my understanding it is really scheduler-dependent, and we don't have any specific timeout message in the BigDFT output (the code does not know it has been killed, same as with an OOM kill). The message in the stdout/stderr files depends on the scheduler/machine used, so I wouldn't rely on parsing it, or have each plugin do it.

I see that ERROR_SCHEDULER_OUT_OF_WALLTIME was introduced in aiida-core for SLURM a few months ago, but it does not seem to be triggered in this case, and it is not used by other plugins? As we use SLURM in Quantum Mobile, I would expect it to be reported, so that we can eventually catch it and decide whether we want to restart or not, or just let it display a relevant message.

sphuber commented 3 years ago

You basically have a few options:

sphuber commented 3 years ago

The only thing that surprises me is that I think in any case a log message should be recorded if the scheduler parser detects a problem, so there should be a trace of it in the logs even if it gets overwritten by the parser. I will perform some tests to see what the problem is here.

The reason this didn't get caught is that the scheduler parser output feature is only triggered when the detailed job info is available, which for SLURM is only possible when the accounting option is enabled. This is not the default for an installation, since it requires more setup and databases, and so Quantum Mobile does not have it. The reason it needs that info, and doesn't just parse the stderr, is that the format of the error messages in stderr is not guaranteed to be consistent across versions, so that parsing could be very unstable.
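For context, a hedged sketch of the kind of check that accounting data makes possible; the dictionary layout and field name below are illustrative and not necessarily how aiida-core structures the detailed job info:

```python
def killed_for_walltime(detailed_job_info: dict) -> bool:
    """Heuristic check on the accounting record of a finished SLURM job.

    With accounting enabled, `sacct` reports a job state such as TIMEOUT when
    the walltime limit was hit; the 'State' key here is illustrative.
    """
    return detailed_job_info.get('State', '').startswith('TIMEOUT')
```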

adegomme commented 3 years ago

I will have to go for the generic failure indeed, as BigDFT does not offer a walltime limit to exit gracefully. All we can check is that the final lines are not printed. Sometimes the log is still valid if only the cleanup part was cut; that is the reason I didn't report a crash in the first place, to still provide outputs when possible. But that was a bad idea, indeed. And this leads to the restart issue: OOM kills or other failures are not restartable, while a walltime kill should be, as you said. I may still add an attempt at parsing stderr/stdout for keywords, but this would indeed be better done at the scheduler level, even if naive and imperfect (it's better to catch some cases and allow them to restart than to catch none and prevent restarting, I think).

I did a little digging, and from what I see, the common walltime-exceeded messages for the supported schedulers are:

- SLURM: "CANCELLED AT xx DUE TO TIME LIMIT" https://github.com/SchedMD/slurm/blob/master/src/slurmd/slurmstepd/req.c#L747
- PBS: "PBS: job killed: walltime xx exceeded limit" https://github.com/openpbs/openpbs/blob/master/src/resmom/mom_main.c#L6040
- SGE/UGE (probably): "exceeded hard wallclock time" https://github.com/gridengine/gridengine/blob/master/source/daemons/execd/msg_execd.h#L220
- LSF: "TERM_RUNLIMIT: job killed after reaching LSF run time limit" https://www.ibm.com/support/pages/termrunlimit-job-killed-after-reaching-lsf-run-time-limit-exited-exit-code-140
- Torque: "walltime xx exceeded limit xx" https://github.com/adaptivecomputing/torque/blob/master/src/resmom/mom_main.c#L3864

Maybe storing some of these regexps in each scheduler plugin and checking them would still be better than nothing?
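As a rough illustration of that idea, a naive check along these lines could be run over the scheduler stderr; the patterns mirror the messages listed above, and the helper name is made up:

```python
import re

# Naive patterns mirroring the scheduler messages listed above (illustrative, not exhaustive).
WALLTIME_PATTERNS = (
    r'DUE TO TIME LIMIT',              # SLURM
    r'walltime .* exceeded limit',     # PBS / Torque
    r'exceeded hard wallclock time',   # SGE / UGE
    r'TERM_RUNLIMIT',                  # LSF
)


def looks_like_walltime_kill(scheduler_stderr: str) -> bool:
    """Return True if the scheduler stderr contains a known out-of-walltime message."""
    return any(re.search(pattern, scheduler_stderr) for pattern in WALLTIME_PATTERNS)
```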

sphuber commented 3 years ago

> Sometimes the log is still valid if only the cleanup part was cut; that is the reason I didn't report a crash in the first place, to still provide outputs when possible. But that was a bad idea, indeed.

If you can parse useful information and attach it as output nodes, I would always do this. It doesn't prevent you from returning a non-zero exit code afterwards; this is what I do for aiida-quantumespresso. Note that a non-zero exit code doesn't even necessarily mean that the calculation "failed": I have some codes that merely indicate some special edge case occurred, while the calculation and its results can in many cases still be used just fine. The exit code is just there to easily distinguish these cases and allow anyone who needs or wants to react to it to do so easily.
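A hedged sketch of that pattern, where the helper and the flag below are made up and not the plugin's actual parsing logic:

```python
from aiida.orm import Dict
from aiida.parsers import Parser


class BigDftParser(Parser):
    """Sketch only: attach outputs first, decide on the exit code afterwards."""

    def parse(self, **kwargs):
        results = self._parse_logfile()

        # Attach whatever could be parsed, even if the run was cut short.
        self.out('output_parameters', Dict(dict=results))

        # Only afterwards decide whether to return a non-zero exit code.
        if results.get('terminated_prematurely'):  # hypothetical flag
            return self.exit_codes.ERROR_OUT_OF_WALLTIME

    def _parse_logfile(self):
        """Placeholder for the actual log parsing; returns a plain dict of results."""
        return {'terminated_prematurely': False}
```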

> But this would indeed be better done at the scheduler level, even if naive and imperfect (it's better to catch some cases and allow them to restart than to catch none and prevent restarting, I think).

When I implemented the scheduler parsing in aiida-core, I actually stayed away from parsing the text in the scheduler stdout and stderr. Experience teaches that parsing normal text without formal structure is very fragile and error-prone. For that reason, I decided that for now we only parse the official status returned by sacct, which uses error codes for the various exit modes, such as out-of-memory and out-of-walltime. For Quantum ESPRESSO this was fine: since it has the soft wallclock mechanism, we had very few "hard" out-of-walltime kills by the scheduler. But I see the point for other codes that might not have it. I will open an issue on aiida-core to raise your idea of adding stdout and stderr parsing and see if people agree to add it.
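For what it's worth, the "soft wallclock" idea simply means deriving an in-code time budget from the scheduler walltime, so the code can stop and write its outputs before being killed; a trivial sketch with an illustrative safety factor:

```python
def soft_walltime(max_wallclock_seconds: int, safety_factor: float = 0.95) -> int:
    """Time budget to hand to the code itself so it can exit gracefully
    before the scheduler enforces the hard limit (factor is illustrative)."""
    return int(max_wallclock_seconds * safety_factor)
```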