Closed sphuber closed 2 years ago
Yep sounds good, in aiida-crystal17 I had already specifically added OOW and OOM parsing in the calc parser for PBS (as thats the scheduler I was using): https://github.com/aiidaplugins/aiida-crystal17/blob/master/aiida_crystal17/parsers/raw/pbs.py
I would think its more precise to use None
for empty variables
I would think its more precise to use
None
for empty variables
I agree, except here that could potentially be breaking, since the implementations may rely on the fact that currently a string is always passed for stdout
and stderr
. If we don't want to break this, we have to use something else than None
.
Indeed I feel that this should be attempted in aiida-core (which knows the scheduler and can get the wanted regexp directly) and not separately on most of the code plugins, which would have to query the code/computer for the used scheduler or try every single one they can think of.
For OOM messages, I did a little digging yesterday and went for these
oom_messages = {
'[oO]ut [oO]f [mM]emory',
'oom-kill', # generic OOM messages
'Exceeded .* memory limit', # slurm
'exceeds job hard limit .*mem.* of queue', # UGE
'TERM_MEMLIMIT: job killed after reaching LSF memory usage limit', # LFS
'mem .* exceeded limit', # PBS/Torque
}
If people see different messages on their systems, maybe they could add them in this conversation. I didn't see localized versions, for example, and I'm curious to know if some computing centers speak a different language by default (but maybe aiida can/does already ask for english) .
I agree that passing an empty string for stdout/stderr is not a very good idea (empty stderr typically should mean no error printed). I would go for None, and check which plugins might be using the feature (there are very few scheduler plugins out there, we should check the plugin registry; and probably even less that use this feature, and we can ask them to change). Also, if this ends up in 2.0 (probable), it's ok to change the interface.
Good point about localisation - at the moment I don't think AiiDA asks to change localisation back to English; to do this one should probably set the proper environment variables (I guess the LC_*
); it's also true that nobody yet complained about this, so my guess is that in practice most supercomputers always use English.
Practically, I would probably start the easy way and support English as this is what we need most, and if someone reports that they have a different language, we investigate further how to ensure we ask for an English reply
+1 for None
and a big +1 for adding this feature - this is essential for high-throughput in cases where sacct
is not available
@sphuber would you be able to start with a small PR that makes the interface-level changes you envision, enabling others to follow up with the changes for the various scheduler plugins?
Thanks to @sphuber , this issue has been addressed for SLURM in #5458 .
Analogous fixes for the other schedulers should be straightforward -
Walltime:
Slurm : "CANCELLED AT xx DUE TO TIME LIMIT" https://github.com/SchedMD/slurm/blob/master/src/slurmd/slurmstepd/req.c#L747 PBS : "PBS: job killed: walltime xx exceeded limit " https://github.com/openpbs/openpbs/blob/master/src/resmom/mom_main.c#L6040 SGE/UGE might be "exceeded hard wallclock time" https://github.com/gridengine/gridengine/blob/master/source/daemons/execd/msg_execd.h#L220 LSF : "TERM_RUNLIMIT: job killed after reaching LSF run time limit" https://www.ibm.com/support/pages/termrunlimit-job-killed-after-reaching-lsf-run-time-limit-exited-exit-code-140 Torque : "walltime xx exceeded limit xx" https://github.com/adaptivecomputing/torque/blob/master/src/resmom/mom_main.c#L3864
Memory
oom_messages = {
'[oO]ut [oO]f [mM]emory',
'oom-kill', # generic OOM messages
'Exceeded .* memory limit', # slurm
'exceeds job hard limit .*mem.* of queue', # UGE
'TERM_MEMLIMIT: job killed after reaching LSF memory usage limit', # LFS
'mem .* exceeded limit', # PBS/Torque
}
When we do (I might get round to it some time later this week), let's also fix this small issue that I overlooked https://github.com/aiidateam/aiida-core/pull/5458#discussion_r831586851 and add some tests
When we implemented the
Scheduler.parse_output
functionality to let scheduler plugins parse the scheduler output in order to detect potential problems, we decided to not rely exclusively on the text in the stdout and stderr streams, since parsing information reliably from these unstructured text sources is error prone and unreliable. For SLURM, OOM and OOW problems are detected exclusively from the detailed job info, which is reported by thesacct
command. However, this is not a standard feature for the SLURM scheduler and not all installations have it. This causes OOW and OOM errors to get passed unnoticed for many installations. The question is now if we should relax this strict condition of requiring thedetailed_job_info
in theScheduler.parse_output
call and allow to also parse from the text.@adegomme suggested the following regexs for a number of schedulers in this issue:
Slurm : "CANCELLED AT xx DUE TO TIME LIMIT" https://github.com/SchedMD/slurm/blob/master/src/slurmd/slurmstepd/req.c#L747 PBS : "PBS: job killed: walltime xx exceeded limit " https://github.com/openpbs/openpbs/blob/master/src/resmom/mom_main.c#L6040 SGE/UGE might be "exceeded hard wallclock time" https://github.com/gridengine/gridengine/blob/master/source/daemons/execd/msg_execd.h#L220 LSF : "TERM_RUNLIMIT: job killed after reaching LSF run time limit" https://www.ibm.com/support/pages/termrunlimit-job-killed-after-reaching-lsf-run-time-limit-exited-exit-code-140 Torque : "walltime xx exceeded limit xx" https://github.com/adaptivecomputing/torque/blob/master/src/resmom/mom_main.c#L3864
Implementing these should hopefully just add a lot of chances of detecting OOW even though some might still be missed if the format is different for other versions of the schedulers. At least, hopefully it should not report false positives.
The one technical question is that we would have to change the interface. Currently we have:
so all arguments are required, but we would have to make them all optional. I am not sure how many plugins outside of
aiida-core
already rely on this, but we should be careful with breaking this. Maybe it would be possible to pass empty defaults if any of them could not be retrieved, i.e. an empty dict fordetailed_job_info
and an empty string forstdout
andstderr
?Pinging @mbercx @giovannipizzi