@ryanday36 had asked if there was an epilog timeout that could be enforced by Flux.
Currently, there is no way to enforce a timeout for either the prolog or epilog. There are a couple of ways support could be added.
First, an option could be added to the flux perilog-run command that puts a time limit on execution and drains any ranks that time out. This option could then be passed in via the job-manager.prolog.command or job-manager.epilog.command configuration.
Second, a timelimit option could be added to the job-manager.prolog and job-manager.epilog configuration. If present, the perilog.so plugin would enforce the time limit. This would work for any prolog/epilog command, not just flux perilog-run, but if the command timed out, the plugin may not know which broker ranks were still in progress, so all ranks may have to be drained.
I also just thought of a 3rd option which is nice in its simplicity:
The flux perilog-run script could be given a handler for SIGTERM/SIGALRM that terminates any running prolog/epilog processes and drains those ranks.
Other prolog/epilog scripts run via job-manager.prolog or job-manager.epilog should behave similarly if they do any kind of per-rank operation.
When a timeout is desired, the command in job-manager.epilog could be modified to run under the timeout(1) command, e.g.:
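As a rough sketch of how the third option might fit together: the configured command would be prefixed with something like `timeout --signal=TERM 300` (the duration is purely illustrative), and the script would react to the resulting signal roughly as below. This is not the flux perilog-run implementation. `PerRankRunner` and `drain` are hypothetical names, the per-rank work is simulated with local child processes, and the actual drain step is left as a placeholder.

```python
import signal
import subprocess
import sys


class PerRankRunner:
    """Hypothetical stand-in for a script that runs one process per broker rank."""

    def __init__(self):
        self.procs = {}      # rank -> subprocess.Popen
        self.timed_out = []  # ranks whose process was still running at signal time

    def start(self, rank, argv):
        self.procs[rank] = subprocess.Popen(argv)

    def install_handlers(self):
        # React the same way to SIGTERM (e.g. from timeout(1)) and SIGALRM.
        for sig in (signal.SIGTERM, signal.SIGALRM):
            signal.signal(sig, self._on_signal)

    def _on_signal(self, signum, frame):
        # Terminate every process that has not exited yet and remember its rank.
        for rank, proc in self.procs.items():
            if proc.poll() is None and rank not in self.timed_out:
                proc.terminate()
                self.timed_out.append(rank)

    def wait(self):
        # Reap all processes, then report the ranks that should be drained.
        for proc in self.procs.values():
            proc.wait()
        return sorted(self.timed_out)


def drain(ranks):
    # Placeholder only (assumption): a real script would ask Flux to drain these ranks.
    print("would drain ranks:", ranks)


if __name__ == "__main__":
    runner = PerRankRunner()
    runner.install_handlers()
    runner.start(0, [sys.executable, "-c", "pass"])                         # finishes quickly
    runner.start(1, [sys.executable, "-c", "import time; time.sleep(60)"])  # "hung" rank
    runner.procs[0].wait()
    # A short interval timer stands in for timeout(1) delivering a signal.
    signal.setitimer(signal.ITIMER_REAL, 0.5)
    drain(runner.wait())
```

In the demo, rank 0 has already exited when the signal arrives, so only the still-running rank is terminated and reported. The appeal of this approach is that the script itself knows which ranks were outstanding, which is exactly the information the plugin-side option would lack.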