grondo opened this issue 2 years ago
Should we think about a multi-phase shutdown sequence for Flux that ensures new perilogs don't start and allows running ones to finish under a timeout?
Maybe related to #3895?
That sounds like a good idea, but it doesn't help when the broker crashes; in that case we'd still need one of the solutions above, or a job could get stuck forever.
This does remind me that another issue with a clean broker shutdown is that I think all executing subprocesses are currently killed by the broker. Your idea would certainly resolve that issue. However, if we already need to allow these subprocesses to survive a broker crash, we get survival of broker restart for free... (just have to allow some subprocesses to bypass the auto-kill-on-shutdown in the broker.exec service)
It occurred to me that the current solution for job prolog/epilog uses `flux perilog-run`, which in turn (optionally) runs the per-rank prolog/epilog with `flux exec`. If the broker crashes or restarts, `flux exec` will terminate and in turn lose track of the executing node epilog/prolog. So, while having the job-manager prolog/epilog "check in" at exit could solve the problem when no per-rank perilogs are executed, it doesn't help our current case.
In the long term, the rank-local `job-exec` module will execute the prolog/epilog under systemd when `use-systemd = true` is configured, so this problem will eventually have a solution.
For now, it does seem like we'll need to support a safe or phased shutdown as suggested by @garlick, in order to at least allow rank 0 to be restarted manually in case of a configuration change, upgrade, or problem resolution. I don't have a good feel for where to start on implementing this, though.
To handle the case of an unplanned broker restart, perhaps the job manager should raise an exception on any jobs that have the `perilog_active` flag set after a restart (I think to support that, the job manager would need to iterate all active jobs after restarting from the KVS).
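Something along these lines could also be approximated administratively after an unplanned restart. The sketch below is illustrative only (it is not the proposed job-manager change): it looks for active jobs whose eventlog has a prolog-start or epilog-start event with no matching finish event and raises an exception on them. The `perilog-lost` exception type is invented here, and inferring the stuck state from the eventlog rather than the internal `perilog_active` flag is an assumption:

```python
#!/usr/bin/env python3
# Illustrative stopgap: after an unplanned rank 0 restart, raise an exception
# on active jobs that appear stuck with an unmatched prolog/epilog-start event.
# The "perilog-lost" exception type is made up for this example.
import subprocess


def active_jobs():
    """List active job ids for all users via flux jobs."""
    out = subprocess.run(
        ["flux", "jobs", "-A", "-n", "-o", "{id}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()


def unmatched_perilog(jobid):
    """True if the job eventlog has a *-start event without a matching *-finish."""
    eventlog = subprocess.run(
        ["flux", "job", "eventlog", jobid],
        capture_output=True, text=True, check=True,
    ).stdout
    # eventlog lines are "timestamp name context", so field 1 is the event name
    names = [line.split()[1] for line in eventlog.splitlines() if line.strip()]
    return any(
        names.count(f"{kind}-start") > names.count(f"{kind}-finish")
        for kind in ("prolog", "epilog")
    )


if __name__ == "__main__":
    for jobid in active_jobs():
        if unmatched_perilog(jobid):
            subprocess.run(
                ["flux", "job", "raise", "--type=perilog-lost",
                 "--severity=0", jobid],
                check=True,
            )
```

Doing this inside the job manager itself, keyed off `perilog_active` as suggested above, would of course be more robust.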
Problem: The `perilog` jobtap plugin uses libsubprocess to execute and monitor job prolog/epilog commands on rank 0. This means that when the rank 0 broker restarts, these processes are lost and jobs could get stuck with pending perilog actions.

One possible solution would be to add a service endpoint to the `perilog` plugin which would accept "finish" notifications via a direct RPC. The existing `perilog-run` helper script could then send this RPC explicitly before it exits. If there is a failure to open a Flux handle at the end of the script, the script could block until a handle is available, on the assumption that the rank 0 broker is restarting. The `perilog` plugin could still monitor subprocesses as it does now, as a fallback.

This RPC would also allow manual release of prolog/epilog-start events in case something goes wrong (e.g. the perilog-run script exits or dies after a restart before it can send the finish RPC).
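For illustration only, the helper-script side of the proposed "finish" RPC might look roughly like the sketch below. The `job-manager.perilog.finish` topic, its payload fields, and the error handling are all assumptions about an interface that doesn't exist yet:

```python
#!/usr/bin/env python3
# Hypothetical sketch: tell the perilog jobtap plugin that a prolog/epilog
# has finished, retrying while the rank 0 broker is down/restarting.
# The "job-manager.perilog.finish" topic and payload are assumptions.
import sys
import time

import flux
from flux.job import JobID


def send_finish(jobid, kind, status, retries=60, delay=5.0):
    """Send a finish notification, retrying across a broker restart."""
    last_error = None
    for _ in range(retries):
        try:
            h = flux.Flux()  # may fail while rank 0 is restarting
            h.rpc(
                "job-manager.perilog.finish",  # hypothetical service method
                {"id": int(jobid), "kind": kind, "status": status},
            ).get()
            return True
        except OSError as exc:  # assuming errno-style failures surface as OSError
            last_error = exc
            time.sleep(delay)
    print(f"giving up after {retries} attempts: {last_error}", file=sys.stderr)
    return False


if __name__ == "__main__":
    # usage (hypothetical): perilog-finish JOBID prolog|epilog STATUS
    jobid = JobID(sys.argv[1])
    kind = sys.argv[2]
    status = int(sys.argv[3])
    sys.exit(0 if send_finish(jobid, kind, status) else 1)
```

The retry loop is what would let the script ride out a rank 0 restart: `flux.Flux()` fails while the broker is unavailable, and the notification is only sent once a handle can be opened again, matching the "block until a handle is available" idea above.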
The other possible solution, of course, would be to switch the `perilog` plugin, or a clone of it, to use `libsdexec` instead of libsubprocess. I'm not sure that would be quite as easy as the first solution proposed above.