grondo opened this issue 2 years ago
Should we think about a multi-phase shutdown sequence for Flux that ensures new perilogs don't start and allows running ones to finish under a timeout?
Maybe related to #3895?
That sounds like a good idea, but it doesn't help when the broker crashes; in that case we'd still need one of the solutions above, or a job could get stuck forever.
This does remind me that another issue with a clean broker shutdown is that I think all executing subprocesses are currently killed by the broker. Your idea would certainly resolve that issue. However, if we already need to allow these subprocesses to survive a broker crash, we get survival of broker restart for free... (just have to allow some subprocesses to bypass the auto-kill-on-shutdown in the broker.exec service)
It occurred to me that the current solution for job prolog/epilog uses `flux perilog-run`, which in turn (optionally) runs the per-rank prolog/epilog with `flux exec`. If the broker crashes or restarts, `flux exec` will terminate and in turn lose track of the executing node epilog/prolog. So, while having the job-manager prolog/epilog "check in" at exit could solve the problem when no per-rank perilogs are executed, it doesn't help our current case.
In the long term, the rank-local `job-exec` module will execute the prolog/epilog under systemd when `use-systemd = true` is configured, so this problem will eventually have a solution.
For now, it does seem like we'll need to support a safe or phased shutdown as suggested by @garlick, in order to at least allow rank 0 to be restarted manually in case of a configuration change, upgrade, or problem resolution. I don't have a good feel for where to start on implementing this, though.
To handle the case of an unplanned broker restart, perhaps the job manager should raise an exception on any jobs that have the `perilog_active` flag set after a restart (I think to support that, the job manager would need to iterate all active jobs after restarting from the KVS).
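Something along these lines could also be approximated administratively after an unplanned restart. The sketch below is illustrative only (it is not the proposed job-manager change): it looks for active jobs whose eventlog has a prolog-start or epilog-start event with no matching finish event and raises an exception on them. The `perilog-lost` exception type is invented here, and inferring the stuck state from the eventlog rather than the internal `perilog_active` flag is an assumption:

```python
#!/usr/bin/env python3
# Illustrative stopgap: after an unplanned rank 0 restart, raise an exception
# on active jobs that appear stuck with an unmatched prolog/epilog-start event.
# The "perilog-lost" exception type is made up for this example.
import subprocess


def active_jobs():
    """List active job ids for all users via flux jobs."""
    out = subprocess.run(
        ["flux", "jobs", "-A", "-n", "-o", "{id}"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split()


def unmatched_perilog(jobid):
    """True if the job eventlog has a *-start event without a matching *-finish."""
    eventlog = subprocess.run(
        ["flux", "job", "eventlog", jobid],
        capture_output=True, text=True, check=True,
    ).stdout
    # eventlog lines are "timestamp name context", so field 1 is the event name
    names = [line.split()[1] for line in eventlog.splitlines() if line.strip()]
    return any(
        names.count(f"{kind}-start") > names.count(f"{kind}-finish")
        for kind in ("prolog", "epilog")
    )


if __name__ == "__main__":
    for jobid in active_jobs():
        if unmatched_perilog(jobid):
            subprocess.run(
                ["flux", "job", "raise", "--type=perilog-lost",
                 "--severity=0", jobid],
                check=True,
            )
```

Doing this inside the job manager itself, keyed off `perilog_active` as suggested above, would of course be more robust.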
Problem: The `perilog` jobtap plugin uses libsubprocess to execute and monitor job prolog/epilog commands on rank 0. This means that when the rank 0 broker restarts, these processes are lost and jobs could get stuck with pending perilog actions.

One possible solution would be to add a service endpoint to the `perilog` plugin which would accept "finish" notifications via a direct RPC. The existing `perilog-run` helper script could then send this RPC explicitly before it exits. If there is a failure to open a Flux handle at the end of the script, the script could block until a handle is available, on the assumption that the rank 0 broker is restarting. The `perilog` plugin could still monitor subprocesses as it does now, as a fallback.

This RPC would also allow manual release of prolog/epilog-start events in case something goes wrong (e.g. the perilog-run script exits or dies after a restart before it can send the finish RPC).
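For illustration only, the helper-script side of the proposed "finish" RPC might look roughly like the sketch below. The `job-manager.perilog.finish` topic, its payload fields, and the error handling are all assumptions about an interface that doesn't exist yet:

```python
#!/usr/bin/env python3
# Hypothetical sketch: tell the perilog jobtap plugin that a prolog/epilog
# has finished, retrying while the rank 0 broker is down/restarting.
# The "job-manager.perilog.finish" topic and payload are assumptions.
import sys
import time

import flux
from flux.job import JobID


def send_finish(jobid, kind, status, retries=60, delay=5.0):
    """Send a finish notification, retrying across a broker restart."""
    last_error = None
    for _ in range(retries):
        try:
            h = flux.Flux()  # may fail while rank 0 is restarting
            h.rpc(
                "job-manager.perilog.finish",  # hypothetical service method
                {"id": int(jobid), "kind": kind, "status": status},
            ).get()
            return True
        except OSError as exc:  # assuming errno-style failures surface as OSError
            last_error = exc
            time.sleep(delay)
    print(f"giving up after {retries} attempts: {last_error}", file=sys.stderr)
    return False


if __name__ == "__main__":
    # usage (hypothetical): perilog-finish JOBID prolog|epilog STATUS
    jobid = JobID(sys.argv[1])
    kind = sys.argv[2]
    status = int(sys.argv[3])
    sys.exit(0 if send_finish(jobid, kind, status) else 1)
```

The retry loop is what would let the script ride out a rank 0 restart: `flux.Flux()` fails while the broker is unavailable, and the notification is only sent once a handle can be opened again, matching the "block until a handle is available" idea above.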
The other possible solution, of course, would be to switch the `perilog` plugin, or a clone of it, to use `libsdexec` instead of libsubprocess. I'm not sure that would be quite as easy as the first solution proposed above.