Open grondo opened 7 months ago
To review the current situation, in rc1 we have:
if test $RANK -eq 0; then
	if test -z "${FLUX_DISABLE_JOB_CLEANUP}"; then
		flux admin cleanup-push <<-EOT
		flux queue stop --quiet --all --nocheckpoint
		flux cancel --user=all --quiet --states RUN
		flux queue idle --quiet
		EOT
	fi
fi
When shutdown begins, either due to a SIGTERM from systemctl stop flux or from flux shutdown, the first thing that happens is this scriptlet gets executed on rank 0. Only upon completion do we begin running rc3, starting with the TBON leaves, until eventually it runs on rank 0.
The flux queue idle command in the scriptlet will block until there are no jobs in RUN or CLEANUP state. Sadness results when jobs don't respond to the cancel request.
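To illustrate the blocking problem, here is a minimal sketch of putting an upper bound on a step like this using coreutils timeout(1), independent of any flux option. The names bounded and IDLE_TIMEOUT are hypothetical and not part of flux; giving up on the wait also means jobs may still be in RUN or CLEANUP when rc3 starts.

```shell
#!/bin/sh
# Sketch only: bounded and IDLE_TIMEOUT are hypothetical names, not flux features.
# Upper bound, in seconds, on how long a cleanup step may block.
IDLE_TIMEOUT=${IDLE_TIMEOUT:-300}

# Run a command under timeout(1).
# Returns 0 if the command succeeded within IDLE_TIMEOUT seconds,
# 1 if it failed or was killed because the time limit expired.
bounded() {
    if timeout "$IDLE_TIMEOUT" "$@"; then
        return 0
    fi
    echo "cleanup step failed or exceeded ${IDLE_TIMEOUT}s: $*" >&2
    return 1
}

# Assumed usage as the last line of the cleanup scriptlet:
#   bounded flux queue idle --quiet || true
```

The trailing `|| true` in the usage comment is a design choice: it lets shutdown proceed to rc3 even when the wait is abandoned, which trades an orderly shutdown for a guaranteed one.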
Here's a straw man proposal:
- /etc/flux/shutdown
- /etc/flux/shutdown.d
- flux shutdown to /etc/flux/shutdown (and its sub-scripts)
- flux queue idle to be run with the --timeout option

Also: I think some relief may be had once we get #5818 worked out. In that proposal, jobs transition to INACTIVE before the housekeeping script completes. If housekeeping gets hung, it doesn't prevent the instance from stopping, and when it restarts, any still-running housekeeping scripts are ignored. We probably need a way to reacquire any running housekeeping tasks on restart and avoid scheduling on those nodes, but the proposed behavior is probably a step in the right direction.
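As a rough illustration of the shutdown.d idea in the straw man, a top-level script could run every executable found in a drop-in directory. This is only a sketch of the general drop-in pattern, not the design settled on in this issue; run_shutdown_dir is a hypothetical helper, and continuing past a failed script is an assumption.

```shell
#!/bin/sh
# Sketch only: run_shutdown_dir is a hypothetical helper, not part of flux.
# Runs each executable regular file in the given directory, in the sorted
# order produced by the shell glob. A failing script is reported but does
# not stop the remaining scripts from running.
# Returns 0 if every script succeeded, 1 if any script failed.
run_shutdown_dir() {
    dir=$1
    rc=0
    for script in "$dir"/*; do
        # Skip anything that is not an executable regular file. This also
        # covers an empty directory, where the glob is left unexpanded.
        [ -f "$script" ] && [ -x "$script" ] || continue
        if ! "$script"; then
            echo "shutdown script failed: $script" >&2
            rc=1
        fi
    done
    return $rc
}

# Assumed usage from the proposed top-level script:
#   run_shutdown_dir /etc/flux/shutdown.d
```

With this layout, a package could drop in its own scriptlet (for example with a numeric prefix to control ordering) without editing the top-level script.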
If the coral2 plugins are introducing problems with epilog reference counts, perhaps that package could provide a shutdown.d scriptlet to fix them until a better solution is found?
FYI - I think this particular issue was fixed by flux-framework/flux-coral2#141
There are several things that currently block an orderly flux shutdown, including slow epilogs that hold jobs in CLEANUP, bugs in jobtap plugins that leave jobs needing manual cleanup, etc.
It would be nice to have an option to bypass waiting for jobs in CLEANUP until Flux supports a restart with running/cleanup jobs.
Perhaps we could also add another shutdown script that automatically "fixes" any jobs in cleanup by forcing missing epilog-finish events.