flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

flux-shutdown: need option to force fast shutdown #5843

Open grondo opened 7 months ago

grondo commented 7 months ago

There are several things that currently block an orderly flux shutdown, including slow epilogs that hold jobs in CLEANUP, bugs in jobtap plugins that leave jobs needing manual cleanup, etc.

It would be nice to have an option to bypass waiting for jobs in CLEANUP until Flux supports a restart with running/cleanup jobs.

Perhaps we could also add another shutdown script that automatically "fixes" any jobs in cleanup by forcing missing epilog-finish events.

garlick commented 7 months ago

To review the current situation, in rc1 we have:

if test $RANK -eq 0; then
    if test -z "${FLUX_DISABLE_JOB_CLEANUP}"; then
	# <<- strips leading tabs only, so the here-document must be tab-indented
	flux admin cleanup-push <<-EOT
	flux queue stop --quiet --all --nocheckpoint
	flux cancel --user=all --quiet --states RUN
	flux queue idle --quiet
	EOT
    fi
fi

When shutdown begins, either due to a SIGTERM from systemctl stop flux or from flux shutdown, the first thing that happens is that this scriptlet is executed on rank 0. Only after it completes does rc3 begin running, starting with the TBON leaves and eventually reaching rank 0.

The flux queue idle command in the scriptlet will block until there are no jobs in RUN or CLEANUP state. Sadness results when jobs don't respond to the cancel request.
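To make the blocking behavior concrete, here is a minimal sketch of how the wait could be bounded. It is not something proposed in this thread: the helper name bounded_wait and the FLUX_CLEANUP_TIMEOUT variable are hypothetical, and it assumes the coreutils timeout command is available on rank 0.

```shell
#!/bin/sh
# Hypothetical helper (not part of flux-core): run a command, but stop
# waiting for it after LIMIT seconds.
# Returns 0 if the command succeeded in time, 1 if it timed out or failed.
bounded_wait() {
    limit=$1
    shift
    # timeout(1) kills the command with SIGTERM once the limit expires
    # and then exits non-zero.
    if timeout "$limit" "$@"; then
        return 0
    fi
    echo "bounded_wait: '$*' did not finish cleanly within ${limit}s, continuing" >&2
    return 1
}

# Assumed use in the cleanup scriptlet (needs a running Flux instance;
# FLUX_CLEANUP_TIMEOUT is a made-up variable for illustration):
#   bounded_wait "${FLUX_CLEANUP_TIMEOUT:-60}" flux queue idle --quiet
```

With something like this in place of the bare flux queue idle --quiet, the scriptlet would return after the limit even if jobs are stuck in CLEANUP, and rc3 could proceed. Whether abandoning those jobs is safe is exactly the open question in this issue.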

Here's a straw man proposal:

Also: I think some relief may be had once we get #5818 worked out. In that proposal, jobs transition to INACTIVE before the housekeeping script completes. If housekeeping hangs, it doesn't prevent the instance from stopping, and when the instance restarts, any still-running housekeeping scripts are ignored. We probably need a way to reacquire any running housekeeping tasks on restart and avoid scheduling on those nodes, but the proposed behavior is probably a step in the right direction.

grondo commented 7 months ago

If the coral2 plugins are introducing problems with epilog reference counts, perhaps that package could provide a shutdown.d scriptlet to fix them until a better solution is found?

FYI - I think this particular issue was fixed by flux-framework/flux-coral2#141