Closed garlick closed 3 weeks ago
The cleanup phase currently runs:
flux queue stop --quiet --all --nocheckpoint
flux resource acquire-mute
flux cancel --user=all --quiet --states RUN
flux queue idle --quiet
flux queue idle has a --timeout option. Maybe we ought to set that to 1m or something for a start?
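Applied to the sequence above, that suggestion might look roughly like the sketch below. The command lines are the ones quoted in this issue, with only the timeout added to the last step. `FLUX` and `IDLE_TIMEOUT` are hypothetical variables introduced here so the sequence can be dry-run, and wrapping the steps in a `cleanup_phase` function is purely for illustration, not how the rc scripts are organized.

```shell
#!/bin/sh
# Sketch only: the current cleanup commands with the proposed timeout on the
# final step. FLUX and IDLE_TIMEOUT are hypothetical knobs, not Flux settings.
cleanup_phase() {
    flux_cmd="${FLUX:-flux}"            # set FLUX=echo for a dry run
    idle_timeout="${IDLE_TIMEOUT:-1m}"  # 1m is the value floated above

    "$flux_cmd" queue stop --quiet --all --nocheckpoint
    "$flux_cmd" resource acquire-mute
    "$flux_cmd" cancel --user=all --quiet --states RUN
    # Wait for the queue to go idle, but give up after idle_timeout.
    "$flux_cmd" queue idle --quiet --timeout="$idle_timeout"
}
```

Running `FLUX=echo cleanup_phase` prints the four command lines without touching a Flux instance, which is a cheap way to check the flags before trying them for real.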
It seems like it may be a misstep to set the flux queue idle timeout for all instances. What if a completing batch job is just stuck waiting on Lustre/NFS or something?
Maybe we could conditionally impose a timeout on the whole cleanup phase if SIGTERM has been received, and otherwise not have one? Then a canceled batch job or systemctl stop would terminate more aggressively and quickly get to the phase where a KVS dump is written out, but a "nice" shutdown would be unaffected?
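To make the idea concrete, here is a minimal shell sketch of a deadline that only applies after SIGTERM has been seen. It is not how the broker implements its shutdown; `got_sigterm`, `CLEANUP_TIMEOUT`, and `run_cleanup` are hypothetical names, and `timeout` is the GNU coreutils utility rather than anything provided by Flux.

```shell
#!/bin/sh
# Sketch only: bound the cleanup command only when SIGTERM was received.
# got_sigterm, CLEANUP_TIMEOUT and run_cleanup are hypothetical names.
got_sigterm=0
trap 'got_sigterm=1' TERM   # remember that a SIGTERM arrived

run_cleanup() {
    if [ "$got_sigterm" -eq 1 ]; then
        # Aggressive path: coreutils timeout kills the command after the
        # deadline (in seconds) and exits with status 124.
        timeout "${CLEANUP_TIMEOUT:-60}" "$@"
    else
        # "Nice" shutdown: wait for as long as cleanup needs.
        "$@"
    fi
}
```

For instance, `run_cleanup flux queue idle --quiet` would wait indefinitely during an ordinary shutdown, but would be cut off after `CLEANUP_TIMEOUT` seconds once SIGTERM has been delivered. The 60 second default is an arbitrary placeholder.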
Problem: when rank 0 flux is stopped with systemctl stop flux on a real system, timeouts are likely. (We recommend flux shutdown to stop flux outside of systemd timeouts, but this is not always followed.)
In our systemd unit file (which applies to all ranks) we have TimeoutStopSec=90, so the whole process must take less than 90s before systemd forcibly kills us.
Exception: we have Type=notify, and flux dump and flux restore call sd_notify() periodically to extend timeouts (see #5840).
However, the cleanup phase can get stuck if jobs won't terminate.
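For reference, the two unit settings mentioned above both belong in the [Service] section. The fragment below shows only those two lines as quoted in this issue; the rest of the actual unit file is not reproduced here.

```ini
[Service]
# Only the two directives discussed in this issue; everything else omitted.
Type=notify
TimeoutStopSec=90
```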