systemctl stop flux on rank 0 is fragile

flux-framework / flux-core

systemctl stop flux on rank 0 is fragile #6388

Closed: garlick closed this issue 3 weeks ago

garlick commented 1 month ago

Problem: when flux on rank 0 is stopped with systemctl stop flux on a real system, it is likely to run into systemd's stop timeout.

(We recommend using flux shutdown to stop flux, which is not subject to systemd timeouts, but this advice is not always followed.)

In our systemd unit file (which applies to all ranks) we have TimeoutStopSec=90, so the whole shutdown must complete within 90s or systemd forcibly kills us.

Exception: we have Type=notify, and flux dump and flux restore call sd_notify() periodically to extend timeouts (see #5840).
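
For illustration only, here is a rough shell sketch of the timeout-extension idea. It assumes the extension is requested with systemd's documented EXTEND_TIMEOUT_USEC=... notification, sent here through the systemd-notify command rather than the sd_notify() C call that flux uses. The function names extend_timeout_msg and notify_extend are hypothetical and do not come from flux-core.

```sh
#!/bin/sh
# Hypothetical helpers (not flux-core code): ask systemd for more time
# during a long-running stop or start phase of a Type=notify unit.

# Build the notification string; systemd expects microseconds.
extend_timeout_msg() {
    printf 'EXTEND_TIMEOUT_USEC=%s' "$(( $1 * 1000000 ))"
}

# Send the notification. This is a no-op when NOTIFY_SOCKET is unset,
# i.e. when the process is not running under a notify-type unit.
notify_extend() {
    [ -n "${NOTIFY_SOCKET:-}" ] || return 0
    systemd-notify "$(extend_timeout_msg "$1")"
}
```

A long-running step would call something like notify_extend 30 at intervals shorter than 30s while it is still making progress. Whether systemd accepts messages from a helper process depends on the unit's NotifyAccess= setting.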

However the cleanup phase can get stuck if jobs won't terminate.

garlick commented 1 month ago

The cleanup phase currently runs:

flux queue stop --quiet --all --nocheckpoint
flux resource acquire-mute
flux cancel --user=all --quiet --states RUN
flux queue idle --quiet

flux queue idle has a --timeout option. Maybe we ought to set that to 1m or something for a start?
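
As a sketch of what that might look like, the cleanup commands above can be wrapped in a function that takes an optional idle timeout. The function name cleanup_phase is made up for illustration, and the --timeout=1m spelling assumes the option takes a duration like the "1m" suggested here. This is not how flux-core actually registers its cleanup commands.

```sh
#!/bin/sh
# Hypothetical wrapper (not flux-core code) around the cleanup commands
# listed in this issue. $1 is an optional timeout for "flux queue idle".
cleanup_phase() {
    idle_timeout=${1:-}

    flux queue stop --quiet --all --nocheckpoint
    flux resource acquire-mute
    flux cancel --user=all --quiet --states RUN

    if [ -n "$idle_timeout" ]; then
        # Give up waiting for the queue to go idle after the timeout.
        flux queue idle --quiet --timeout="$idle_timeout"
    else
        # Current behavior: wait indefinitely.
        flux queue idle --quiet
    fi
}
```

Calling cleanup_phase 1m would bound the wait, while calling cleanup_phase with no argument keeps today's unbounded wait.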

garlick commented 1 month ago

It seems like it may be a misstep to set the flux queue idle timeout for all instances. What if a completing batch job is just stuck waiting on Lustre/NFS or something?

Maybe we could conditionally impose a timeout on the whole cleanup phase if SIGTERM has been received, and otherwise not have one? Then a canceled batch job or systemctl stop would terminate more aggressively and quickly reach the phase where the KVS dump is written out, but a "nice" shutdown would be unaffected?
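
To make the proposal concrete, here is a minimal shell sketch of the conditional timeout. It assumes the caller already knows whether SIGTERM was received and uses the coreutils timeout command to bound the cleanup. The names run_cleanup and CLEANUP_TIMEOUT are hypothetical, and a real implementation would more likely live in the broker itself than in a script.

```sh
#!/bin/sh
# Hypothetical sketch (not flux-core code): bound the cleanup phase only
# when shutdown was triggered by SIGTERM.

# Illustrative default limit, in seconds.
CLEANUP_TIMEOUT=${CLEANUP_TIMEOUT:-60}

# $1: "1" if SIGTERM was received, "0" otherwise.
# Remaining arguments: the cleanup command to run.
run_cleanup() {
    got_sigterm=$1
    shift
    if [ "$got_sigterm" = "1" ]; then
        # Aggressive path: timeout(1) kills the command and exits 124
        # if the limit is reached.
        timeout "$CLEANUP_TIMEOUT" "$@"
    else
        # "Nice" shutdown: no limit, same as today.
        "$@"
    fi
}
```

For example, run_cleanup 1 flux queue idle --quiet would stop waiting after CLEANUP_TIMEOUT seconds so that shutdown can move on to the KVS dump, while run_cleanup 0 flux queue idle --quiet waits as long as needed.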