flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0
168 stars 50 forks source link

systemd managed prolog can be slow to be canceled #6420

Closed grondo closed 2 weeks ago

grondo commented 2 weeks ago

In testing on fluke, I submitted a slew of jobs and then canceled them, to test cancellation of jobs while the prolog is running. A small number of nodes were drained with "prolog canceled then timed out". Here's relevant logs from an affected job and node:

The prolog is started at 15:52:53, then canceled at 15:53:27 (flux job eventlog sadly doesn't have a way to present timestamps in the current timezone, will have to fix, so I'm converting to PST here)

2024-11-07T23:52:53.021940Z prolog-start description="job-manager.prolog"
2024-11-07T23:53:27.759535Z exception type="cancel" severity=0 note="" userid=0

The prolog systemd unit shows as started at 15:52:57. A bit unexpected and concerning that it took 5s to just start the unit :thinking:

2024-11-07T15:52:57.171573-08:00 fluke65 systemd[1]: Starting Prolog for Flux job fEi29GDEQPZ...

and the unit got Main process exited by 15:53:42 .. that's a delay of 15s between the cancel exception and the prolog actually being killed:

2024-11-07T15:53:42.746527-08:00 fluke65 systemd[1]: flux-prolog@fEi29GDEQPZ.service: Main process exited, code=killed, status=15/TERM

By this time, the default 10s prolog cancel timeout had elapsed, a job exception was raised the node drained:

2024-11-07T23:53:37.766291Z exception type="prolog" severity=0 note="prolog canceled then timed out on fluke65 (rank 62)" userid=767

It then took another 11s before the unit failed and the prolog was stopped.

2024-11-07T15:53:53.915713-08:00 fluke65 systemd[1]: flux-prolog@fEi29GDEQPZ.service: Failed with result 'signal'.
2024-11-07T15:53:53.916551-08:00 fluke65 systemd[1]: Stopped Prolog for Flux job fEi29GDEQPZ.

If delays like this are common with the systemd managed prolog, it might be wise to increase the SIGTERM timeout from 10s to something more like 30s or 1m. This can be done via configuration, but perhaps a new default would make sense.

garlick commented 2 weeks ago

Slow iscsi root has been blamed for similar problems lately. Agreed, a more generous default timeout makes sense here.