In testing on fluke, I submitted a slew of jobs and then canceled them, to test cancellation of jobs while the prolog is running.
A small number of nodes were drained with "prolog canceled then timed out". Here's relevant logs from an affected job and node:
The prolog is started at 15:52:53, then canceled at 15:53:27 (flux job eventlog sadly doesn't have a way to present timestamps in the current timezone, will have to fix, so I'm converting to PST here)
The prolog systemd unit shows as started at 15:52:57. A bit unexpected and concerning that it took 5s to just start the unit :thinking:
2024-11-07T15:52:57.171573-08:00 fluke65 systemd[1]: Starting Prolog for Flux job fEi29GDEQPZ...
and the unit got Main process exited by 15:53:42 .. that's a delay of 15s between the cancel exception and the prolog actually being killed:
2024-11-07T15:53:42.746527-08:00 fluke65 systemd[1]: flux-prolog@fEi29GDEQPZ.service: Main process exited, code=killed, status=15/TERM
By this time, the default 10s prolog cancel timeout had elapsed, a job exception was raised the node drained:
2024-11-07T23:53:37.766291Z exception type="prolog" severity=0 note="prolog canceled then timed out on fluke65 (rank 62)" userid=767
It then took another 11s before the unit failed and the prolog was stopped.
2024-11-07T15:53:53.915713-08:00 fluke65 systemd[1]: flux-prolog@fEi29GDEQPZ.service: Failed with result 'signal'.
2024-11-07T15:53:53.916551-08:00 fluke65 systemd[1]: Stopped Prolog for Flux job fEi29GDEQPZ.
If delays like this are common with the systemd managed prolog, it might be wise to increase the SIGTERM timeout from 10s to something more like 30s or 1m. This can be done via configuration, but perhaps a new default would make sense.
In testing on fluke, I submitted a slew of jobs and then canceled them, to test cancellation of jobs while the prolog is running. A small number of nodes were drained with "prolog canceled then timed out". Here's relevant logs from an affected job and node:
The prolog is started at 15:52:53, then canceled at 15:53:27 (
flux job eventlog
sadly doesn't have a way to present timestamps in the current timezone, will have to fix, so I'm converting to PST here)The prolog systemd unit shows as started at 15:52:57. A bit unexpected and concerning that it took 5s to just start the unit :thinking:
and the unit got
Main process exited
by 15:53:42 .. that's a delay of 15s between the cancel exception and the prolog actually being killed:By this time, the default 10s prolog cancel timeout had elapsed, a job exception was raised the node drained:
It then took another 11s before the unit failed and the prolog was stopped.
If delays like this are common with the systemd managed prolog, it might be wise to increase the SIGTERM timeout from 10s to something more like 30s or 1m. This can be done via configuration, but perhaps a new default would make sense.