Closed garlick closed 2 weeks ago
I got a job stuck (forever) in the epilog on my test system and ran systemctl stop flux
on rank 0. The timeout did occur after 45s and send a SIGHUP to flux queue idle
.
The follower brokers then shut down and this caused the perilog plugin to get an EHOSTUNREACH on the compute node that was stuck. It posted the epilog-finish event.
Then the broker exited with code 129, leaving the flux unit in the failed state, appropriate given that cleanup did not complete in time.
Oct 29 10:31:02 picl0 systemd[1]: Stopping flux.service - Flux message broker...
Oct 29 10:31:02 picl0 flux[259773]: broker.info[0]: signal 15
Oct 29 10:31:02 picl0 flux[259773]: broker.info[0]: shutdown: run->cleanup 2.07106m
Oct 29 10:31:03 picl0 flux[259773]: broker.info[0]: cleanup.0: flux queue stop --quiet --all --nocheckpoint Exited (rc=0) 0.1s
Oct 29 10:31:03 picl0 flux[259773]: broker.info[0]: cleanup.1: flux resource acquire-mute Exited (rc=0) 0.1s
Oct 29 10:31:03 picl0 flux[259773]: broker.info[0]: cleanup.2: flux cancel --user=all --quiet --states RUN Exited (rc=0) 0.1s
Oct 29 10:31:47 picl0 flux[259773]: broker.err[0]: cleanup.3: flux queue idle --quiet Hangup (rc=129) 44.6s
Oct 29 10:31:47 picl0 flux[259773]: broker.info[0]: cleanup-fail: cleanup->shutdown 45.0035s
Oct 29 10:31:48 picl0 flux[259773]: job-manager.err[0]: ƒ39firM4HrF: epilog: picl1 (rank 1): No route to host
Oct 29 10:31:48 picl0 flux[259773]: job-manager.err[0]: housekeeping: picl1 (rank 1) ƒ39firM4HrF: No route to host
Oct 29 10:31:48 picl0 flux[259773]: broker.info[0]: children-complete: shutdown->finalize 0.551182s
Oct 29 10:31:48 picl0 flux[259773]: broker.info[0]: rc3.0: running /etc/flux/rc3.d/01-sched-fluxion
Oct 29 10:31:48 picl0 flux[259773]: broker.info[0]: rc3.0: dumping content to /var/lib/flux/dump/20241029_103148.tgz
Oct 29 10:31:48 picl0 flux[259773]: broker.info[0]: rc3.0: /etc/flux/rc3 Exited (rc=0) 0.4s
Oct 29 10:31:48 picl0 flux[259773]: broker.info[0]: rc3-success: finalize->goodbye 0.361826s
Oct 29 10:31:48 picl0 flux[259773]: broker.info[0]: goodbye: goodbye->exit 0.04826ms
Oct 29 10:31:48 picl0 systemd[1]: flux.service: Main process exited, code=exited, status=129/n/a
Oct 29 10:31:48 picl0 systemd[1]: flux.service: Failed with result 'exit-code'.
Oct 29 10:31:48 picl0 systemd[1]: Stopped flux.service - Flux message broker.
Removing WIP as that was the expected behavior.
Restarted coverage builder. It was reporting errors like libgcov profiling error:/usr/src/src/common/libhostlist/.libs/hostlist.gcda:Merge mismatch for function 10
and was terminated after 60m. :shrug:
What occurred on restart after cleanup was aborted on the stuck job?
What occurred on restart after cleanup was aborted on the stuck job?
Nothing - apparently the EHOSTUNREACH on the epilog did not prevent the epilog-finish
event from being posted. Hmm, maybe it's a separate issue that the epilog-finish
event has status=0
here?
$ flux job eventlog -H ƒ39firM4HrF
[Oct29 10:30] submit userid=5588 urgency=16 flags=0 version=1
[ +0.011701] validate
[ +0.022527] depend
[ +0.022566] priority priority=16
[ +0.025176] alloc
[ +0.025380] prolog-start description="job-manager.prolog"
[ +0.025396] prolog-start description="cray-pals-port-distributor"
[ +0.026028] prolog-finish description="cray-pals-port-distributor" status=0
[ +0.166825] prolog-finish description="job-manager.prolog" status=0
[ +0.183298] start
[ +0.330490] finish status=0
[ +0.330643] epilog-start description="job-manager.epilog"
[ +0.332395] release ranks="all" final=true
[Oct29 10:31] epilog-finish description="job-manager.epilog" status=0
[ +0.000159] free
[ +0.000197] clean
maybe it's a separate issue that the epilog-finish event has status=0 here?
I think that was on purpose, but things have changed, e.g. #6349
Nothing - apparently the EHOSTUNREACH on the epilog did not prevent the epilog-finish event from being posted.
Ah, ok. I was just making sure we weren't trading an issue at shutdown for an issue at bringup.
I was just making sure we weren't trading an issue at shutdown for an issue at bringup.
The main change is that the leaves-to-root shutdown process of running rc3 can begin even though there may be some jobs stuck in various states. Brokers make no attempt to wait for subprocesses while shutting down, so I think the exec EHOSTUNREACH error will become a more prevalent failure mode during shutdown, whether it be in perilog, job-exec, or housekeeping.
I guess the question is will job-exec leave the job eventlogs in a state that will allow them to be replayed on startup without issue. This change does give modules a chance to shut down cleanly as opposed to being summarily SIGKILLed when systemd imposes the timeout, so at least theoretically we are in a better situation.
I'll see if I can get a job stuck while running and see what job-exec does. I expect it to raise a fatal exception (at least one critical rank will always be involved). Perhaps on reload replaying that fatal exception will put the job in CLEANUP state and the job manager will start the epilog?
I thought I'd get a job "stuck" by hard-nfs mounting a directory from another node, start a job tailing a file from there, and turn off the nfs server node. Well, this gets the tail process stuck, but if I flux job cancel
the job, the job is immediately terminated with the following in the eventlog
[Oct29 16:14] exception type="cancel" severity=0 note="" userid=5588
[ +0.344697] finish status=36608
And just the tail process remains on the node
garlick@picl0:~$ sudo flux exec -r 1 systemctl --user status
● picl1
State: running
Units: 102 loaded (incl. loaded aliases)
Jobs: 0 queued
Failed: 0 units
Since: Thu 2024-10-24 08:56:30 PDT; 5 days ago
systemd: 252.30-1~deb12u2
CGroup: /user.slice/user-500.slice/user@500.service
├─app.slice
│ └─imp-shell-1-f3CDZDLRras.service
│ └─15523 "[tail]"
├─init.scope
│ ├─797 /lib/systemd/systemd --user
│ └─798 "(sd-pam)"
└─session.slice
└─dbus.service
└─829 /usr/bin/dbus-daemon --session --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
Hmm, I would have thought the shell would trap the SIGTERM, forward it to the task, then wait for the task to exit, at least until the SIGKILLs come around.
It looks what's happening is the IMP gets the SIGTERM and immediately exits with 128+15, then systemd, having noticed the "main process" exited, immediately does a SIGKILL on the process group.
Oct 29 16:14:17 picl1 systemd[797]: Starting imp-kill-1-f3CDZDLRras.service - User workload...
Oct 29 16:14:17 picl1 systemd[797]: Started imp-kill-1-f3CDZDLRras.service - User workload.
Oct 29 16:14:17 picl1 systemd[797]: Stopped imp-kill-1-f3CDZDLRras.service - User workload.
Oct 29 16:14:17 picl1 systemd[797]: imp-shell-1-f3CDZDLRras.service: Main process exited, code=exited, status=143/n/a
Oct 29 16:14:17 picl1 systemd[797]: imp-shell-1-f3CDZDLRras.service: Failed to kill control group /user.slice/user-500.slice/user@500.service/app.slice/imp-shell-1-f3CDZDLRras.service, ignoring: Operation not permitted
Oct 29 16:14:17 picl1 systemd[797]: imp-shell-1-f3CDZDLRras.service: Killing process 15523 (tail) with signal SIGKILL.
Oct 29 16:14:17 picl1 systemd[797]: imp-shell-1-f3CDZDLRras.service: Failed to kill control group /user.slice/user-500.slice/user@500.service/app.slice/imp-shell-1-f3CDZDLRras.service, ignoring: Operation not permitted
Oct 29 16:14:17 picl1 systemd[797]: imp-shell-1-f3CDZDLRras.service: Failed to kill control group /user.slice/user-500.slice/user@500.service/app.slice/imp-shell-1-f3CDZDLRras.service, ignoring: Operation not permitted
Oct 29 16:14:17 picl1 systemd[797]: imp-shell-1-f3CDZDLRras.service: Killing process 15523 (tail) with signal SIGKILL.
Oct 29 16:14:17 picl1 systemd[797]: imp-shell-1-f3CDZDLRras.service: Failed to kill control group /user.slice/user-500.slice/user@500.service/app.slice/imp-shell-1-f3CDZDLRras.service, ignoring: Operation not permitted
Oct 29 16:14:17 picl1 systemd[797]: imp-shell-1-f3CDZDLRras.service: Failed with result 'exit-code'.
Oct 29 16:14:17 picl1 systemd[797]: imp-shell-1-f3CDZDLRras.service: Unit process 15523 (tail) remains running after unit stopped.
Ah so things work properly, that is the IMP waits for flux-shell
which waits for stuck process if I run the tail
command directly. I was running it indirectly with sh -c
, and there it seems sh
exits even though a child is stuck, and flux-shell
and the IMP think the job successfully terminated and exit.
I'm not sure what the right behavior is for this case, although this doesn't seem right.
That diversion was actually off topic for this issue. Now that I know how to make things hang, I'll repeat the experiment of shutting the system down with a job that's stuck running.
Edit: well, now I'm getting inconsistent results. Maybe take this with a grain of salt until I can confirm.
I think we could consider this for inclusion in our next tag. If included, we could try some tests on fluke where the system is shut down with running jobs and see what pops up. If there is a problem and we choose not to roll it out, a one line change to the main systemd unit file would disable it
Setting MWP then.
Attention: Patch coverage is 96.15385%
with 1 line
in your changes missing coverage. Please review.
Project coverage is 83.56%. Comparing base (
9411280
) to head (c9db7aa
).
Files with missing lines | Patch % | Lines |
---|---|---|
src/broker/state_machine.c | 96.15% | 1 Missing :warning: |
Problem:
systemctl stop flux
on rank 0 may hang waiting for the queue to become idle, and the broker may be killed before it can write its KVS checkpoint if the systemdTimeoutStopSec=90s
is exceeded.This adds a 45s timeout over all cleanup actions when the broker is started by systemd. The timeout only applies if the broker is being shut down on a signal, which is the case for
systemctl stop flux
but notflux shutdown
.If cleanup gets stuck waiting for the queue to become idle after a
systemctl stop flux
it should move on to stopping the rest of the system after 45s. It's possible we'll discover another place after cleanup where things can go wrong, but this seems like a good first step.By default this timeout is disabled. It is configured on in the systemd unit file, where
TimeoutStopSec
is also specified, which I think helps clarity a little bit.Marking WIP pending testing on a real system instance.