flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

add cleanup timeout for systemctl stop flux on rank 0 #6397

Closed (garlick closed this 2 weeks ago)

garlick commented 3 weeks ago

Problem: systemctl stop flux on rank 0 may hang waiting for the queue to become idle, and the broker may be killed before it can write its KVS checkpoint if the systemd TimeoutStopSec=90s is exceeded.

This adds a 45s timeout over all cleanup actions when the broker is started by systemd. The timeout only applies if the broker is being shut down on a signal, which is the case for systemctl stop flux but not flux shutdown.

If cleanup gets stuck waiting for the queue to become idle after a systemctl stop flux it should move on to stopping the rest of the system after 45s. It's possible we'll discover another place after cleanup where things can go wrong, but this seems like a good first step.

By default this timeout is disabled. It is enabled in the systemd unit file, where TimeoutStopSec is also specified, which I think helps clarity a little bit.
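
For illustration, the relevant part of flux.service might look roughly like the sketch below. The broker.cleanup-timeout attribute name and the abbreviated ExecStart line are placeholders for illustration, not necessarily the exact option added here; TimeoutStopSec=90s is the value mentioned above.

[Service]
# systemd gives the unit 90s total before escalating to SIGKILL
TimeoutStopSec=90s
# placeholder attribute enabling the broker's internal 45s cleanup timeout,
# keeping it well inside systemd's 90s outer bound
ExecStart=/usr/bin/flux broker -Sbroker.cleanup-timeout=45 ...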

Marking WIP pending testing on a real system instance.

garlick commented 3 weeks ago

I got a job stuck (forever) in the epilog on my test system and ran systemctl stop flux on rank 0. The timeout did occur after 45s and sent a SIGHUP to flux queue idle.

The follower brokers then shut down and this caused the perilog plugin to get an EHOSTUNREACH on the compute node that was stuck. It posted the epilog-finish event.

Then the broker exited with code 129, leaving the flux unit in the failed state, which seems appropriate given that cleanup did not complete in time.

Oct 29 10:31:02 picl0 systemd[1]: Stopping flux.service - Flux message broker...
Oct 29 10:31:02 picl0 flux[259773]: broker.info[0]: signal 15
Oct 29 10:31:02 picl0 flux[259773]: broker.info[0]: shutdown: run->cleanup 2.07106m
Oct 29 10:31:03 picl0 flux[259773]: broker.info[0]: cleanup.0: flux queue stop --quiet --all --nocheckpoint Exited (rc=0) 0.1s
Oct 29 10:31:03 picl0 flux[259773]: broker.info[0]: cleanup.1: flux resource acquire-mute Exited (rc=0) 0.1s
Oct 29 10:31:03 picl0 flux[259773]: broker.info[0]: cleanup.2: flux cancel --user=all --quiet --states RUN Exited (rc=0) 0.1s
Oct 29 10:31:47 picl0 flux[259773]: broker.err[0]: cleanup.3: flux queue idle --quiet Hangup (rc=129) 44.6s
Oct 29 10:31:47 picl0 flux[259773]: broker.info[0]: cleanup-fail: cleanup->shutdown 45.0035s
Oct 29 10:31:48 picl0 flux[259773]: job-manager.err[0]: ƒ39firM4HrF: epilog: picl1 (rank 1): No route to host
Oct 29 10:31:48 picl0 flux[259773]: job-manager.err[0]: housekeeping: picl1 (rank 1) ƒ39firM4HrF: No route to host

Oct 29 10:31:48 picl0 flux[259773]: broker.info[0]: children-complete: shutdown->finalize 0.551182s
Oct 29 10:31:48 picl0 flux[259773]: broker.info[0]: rc3.0: running /etc/flux/rc3.d/01-sched-fluxion
Oct 29 10:31:48 picl0 flux[259773]: broker.info[0]: rc3.0: dumping content to /var/lib/flux/dump/20241029_103148.tgz
Oct 29 10:31:48 picl0 flux[259773]: broker.info[0]: rc3.0: /etc/flux/rc3 Exited (rc=0) 0.4s
Oct 29 10:31:48 picl0 flux[259773]: broker.info[0]: rc3-success: finalize->goodbye 0.361826s
Oct 29 10:31:48 picl0 flux[259773]: broker.info[0]: goodbye: goodbye->exit 0.04826ms
Oct 29 10:31:48 picl0 systemd[1]: flux.service: Main process exited, code=exited, status=129/n/a
Oct 29 10:31:48 picl0 systemd[1]: flux.service: Failed with result 'exit-code'.
Oct 29 10:31:48 picl0 systemd[1]: Stopped flux.service - Flux message broker.

Removing WIP as that was the expected behavior.

garlick commented 3 weeks ago

Restarted the coverage builder. It was reporting errors like "libgcov profiling error:/usr/src/src/common/libhostlist/.libs/hostlist.gcda:Merge mismatch for function 10" and was terminated after 60m. :shrug:

grondo commented 3 weeks ago

What occurred on restart after cleanup was aborted on the stuck job?

garlick commented 3 weeks ago

What occurred on restart after cleanup was aborted on the stuck job?

Nothing - apparently the EHOSTUNREACH on the epilog did not prevent the epilog-finish event from being posted. Hmm, maybe it's a separate issue that the epilog-finish event has status=0 here?

$ flux job eventlog -H ƒ39firM4HrF
[Oct29 10:30] submit userid=5588 urgency=16 flags=0 version=1
[  +0.011701] validate
[  +0.022527] depend
[  +0.022566] priority priority=16
[  +0.025176] alloc
[  +0.025380] prolog-start description="job-manager.prolog"
[  +0.025396] prolog-start description="cray-pals-port-distributor"
[  +0.026028] prolog-finish description="cray-pals-port-distributor" status=0
[  +0.166825] prolog-finish description="job-manager.prolog" status=0
[  +0.183298] start
[  +0.330490] finish status=0
[  +0.330643] epilog-start description="job-manager.epilog"
[  +0.332395] release ranks="all" final=true
[Oct29 10:31] epilog-finish description="job-manager.epilog" status=0
[  +0.000159] free
[  +0.000197] clean

grondo commented 3 weeks ago

maybe it's a separate issue that the epilog-finish event has status=0 here?

I think that was on purpose, but things have changed, e.g. #6349

grondo commented 3 weeks ago

Nothing - apparently the EHOSTUNREACH on the epilog did not prevent the epilog-finish event from being posted.

Ah, ok. I was just making sure we weren't trading an issue at shutdown for an issue at bringup.

garlick commented 3 weeks ago

I was just making sure we weren't trading an issue at shutdown for an issue at bringup.

The main change is that the leaves-to-root shutdown process of running rc3 can begin even though there may be some jobs stuck in various states. Brokers make no attempt to wait for subprocesses while shutting down, so I think the exec EHOSTUNREACH error will become a more prevalent failure mode during shutdown, whether it be in perilog, job-exec, or housekeeping.

I guess the question is whether job-exec will leave the job eventlogs in a state that allows them to be replayed on startup without issue. This change does give modules a chance to shut down cleanly as opposed to being summarily SIGKILLed when systemd imposes the timeout, so at least theoretically we are in a better situation.

I'll see if I can get a job stuck while running and see what job-exec does. I expect it to raise a fatal exception (at least one critical rank will always be involved). Perhaps on reload replaying that fatal exception will put the job in CLEANUP state and the job manager will start the epilog?
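
For reference, checking what job-exec actually did on replay should just need the standard CLI once the instance is back up, e.g. (using the job id from the eventlog above):

$ flux jobs -a
$ flux job eventlog -H ƒ39firM4HrF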

garlick commented 3 weeks ago

I thought I'd get a job "stuck" by hard-mounting an NFS directory from another node, starting a job tailing a file from there, and turning off the NFS server node. Well, this gets the tail process stuck, but if I flux job cancel the job, the job is immediately terminated with the following in the eventlog:

[Oct29 16:14] exception type="cancel" severity=0 note="" userid=5588
[  +0.344697] finish status=36608

And just the tail process remains on the node

 garlick@picl0:~$ sudo flux exec -r 1 systemctl --user status
● picl1
    State: running
    Units: 102 loaded (incl. loaded aliases)
     Jobs: 0 queued
   Failed: 0 units
    Since: Thu 2024-10-24 08:56:30 PDT; 5 days ago
  systemd: 252.30-1~deb12u2
   CGroup: /user.slice/user-500.slice/user@500.service
           ├─app.slice
           │ └─imp-shell-1-f3CDZDLRras.service
           │   └─15523 "[tail]"
           ├─init.scope
           │ ├─797 /lib/systemd/systemd --user
           │ └─798 "(sd-pam)"
           └─session.slice
             └─dbus.service
               └─829 /usr/bin/dbus-daemon --session --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only

Hmm, I would have thought the shell would trap the SIGTERM, forward it to the task, then wait for the task to exit, at least until the SIGKILLs come around.

It looks like what's happening is that the IMP gets the SIGTERM and immediately exits with 128+15; then systemd, having noticed that the "main process" exited, does a SIGKILL on the process group.

Oct 29 16:14:17 picl1 systemd[797]: Starting imp-kill-1-f3CDZDLRras.service - User workload...
Oct 29 16:14:17 picl1 systemd[797]: Started imp-kill-1-f3CDZDLRras.service - User workload.
Oct 29 16:14:17 picl1 systemd[797]: Stopped imp-kill-1-f3CDZDLRras.service - User workload.
Oct 29 16:14:17 picl1 systemd[797]: imp-shell-1-f3CDZDLRras.service: Main process exited, code=exited, status=143/n/a
Oct 29 16:14:17 picl1 systemd[797]: imp-shell-1-f3CDZDLRras.service: Failed to kill control group /user.slice/user-500.slice/user@500.service/app.slice/imp-shell-1-f3CDZDLRras.service, ignoring: Operation not permitted
Oct 29 16:14:17 picl1 systemd[797]: imp-shell-1-f3CDZDLRras.service: Killing process 15523 (tail) with signal SIGKILL.
Oct 29 16:14:17 picl1 systemd[797]: imp-shell-1-f3CDZDLRras.service: Failed to kill control group /user.slice/user-500.slice/user@500.service/app.slice/imp-shell-1-f3CDZDLRras.service, ignoring: Operation not permitted
Oct 29 16:14:17 picl1 systemd[797]: imp-shell-1-f3CDZDLRras.service: Failed to kill control group /user.slice/user-500.slice/user@500.service/app.slice/imp-shell-1-f3CDZDLRras.service, ignoring: Operation not permitted
Oct 29 16:14:17 picl1 systemd[797]: imp-shell-1-f3CDZDLRras.service: Killing process 15523 (tail) with signal SIGKILL.
Oct 29 16:14:17 picl1 systemd[797]: imp-shell-1-f3CDZDLRras.service: Failed to kill control group /user.slice/user-500.slice/user@500.service/app.slice/imp-shell-1-f3CDZDLRras.service, ignoring: Operation not permitted
Oct 29 16:14:17 picl1 systemd[797]: imp-shell-1-f3CDZDLRras.service: Failed with result 'exit-code'.
Oct 29 16:14:17 picl1 systemd[797]: imp-shell-1-f3CDZDLRras.service: Unit process 15523 (tail) remains running after unit stopped.

garlick commented 3 weeks ago

Ah, so things work properly if I run the tail command directly; that is, the IMP waits for flux-shell, which waits for the stuck process. I was running it indirectly with sh -c, and there it seems sh exits even though a child is stuck, so flux-shell and the IMP think the job terminated successfully and exit.
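
To illustrate the difference (hypothetical command lines, not the exact jobs from the test above):

# direct: flux-shell is the parent of tail and waits for it, and the IMP waits for flux-shell
$ flux run tail -f /mnt/stuck/file

# indirect: sh exits on SIGTERM even though its child tail is still blocked in NFS,
# so flux-shell and the IMP conclude the job terminated and exit
$ flux run sh -c 'tail -f /mnt/stuck/file'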

I'm not sure what the right behavior is for this case, although this doesn't seem right.

That diversion was actually off topic for this issue. Now that I know how to make things hang, I'll repeat the experiment of shutting the system down with a job that's stuck running.

Edit: well, now I'm getting inconsistent results. Maybe take this with a grain of salt until I can confirm.

garlick commented 2 weeks ago

I think we could consider this for inclusion in our next tag. If included, we could try some tests on fluke where the system is shut down with running jobs and see what pops up. If there is a problem and we choose not to roll it out, a one-line change to the main systemd unit file would disable it.
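
For example, assuming the timeout is enabled by an extra broker option on the ExecStart line as sketched earlier (the attribute name is a placeholder), disabling it would just mean dropping that argument:

# enabled
ExecStart=/usr/bin/flux broker -Sbroker.cleanup-timeout=45 ...
# disabled: the broker waits indefinitely again and only systemd's TimeoutStopSec applies
ExecStart=/usr/bin/flux broker ...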

garlick commented 2 weeks ago

Setting MWP then.

codecov[bot] commented 2 weeks ago

Codecov Report

Attention: Patch coverage is 96.15385% with 1 line in your changes missing coverage. Please review.

Project coverage is 83.56%. Comparing base (9411280) to head (c9db7aa).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/broker/state_machine.c | 96.15% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@             Coverage Diff             @@
##           master    #6397       +/-   ##
===========================================
+ Coverage   53.96%   83.56%   +29.60%
===========================================
  Files         476      524       +48
  Lines       80288    87723     +7435
===========================================
+ Hits        43328    73307    +29979
+ Misses      36960    14416    -22544
```

| [Files with missing lines](https://app.codecov.io/gh/flux-framework/flux-core/pull/6397?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework) | Coverage Δ | |
|---|---|---|
| [src/broker/state\_machine.c](https://app.codecov.io/gh/flux-framework/flux-core/pull/6397?src=pr&el=tree&filepath=src%2Fbroker%2Fstate_machine.c&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-c3JjL2Jyb2tlci9zdGF0ZV9tYWNoaW5lLmM=) | `81.90% <96.15%> (+19.49%)` | :arrow_up: |

... and [446 files with indirect coverage changes](https://app.codecov.io/gh/flux-framework/flux-core/pull/6397/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework)