grondo opened 2 years ago
There are several problems I've observed here:
The job is terminated via killpg(2), and this SIGTERM likely gets delivered to the child job broker(s), then the shell forwards that signal to job tasks (SIGTERM delivered again):
2022-01-14T15:28:08.152549Z broker.info[0]: signal 15
state_machine_kill state=4
2022-01-14T15:28:08.153641Z broker.err[0]: rc2.0: flux mini run sleep 100 Terminated (rc=143) 57.5s
2022-01-14T15:28:08.153728Z broker.info[0]: rc2-fail: run->cleanup 57.4789s
2022-01-14T15:28:08.167864Z broker.info[0]: signal 15
2022-01-14T15:28:08.270465Z broker.info[0]: cleanup.0: /bin/bash -c flux queue stop --quiet Exited (rc=0) 0.1s
2022-01-14T15:28:08.296005Z broker.info[0]: signal 15
2022-01-14T15:28:08.300358Z broker.err[0]: cleanup.1: /bin/bash -c flux job cancelall --user=all --quiet -f --states RUN Terminated (rc=143) 0.0s
2022-01-14T15:28:08.344153Z broker.info[0]: signal 15
2022-01-14T15:28:08.344784Z broker.err[0]: cleanup.2: /bin/bash -c flux queue idle --quiet Terminated (rc=143) 0.0s
2022-01-14T15:28:08.344905Z broker.info[0]: cleanup-fail: cleanup->shutdown 0.191155s
2022-01-14T15:28:08.345043Z broker.info[0]: children-none: shutdown->finalize 0.0001204s
2022-01-14T15:28:08.890271Z broker.err[0]: rc3.0: flux: flux_open: No such file or directory
2022-01-14T15:28:08.891980Z broker.err[0]: rc3.0: /usr/src/etc/rc3: line 8: test: 0: unary operator expected
2022-01-14T15:28:08.892042Z broker.err[0]: rc3.0: /usr/src/etc/rc3: line 8: test: 0: unary operator expected
2022-01-14T15:28:09.429647Z broker.err[0]: rc3.0: flux-module: flux_open: No such file or directory
2022-01-14T15:28:09.430685Z broker.err[0]: rc3.0: /usr/src/etc/rc3: line 8: test: 0: unary operator expected
2022-01-14T15:28:09.430742Z broker.err[0]: rc3.0: /usr/src/etc/rc3: line 8: test: 0: unary operator expected
2022-01-14T15:28:09.430778Z broker.err[0]: rc3.0: /usr/src/etc/rc3: line 8: test: 0: unary operator expected
2022-01-14T15:28:09.966650Z broker.err[0]: rc3.0: flux-module: flux_open: No such file or directory
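As a side note on the double delivery: a single group-wide killpg(2) already reaches any child (including a child broker) that is still in the sender's process group, independent of whatever the shell forwards afterwards. A minimal standalone sketch, not flux-core source; the handler, sleeps, and printouts are just for illustration:

```c
/* Standalone illustration (not flux-core source): one killpg(2) call
 * delivers SIGTERM to every process in the group, so a child that has
 * not changed its process group sees the signal directly, in addition
 * to anything its parent later forwards. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile sig_atomic_t got_term = 0;

static void on_term (int sig)
{
    got_term = 1;
}

int main (void)
{
    struct sigaction sa = { 0 };
    sa.sa_handler = on_term;
    sa.sa_flags = SA_RESTART;            /* keep waitpid() from failing with EINTR */
    sigemptyset (&sa.sa_mask);
    sigaction (SIGTERM, &sa, NULL);      /* disposition is inherited across fork */

    pid_t pid = fork ();
    if (pid == 0) {                      /* child: stays in the parent's group */
        sleep (5);                       /* returns early once SIGTERM is handled */
        fprintf (stderr, "child %d: got_term=%d\n", (int)getpid (), (int)got_term);
        _exit (0);
    }

    sleep (1);                           /* give the child a moment to start */
    killpg (getpgrp (), SIGTERM);        /* one call: delivered to parent AND child */

    waitpid (pid, NULL, 0);
    fprintf (stderr, "parent %d: got_term=%d\n", (int)getpid (), (int)got_term);
    return 0;
}
```

Here one killpg(2) call is seen by both the parent and the child; any explicit forwarding on top of that is a second delivery of the same signal.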
Another possible issue: When a job is terminated, SIGTERM is sent to all ranks. For a multi-rank Flux instance, this means all brokers get SIGTERM (possibly multiple times, see above), not just rank 0. This could cause problems with orderly shutdown of the instance.
I could be misreading the code, but another potential issue: it appears that if the broker gets a termination signal while running cleanup tasks or rc3, those processes are terminated.
I (probably naively) think that cleanup tasks and rc3 should not be prematurely terminated if the broker receives a signal while they are running. For example, if the cleanup task that terminates all jobs is killed, then cleanup of subinstances won't occur, or if the cleanup task that waits for all jobs is terminated then the instance will exit before all sub-jobs are fully complete.
I've probably forgotten some other cases we need to handle during termination though.
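To make the "don't prematurely terminate cleanup" behavior concrete, here is a generic POSIX sketch, not how the broker implements this today (the setpgid() call, SA_RESTART, and the sleep stand-in are illustrative assumptions): a termination signal that arrives while a cleanup child runs is only recorded, and the child is moved to its own process group so a group-wide killpg(2) aimed at the parent does not take it down either.

```c
/* Generic POSIX sketch of "don't prematurely terminate cleanup":
 * a SIGTERM that arrives while the cleanup child runs is only recorded,
 * and the child is moved to its own process group so a group-wide
 * killpg(2) aimed at the parent does not also terminate it. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile sig_atomic_t term_pending = 0;

static void on_term (int sig)
{
    term_pending = 1;                   /* defer; do not kill the cleanup child */
}

int main (void)
{
    struct sigaction sa = { 0 };
    sa.sa_handler = on_term;
    sa.sa_flags = SA_RESTART;           /* let waitpid() ride out the signal */
    sigemptyset (&sa.sa_mask);
    sigaction (SIGTERM, &sa, NULL);

    pid_t cleanup = fork ();
    if (cleanup == 0) {
        setpgid (0, 0);                 /* own process group: shielded from killpg */
        /* stand-in for a cleanup task such as `flux job cancelall ...` */
        execlp ("sleep", "sleep", "2", (char *) NULL);
        _exit (127);
    }

    int status;
    waitpid (cleanup, &status, 0);      /* cleanup runs to completion */

    if (term_pending)
        fprintf (stderr, "SIGTERM arrived during cleanup; acting on it now\n");
    return 0;
}
```

The setpgid() and SA_RESTART details are just one possible way to get that effect; the real fix would presumably live in the broker's state machine rather than in the cleanup tasks themselves.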
Also, another related thought: currently when the broker gets SIGTERM it terminates the initial program, which seems reasonable. However, in a case where the initial program is a batch script running something like flux mini run or flux mini submit --watch, it might be better to terminate all jobs when SIGTERM is received, but allow rc2 to continue running (under some grace period). This would let flux mini run or flux mini submit run to completion to process any pending output and report the exit status of tasks.
I'm not sure this is even possible with the way the broker state machine works, but just thought I'd mention it while I'm thinking about it.
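For illustration only, a rough sketch of the grace-period idea (the 30 second value and the shell command standing in for rc2 are made up, and actual job cancellation is reduced to a comment; this is not how the broker behaves today):

```c
/* Generic sketch of the grace-period idea (not current broker behavior):
 * on SIGTERM, the jobs would be terminated right away, but the initial
 * program (rc2) is given time to finish reporting before being forced out. */
#include <signal.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile sig_atomic_t term_pending = 0;

static void on_term (int sig)
{
    term_pending = 1;
}

int main (void)
{
    const int grace_sec = 30;            /* hypothetical grace period */
    signal (SIGTERM, on_term);

    pid_t rc2 = fork ();                 /* stand-in for the initial program */
    if (rc2 == 0) {
        execlp ("sh", "sh", "-c", "sleep 5; echo rc2 done", (char *) NULL);
        _exit (127);
    }

    time_t deadline = 0;
    for (;;) {
        if (term_pending && deadline == 0) {
            /* here the jobs would be terminated (roughly what the
             * `flux job cancelall` cleanup task does), while rc2 is
             * allowed to keep running for up to grace_sec */
            deadline = time (NULL) + grace_sec;
        }
        int status;
        if (waitpid (rc2, &status, WNOHANG) != 0)
            break;                       /* rc2 finished on its own (or wait failed) */
        if (deadline != 0 && time (NULL) >= deadline) {
            kill (rc2, SIGKILL);         /* grace period expired: force it out */
            waitpid (rc2, &status, 0);
            break;
        }
        sleep (1);                       /* interrupted early if SIGTERM arrives */
    }
    return 0;
}
```

The interesting part is only the deadline logic: jobs get signaled immediately, while rc2 keeps running until it exits on its own or the grace period expires.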
I noticed after make check on recent master that I have orphan jobs left around that were part of tests. Some digging revealed these were specifically left over from the t2803-pstree.t "start a recursive job" test: it appears that when this nested job is terminated at the end of the sharness test, more often than not some of the child job's processes or even brokers remain running.
This can be reproduced via the following:
The command above captures output from the top level instance in level1.out and the sub-instances in logfile.1 and logfile.2. For the "failing" instance still seen running above, the output has only the following:
The sub-instance that did exit normally has the following log: