grondo opened 2 years ago
There are several problems I've observed here:
The job is terminated via killpg(2), and this SIGTERM likely gets delivered to the child job broker(s), then the shell forwards that signal to job tasks (SIGTERM delivered again):
2022-01-14T15:28:08.152549Z broker.info[0]: signal 15
state_machine_kill state=4
2022-01-14T15:28:08.153641Z broker.err[0]: rc2.0: flux mini run sleep 100 Terminated (rc=143) 57.5s
2022-01-14T15:28:08.153728Z broker.info[0]: rc2-fail: run->cleanup 57.4789s
2022-01-14T15:28:08.167864Z broker.info[0]: signal 15
2022-01-14T15:28:08.270465Z broker.info[0]: cleanup.0: /bin/bash -c flux queue stop --quiet Exited (rc=0) 0.1s
2022-01-14T15:28:08.296005Z broker.info[0]: signal 15
2022-01-14T15:28:08.300358Z broker.err[0]: cleanup.1: /bin/bash -c flux job cancelall --user=all --quiet -f --states RUN Terminated (rc=143) 0.0s
2022-01-14T15:28:08.344153Z broker.info[0]: signal 15
2022-01-14T15:28:08.344784Z broker.err[0]: cleanup.2: /bin/bash -c flux queue idle --quiet Terminated (rc=143) 0.0s
2022-01-14T15:28:08.344905Z broker.info[0]: cleanup-fail: cleanup->shutdown 0.191155s
2022-01-14T15:28:08.345043Z broker.info[0]: children-none: shutdown->finalize 0.0001204s
2022-01-14T15:28:08.890271Z broker.err[0]: rc3.0: flux: flux_open: No such file or directory
2022-01-14T15:28:08.891980Z broker.err[0]: rc3.0: /usr/src/etc/rc3: line 8: test: 0: unary operator expected
2022-01-14T15:28:08.892042Z broker.err[0]: rc3.0: /usr/src/etc/rc3: line 8: test: 0: unary operator expected
2022-01-14T15:28:09.429647Z broker.err[0]: rc3.0: flux-module: flux_open: No such file or directory
2022-01-14T15:28:09.430685Z broker.err[0]: rc3.0: /usr/src/etc/rc3: line 8: test: 0: unary operator expected
2022-01-14T15:28:09.430742Z broker.err[0]: rc3.0: /usr/src/etc/rc3: line 8: test: 0: unary operator expected
2022-01-14T15:28:09.430778Z broker.err[0]: rc3.0: /usr/src/etc/rc3: line 8: test: 0: unary operator expected
2022-01-14T15:28:09.966650Z broker.err[0]: rc3.0: flux-module: flux_open: No such file or directory
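As a side note on the double delivery: a single group-wide killpg(2) already reaches any child (including a child broker) that is still in the sender's process group, independent of whatever the shell forwards afterwards. A minimal standalone sketch, not flux-core source; the handler, sleeps, and printouts are just for illustration:

```c
/* Standalone illustration (not flux-core source): one killpg(2) call
 * delivers SIGTERM to every process in the group, so a child that has
 * not changed its process group sees the signal directly, in addition
 * to anything its parent later forwards. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile sig_atomic_t got_term = 0;

static void on_term (int sig)
{
    got_term = 1;
}

int main (void)
{
    struct sigaction sa = { 0 };
    sa.sa_handler = on_term;
    sa.sa_flags = SA_RESTART;            /* keep waitpid() from failing with EINTR */
    sigemptyset (&sa.sa_mask);
    sigaction (SIGTERM, &sa, NULL);      /* disposition is inherited across fork */

    pid_t pid = fork ();
    if (pid == 0) {                      /* child: stays in the parent's group */
        sleep (5);                       /* returns early once SIGTERM is handled */
        fprintf (stderr, "child %d: got_term=%d\n", (int)getpid (), (int)got_term);
        _exit (0);
    }

    sleep (1);                           /* give the child a moment to start */
    killpg (getpgrp (), SIGTERM);        /* one call: delivered to parent AND child */

    waitpid (pid, NULL, 0);
    fprintf (stderr, "parent %d: got_term=%d\n", (int)getpid (), (int)got_term);
    return 0;
}
```

Here one killpg(2) call is seen by both the parent and the child; any explicit forwarding on top of that is a second delivery of the same signal.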
Another possible issue: When a job is terminated, SIGTERM is sent to all ranks. For a multi-rank Flux instance, this means all brokers get SIGTERM (possibly multiple times, see above), not just rank 0. This could cause problems with orderly shutdown of the instance.
I could be misreading the code, but another potential issue: it appears that if the broker gets a termination signal while running cleanup tasks or rc3, those processes are terminated.
I (probably naively) think that cleanup tasks and rc3 should not be prematurely terminated if the broker receives a signal while they are running. For example, if the cleanup task that terminates all jobs is killed, then cleanup of subinstances won't occur, or if the cleanup task that waits for all jobs is terminated then the instance will exit before all sub-jobs are fully complete.
I've probably forgotten some other cases we need to handle during termination though.
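To make the "don't prematurely terminate cleanup" behavior concrete, here is a generic POSIX sketch, not how the broker implements this today (the setpgid() call, SA_RESTART, and the sleep stand-in are illustrative assumptions): a termination signal that arrives while a cleanup child runs is only recorded, and the child is moved to its own process group so a group-wide killpg(2) aimed at the parent does not take it down either.

```c
/* Generic POSIX sketch of "don't prematurely terminate cleanup":
 * a SIGTERM that arrives while the cleanup child runs is only recorded,
 * and the child is moved to its own process group so a group-wide
 * killpg(2) aimed at the parent does not also terminate it. */
#include <signal.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile sig_atomic_t term_pending = 0;

static void on_term (int sig)
{
    term_pending = 1;                   /* defer; do not kill the cleanup child */
}

int main (void)
{
    struct sigaction sa = { 0 };
    sa.sa_handler = on_term;
    sa.sa_flags = SA_RESTART;           /* let waitpid() ride out the signal */
    sigemptyset (&sa.sa_mask);
    sigaction (SIGTERM, &sa, NULL);

    pid_t cleanup = fork ();
    if (cleanup == 0) {
        setpgid (0, 0);                 /* own process group: shielded from killpg */
        /* stand-in for a cleanup task such as `flux job cancelall ...` */
        execlp ("sleep", "sleep", "2", (char *) NULL);
        _exit (127);
    }

    int status;
    waitpid (cleanup, &status, 0);      /* cleanup runs to completion */

    if (term_pending)
        fprintf (stderr, "SIGTERM arrived during cleanup; acting on it now\n");
    return 0;
}
```

The setpgid() and SA_RESTART details are just one possible way to get that effect; the real fix would presumably live in the broker's state machine rather than in the cleanup tasks themselves.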
Also, another related thought: currently when the broker gets SIGTERM it terminates the initial program, which seems reasonable. However, in a case where the initial program is a batch script running something like flux mini run or flux mini submit --watch, it might be better to terminate all jobs when SIGTERM is received, but allow rc2 to continue running (under some grace period). This would let flux mini run or flux mini submit run to completion to process any pending output and report the exit status of tasks.
I'm not sure this is even possible with the way the broker state machine works, but just thought I'd mention it while I'm thinking about it.
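For illustration only, a rough sketch of the grace-period idea (the 30 second value and the shell command standing in for rc2 are made up, and actual job cancellation is reduced to a comment; this is not how the broker behaves today):

```c
/* Generic sketch of the grace-period idea (not current broker behavior):
 * on SIGTERM, the jobs would be terminated right away, but the initial
 * program (rc2) is given time to finish reporting before being forced out. */
#include <signal.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile sig_atomic_t term_pending = 0;

static void on_term (int sig)
{
    term_pending = 1;
}

int main (void)
{
    const int grace_sec = 30;            /* hypothetical grace period */
    signal (SIGTERM, on_term);

    pid_t rc2 = fork ();                 /* stand-in for the initial program */
    if (rc2 == 0) {
        execlp ("sh", "sh", "-c", "sleep 5; echo rc2 done", (char *) NULL);
        _exit (127);
    }

    time_t deadline = 0;
    for (;;) {
        if (term_pending && deadline == 0) {
            /* here the jobs would be terminated (roughly what the
             * `flux job cancelall` cleanup task does), while rc2 is
             * allowed to keep running for up to grace_sec */
            deadline = time (NULL) + grace_sec;
        }
        int status;
        if (waitpid (rc2, &status, WNOHANG) != 0)
            break;                       /* rc2 finished on its own (or wait failed) */
        if (deadline != 0 && time (NULL) >= deadline) {
            kill (rc2, SIGKILL);         /* grace period expired: force it out */
            waitpid (rc2, &status, 0);
            break;
        }
        sleep (1);                       /* interrupted early if SIGTERM arrives */
    }
    return 0;
}
```

The interesting part is only the deadline logic: jobs get signaled immediately, while rc2 keeps running until it exits on its own or the grace period expires.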
I noticed after make check on recent master that I have orphan jobs left around that were part of tests. Some digging revealed these were specifically left over from the t2803-pstree.t "start a recursive job" test: it appears that when this nested job is terminated at the end of the sharness test, more often than not some of the child job's processes or even brokers remain running.
This can be reproduced via the following:
The command above captures output from the top level instance in level1.out and the sub-instances in logfile.1 and logfile.2. For the "failing" instance still seen running above, the output has only the following:
The sub-instance that did exit normally has the following log: