Closed garlick closed 3 weeks ago
This can't be merged until a flux-security v0.12 that supports SIGUSR1 in the IMP is tagged. (SIGUSR1 will cause the current IMP to terminate immediately.)
Right, sorry, probably should've been a WIP.
Wow, don't know how I missed all that. OK, updated, and since the flux-security 0.12 tag was just pushed, this actually has a chance of passing CI so we'll see.
Some `t9000-system.t` tests are failing where we check that `flux exec --with-imp` forwards signals. Here's an excerpt from one failing test that uses the IMP to run a shell script that calls `sleep`:
sending signal 2 to 1 running processes
sudo timed out after 30.0s
test_expect_code: command exited with 137, we wanted 130 run_timeout 30 sudo -u flux ./test_signal.sh INT
not ok 8 - 0002-exec-with-imp.t: flux exec --with-imp forwards signals
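For reference, the two exit codes in that excerpt decode under the usual shell convention of 128 plus the signal number: 130 is death by SIGINT (2), which is what the test wants, and 137 is SIGKILL (9), presumably delivered once the 30s timeout fired. A minimal C sketch of that arithmetic follows. `shell_exit_code()` and `signal_from_exit_code()` are hypothetical helper names used only for illustration, not part of flux-core or the test suite.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <sys/wait.h>

/* Hypothetical helper: convert a wait(2) status into the exit code a
 * POSIX shell would report, i.e. 128 + N for death by signal N. */
int shell_exit_code (int wait_status)
{
    if (WIFEXITED (wait_status))
        return WEXITSTATUS (wait_status);
    if (WIFSIGNALED (wait_status))
        return 128 + WTERMSIG (wait_status);
    return -1; /* stopped or continued: not a final status */
}

/* Hypothetical helper: recover the signal number from a shell-style
 * exit code, or 0 if the code does not indicate death by signal. */
int signal_from_exit_code (int code)
{
    assert (code >= 0 && code <= 255);
    return code > 128 ? code - 128 : 0;
}
```

With these, `signal_from_exit_code (130)` yields SIGINT and `signal_from_exit_code (137)` yields SIGKILL, which is consistent with the script being killed by the timeout rather than by the forwarded signal.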
I can recreate this environment on my test system by creating a shell script that runs `sleep`, configuring it in the IMP, and manually running `flux exec --with-imp`. Sure enough, SIGINT to `flux-exec` appears to have no effect, and sending a SIGTERM kills the shell and the IMP but leaves the `sleep` running, and `flux exec` doesn't terminate.
I did observe that nothing terminated the `sleep` command, which in turn prevented the subprocess server from finalizing the exec stream with ENODATA because the stdout/stderr streams were still open.

When I run the same script directly with `flux exec` (no IMP), those signals cause everything to wrap up as expected.
Still pondering this one - just wanted to post an update!
Possibly the IMP's internal `fwd_signal()` needs to behave differently when the target is not `flux-shell`. Currently it treats SIGUSR1 (surrogate for SIGKILL) specially and assumes it (the IMP) is responsible for cleaning up its child and grandchildren. Other signals are delegated to the direct child for distribution, which sounds right for `flux-shell` but perhaps not for a shell script. Maybe it needs a flag so that when it is called from `imp run` it assumes it is cleaning up everything for any signal.
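To make the proposed flag concrete, here is a rough sketch of the decision only. It is not the IMP's actual `fwd_signal()`; `struct fwd_action`, `plan_forward()`, and the `cleanup_all` flag are hypothetical names, and how the IMP would actually reach "everything" is left out.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdbool.h>

/* Hypothetical description of what to do with one incoming signal. */
struct fwd_action {
    int signum;      /* signal to deliver */
    bool everything; /* true: the IMP signals child and descendants itself,
                      * false: deliver to the direct child only */
};

/* Hypothetical policy: SIGUSR1 always stands in for SIGKILL and is
 * handled by the IMP for the whole tree.  Other signals go to the
 * direct child unless cleanup_all is set (e.g. for 'imp run'). */
struct fwd_action plan_forward (int sig, bool cleanup_all)
{
    assert (sig > 0);
    if (sig == SIGUSR1)
        return (struct fwd_action){ .signum = SIGKILL, .everything = true };
    return (struct fwd_action){ .signum = sig, .everything = cleanup_all };
}
```

Under this sketch, `plan_forward (SIGINT, false)` keeps the current delegate-to-child behavior, while `plan_forward (SIGINT, true)` would have the IMP deliver SIGINT to everything, as suggested for a shell script target.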
Another observation is that `flux_subprocess_kill()` by default uses `killpg()` while the IMP uses `kill()` when signaling the direct child.
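The practical difference between the two calls can be shown with a small self-contained POSIX demo. This is not flux or IMP code, and `pipe_closes_after_signal()` is a hypothetical name. A child in its own process group forks a grandchild, both holding the write end of a pipe (standing in for the stdout/stderr streams). The function reports whether the pipe reaches EOF after SIGTERM is sent with `kill()` or with `killpg()`.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <poll.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo, not flux code.  Returns 1 if the pipe reached EOF
 * within timeout_ms after signaling, 0 if a writer still holds it open,
 * -1 on error. */
int pipe_closes_after_signal (bool use_killpg, int timeout_ms)
{
    int pfd[2];
    char c = 'r';

    assert (timeout_ms >= 0);
    if (pipe (pfd) < 0)
        return -1;
    pid_t child = fork ();
    if (child < 0) {
        close (pfd[0]);
        close (pfd[1]);
        return -1;
    }
    if (child == 0) {
        setpgid (0, 0);          /* become process group leader */
        close (pfd[0]);
        pid_t gc = fork ();
        if (gc < 0)
            _exit (1);
        if (gc == 0) {
            /* Grandchild: announce readiness, then hold the pipe open */
            if (write (pfd[1], &c, 1) != 1)
                _exit (1);
        }
        for (;;)
            pause ();            /* child and grandchild both block here */
    }
    setpgid (child, child);      /* same call in the parent avoids a race */
    close (pfd[1]);

    int result = -1;
    if (read (pfd[0], &c, 1) == 1) {   /* wait until the grandchild exists */
        if (use_killpg)
            killpg (child, SIGTERM);   /* whole process group */
        else
            kill (child, SIGTERM);     /* direct child only */
        struct pollfd p = { .fd = pfd[0], .events = POLLIN };
        int n = poll (&p, 1, timeout_ms);
        if (n >= 0)
            result = (n > 0) ? 1 : 0;  /* ready means all writers exited */
    }
    killpg (child, SIGKILL);     /* clean up any survivors */
    waitpid (child, NULL, 0);
    close (pfd[0]);
    return result;
}
```

With a timeout of a few hundred milliseconds, the `killpg()` variant should see EOF promptly, while the `kill()` variant should time out because the grandchild survives and keeps the pipe open, which resembles the stuck exec stream seen with the IMP.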
After chatting with @grondo, I opened flux-framework/flux-security#194 to change the signal forwarding behavior of `imp run`.
Updated to require flux-security 0.13, which should address the test failure. :crossed_fingers:
I'll set MWP on this one too.
Attention: Patch coverage is 78.57143% with 3 lines in your changes missing coverage. Please review.

Project coverage is 83.62%. Comparing base (`9411280`) to head (`f6cf51d`). Report is 8 commits behind head on master.

Files with missing lines | Patch % | Lines |
---|---|---|
src/cmd/flux-exec.c | 71.42% | 2 Missing :warning: |
src/modules/job-exec/job-exec.c | 80.00% | 1 Missing :warning: |
Problem: job-exec uses `flux imp kill` to deliver SIGKILL to the flux-shell when shell signaling methods fail to clean up a multi-user job, but the `flux imp kill` sub-command is being deprecated in favor of having the IMP forward signals (per RFC 15).

This changes job-exec to send SIGUSR1 (which RFC 15 defines as a proxy for SIGKILL) directly to the IMP in that case.
To make it easier to coordinate the flux-core and flux-security changes, we'll add the #6409 fix here as well.