update multi-user signaling to track flux-security 0.13.0 IMP changes

garlick commented 3 weeks ago

Problem: job-exec uses flux imp kill to deliver SIGKILL to the flux-shell when shell signaling methods fail to clean up a multi-user job, but the flux imp kill sub-command is being deprecated in favor of having the IMP forward signals (per RFC 15).

This changes job-exec to send SIGUSR1 (which RFC 15 defines as a proxy for SIGKILL) directly to the IMP in that case.

To make it easier to coordinate the flux-core and flux-security changes, we'll add the #6409 fix here as well.

grondo commented 3 weeks ago

This can't be merged until a flux-security v0.12 is tagged which supports SIGUSR1 in the IMP. (SIGUSR1 will cause the current IMP to terminate immediately)

garlick commented 3 weeks ago

Right, sorry, probably should've been a WIP.

garlick commented 3 weeks ago

Wow, don't know how I missed all that. OK, updated, and since the flux-security 0.12 tag was just pushed, this actually has a chance of passing CI so we'll see.

garlick commented 3 weeks ago

Some t9000-system.t tests are failing where we check that flux exec --with-imp forwards signals. Here's an excerpt from one failing test that uses the IMP to run a shell script that calls sleep:

sending signal 2 to 1 running processes
sudo timed out after 30.0s
test_expect_code: command exited with 137, we wanted 130 run_timeout 30 sudo -u flux ./test_signal.sh INT
not ok 8 - 0002-exec-with-imp.t: flux exec --with-imp forwards signals

I can recreate this environment this on my test system by creating a shell script that runs sleep, configuring it in the IMP, and manually running flux exec --with-imp, and sure enough, SIGINT to flux-exec appears to have no effect, and sending a SIGTERM kills the shell and the IMP but leaves the sleep and doesn't terminate.

I did observe:

manually sending SIGINT to the shell (script) also has no effect
manually sending SIGTERM to the shell (script) causes it to terminate but orphans the sleep command, which in turn prevented the subprocess server from finalizing the exec stream with ENODATA because the stdout/stderr streams were still open.

When I run the same script directly with flux exec (no IMP), those signals cause everything to wrap up as expected.

Still pondering this one - just wanted to post an update!

garlick commented 3 weeks ago

Possibly the IMP's internal fwd_signal() needs to behave differently when the target is not a shell. Currently it treats SIGUSR1 (surrogate for SIGKILL) specially and assumes it (the IMP) is responsible for cleaning up its child and grandchildren. Other signals are delegated to the direct child for distribution, which sounds right for flux-shell but perhaps not for a shell script. Maybe it needs a flag so when it is called from imp run it assumes it is cleaning up everything for any signal.

Another observation is that flux_subprocess_kill() by default uses killpg() while the IMP uses kill() when signaling the direct child.

garlick commented 3 weeks ago

After chatting with @grondo, I opened flux-framework/flux-security#194 to change the signal forwarding behavior of imp run.

garlick commented 3 weeks ago

Updated to require flux-security 0.13, which should address the test failure. :crossed_fingers:

garlick commented 3 weeks ago

I'll set MWP on this one too.

codecov[bot] commented 3 weeks ago

Codecov Report

Attention: Patch coverage is 78.57143% with 3 lines in your changes missing coverage. Please review.

Project coverage is 83.62%. Comparing base (9411280) to head (f6cf51d). Report is 8 commits behind head on master.

Files with missing lines	Patch %	Lines
src/cmd/flux-exec.c	71.42%	2 Missing :warning:
src/modules/job-exec/job-exec.c	80.00%	1 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #6408 +/- ## =========================================== + Coverage 53.96% 83.62% +29.65% =========================================== Files 476 524 +48 Lines 80288 87603 +7315 =========================================== + Hits 43328 73254 +29926 + Misses 36960 14349 -22611 ``` | [Files with missing lines](https://app.codecov.io/gh/flux-framework/flux-core/pull/6408?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework) | Coverage Δ | | |---|---|---| | [src/common/libsubprocess/bulk-exec.c](https://app.codecov.io/gh/flux-framework/flux-core/pull/6408?src=pr&el=tree&filepath=src%2Fcommon%2Flibsubprocess%2Fbulk-exec.c&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-c3JjL2NvbW1vbi9saWJzdWJwcm9jZXNzL2J1bGstZXhlYy5j) | `83.42% <100.00%> (+22.29%)` | :arrow_up: | | [src/modules/job-exec/exec.c](https://app.codecov.io/gh/flux-framework/flux-core/pull/6408?src=pr&el=tree&filepath=src%2Fmodules%2Fjob-exec%2Fexec.c&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-c3JjL21vZHVsZXMvam9iLWV4ZWMvZXhlYy5j) | `81.84% <ø> (+34.22%)` | :arrow_up: | | [src/modules/job-manager/housekeeping.c](https://app.codecov.io/gh/flux-framework/flux-core/pull/6408?src=pr&el=tree&filepath=src%2Fmodules%2Fjob-manager%2Fhousekeeping.c&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-c3JjL21vZHVsZXMvam9iLW1hbmFnZXIvaG91c2VrZWVwaW5nLmM=) | `82.67% <ø> (+68.39%)` | :arrow_up: | | [src/modules/job-manager/plugins/perilog.c](https://app.codecov.io/gh/flux-framework/flux-core/pull/6408?src=pr&el=tree&filepath=src%2Fmodules%2Fjob-manager%2Fplugins%2Fperilog.c&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-c3JjL21vZHVsZXMvam9iLW1hbmFnZXIvcGx1Z2lucy9wZXJpbG9nLmM=) | `82.72% <ø> (+40.90%)` | :arrow_up: | | [src/modules/job-exec/job-exec.c](https://app.codecov.io/gh/flux-framework/flux-core/pull/6408?src=pr&el=tree&filepath=src%2Fmodules%2Fjob-exec%2Fjob-exec.c&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-c3JjL21vZHVsZXMvam9iLWV4ZWMvam9iLWV4ZWMuYw==) | `77.01% <80.00%> (+36.25%)` | :arrow_up: | | [src/cmd/flux-exec.c](https://app.codecov.io/gh/flux-framework/flux-core/pull/6408?src=pr&el=tree&filepath=src%2Fcmd%2Fflux-exec.c&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-c3JjL2NtZC9mbHV4LWV4ZWMuYw==) | `81.69% <71.42%> (+19.49%)` | :arrow_up: | ... and [441 files with indirect coverage changes](https://app.codecov.io/gh/flux-framework/flux-core/pull/6408/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework)

flux-framework / flux-core

update multi-user signaling to track flux-security 0.13.0 IMP changes #6408

Codecov Report