flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0

t2410-sdexec-memlimit.t hangs after job-exec switched to FLUX_SUBPROCESS_FLAGS_UNBUF #6019

Closed by garlick 4 months ago

garlick commented 4 months ago

Problem: after #5975 was merged, I noticed that t2410-sdexec-memlimit.t hangs here:

...
ok 9 - remaining within memory.max works

expecting success: 
    test_expect_code_or_killed 1 $sdexec \
        --setopt=SDEXEC_PROP_MemoryMax=100M \
        $stress --timeout 60 --vm-keep --vm 1 --vm-bytes 200M

stress: info: [211138] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
0: Terminated
ok 10 - exceeding memory.max causes exec failure

expecting success: 
    test_expect_code_or_killed 1 flux run \
        $stress --timeout 60 --vm-keep --vm 1 --vm-bytes 200M

stress: info: [211181] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [211181] (416) <-- worker 211182 got signal 9
stress: WARN: [211181] (418) now reaping child worker processes
stress: FAIL: [211181] (452) failed run completed in 0s

Reverting the change to job-exec fixes it:

diff --git a/src/modules/job-exec/bulk-exec.c b/src/modules/job-exec/bulk-exec.c
index b5741d2d9..a5a897601 100644
--- a/src/modules/job-exec/bulk-exec.c
+++ b/src/modules/job-exec/bulk-exec.c
@@ -265,7 +265,7 @@ static struct exec_cmd *exec_cmd_create (const struct idset *ranks,
         goto err;
     }
     /* bulk-exec always uses unbuffered reads for performance */
-    c->flags = flags | FLUX_SUBPROCESS_FLAGS_LOCAL_UNBUF;
+    c->flags = flags;
     return (c);
 err:
     exec_cmd_destroy (c);

While the test is hung, I can still access the broker:

ƒ(s=1,d=0) garlick@system76-pc:/proc/205157$ flux jobs
       JOBID USER     NAME       ST NTASKS NNODES     TIME INFO
    ƒ2w8vgHm garlick  stress      R      1      1   3.162m system76-pc
ƒ(s=1,d=0) garlick@system76-pc:/proc/205157$ flux job eventlog -p exec $(flux job last)
1717123824.263584 init
1717123824.264089 starting
1717123824.336121 shell.init service="5588-shell-f2w8vgHm" leader-rank=0 size=1
1717123824.337742 shell.start taskmap={"version":1,"map":[[0,1,1,1]]}
1717123824.397937 shell.task-exit localid=0 rank=0 state="Exited" pid=211181 wait_status=256 signaled=0 exitcode=1

Hmm, we're missing the complete event.

garlick commented 4 months ago

I'm trying to run down what's going on here. One thing I've noticed is that, in the hanging test, the watcher for the stdout/stderr channels for the sdexec process gets called again after flux_watcher_stop() has been called on it. How that's possible I do not know!

https://github.com/flux-framework/flux-core/blob/master/src/common/libsdexec/channel.c#L72

As a result, the shell receives two EOF callbacks on each of stdout and stderr instead of one. If I take extra steps to suppress action after the watcher has been stopped, the test passes, so this may be the root cause of the test failure, although I don't quite understand how it would prevent the subprocess from cleaning up, or why it only happens with UNBUF.
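
For reference, the suppression hack looks roughly like this (a minimal sketch; struct channel, its fields, and the callback name are assumptions for illustration, not the actual libsdexec code):

    static void channel_data_cb (flux_reactor_t *r,
                                 flux_watcher_t *w,
                                 int revents,
                                 void *arg)
    {
        struct channel *ch = arg;

        /* flux_watcher_stop() was already called on this watcher at EOF.
         * If it fires anyway, log it and bail out rather than delivering
         * a second EOF callback upstream.
         */
        if (ch->stopped) {
            flux_log (ch->h, LOG_ERR,
                      "%s: XXX watcher called after stop!",
                      ch->name);
            return;
        }
        /* ... normal handling: read from the fd, forward data/EOF ... */
    }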

Here's the failing test with some debug added (for my own notes more than anything). The subprocess completion callback in bulk-exec was instrumented and it is NOT called; note, however, that sdexec did send an ENODATA response to the exec request.

expecting success: 
    test_expect_code_or_killed 1 flux run \
        $stress --timeout 60 --vm-keep --vm 1 --vm-bytes 200M

May 31 15:01:06.300022 sched-simple.debug[0]: req: f2wGLchV: spec={0,1,1} duration=0.0
May 31 15:01:06.300088 sched-simple.debug[0]: alloc: f2wGLchV: rank0/core0
May 31 15:01:06.302007 sdexec.debug[0]: watch shell-0-f2wGLchV.service
May 31 15:01:06.302020 sdexec.debug[0]: start shell-0-f2wGLchV.service
May 31 15:01:06.323383 sdexec.debug[0]: shell-0-f2wGLchV.service: unknown.unknown
May 31 15:01:06.323454 sdexec.debug[0]: shell-0-f2wGLchV.service: inactive.dead
May 31 15:01:06.323582 sdexec.debug[0]: shell-0-f2wGLchV.service: activating.start
May 31 15:01:06.324205 sdexec.debug[0]: shell-0-f2wGLchV.service: active.running
May 31 15:01:06.391603 sdexec.debug[0]: shell-0-f2wGLchV.service: deactivating.unknown
stress: info: [1413809] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
0.114s: flux-shell[0]: ERROR: oom: Memory cgroup out of memory: killed 1 task.
May 31 15:01:06.403594 job-exec.debug[0]: XXX 0x7817f00160e0 stderr: '' [EOF]
May 31 15:01:06.403609 job-exec.debug[0]: XXX 0x7817f00160e0 stdout: '' [EOF]
May 31 15:01:06.403721 sdexec.debug[0]: shell-0-f2wGLchV.service: deactivating.unknown
May 31 15:01:06.403781 sdexec.debug[0]: shell-0-f2wGLchV.service: failed.failed
May 31 15:01:06.403786 sdexec.debug[0]: reset-failed shell-0-f2wGLchV.service
May 31 15:01:06.403808 sdexec.err[0]: stderr: XXX watcher called after stop!
May 31 15:01:06.403818 sdexec.err[0]: stdout: XXX watcher called after stop!
May 31 15:01:06.403850 job-exec.debug[0]: XXX 0x7817f00160e0 stderr: '' [EOF]
May 31 15:01:06.403860 job-exec.debug[0]: XXX 0x7817f00160e0 stdout: '' [EOF]
May 31 15:01:06.404176 sdexec.debug[0]: shell-0-f2wGLchV.service: inactive.dead
May 31 15:01:06.404186 sdexec.err[0]: XXX responding to sdexec.exec with No data available: (null)
May 31 15:01:06.404224 sdexec.debug[0]: unwatch shell-0-f2wGLchV.service
garlick commented 4 months ago

OK, sorry for all the noise thinking out loud. It did help me get some clarity. It turns out that sdexec is restarting the fd watcher to cover an error case where it might not have been started earlier. If I prevent that, then subprocess is happy. I'll submit a PR to address this.
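
For the record, the shape of the fix is roughly this (sketch only; the started flag is a hypothetical name, and the actual PR may differ):

    /* In the error-recovery path, only start the fd watcher if it was
     * never started; do not unconditionally (re)start a watcher that was
     * already stopped at EOF, or the channel callback will fire again.
     */
    if (!ch->started) {
        flux_watcher_start (ch->watcher);
        ch->started = true;
    }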

Maybe we should also run down why subprocess isn't completing, as this is yet another case of a server implementation doing something that's not explicitly forbidden by the RFC and seems like it should be harmless. Thoughts @chu11?

chu11 commented 4 months ago

I think I know what happened. Since EOF is delivered twice, this unbuffered-path block runs twice:

    /* N.B. any data not consumed by the user is lost, so if eof is
     * seen, we send it immediately */
    if (eof) {
        c->read_eof_received = true;
        c->unbuf_data = NULL;
        c->unbuf_len = 0;

        c->output_cb (c->p, c->name);

        c->eof_sent_to_caller = true;
        c->p->channels_eof_sent++;
    }

channels_eof_sent is incremented twice, leading to bad comparison checks later on.
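
To illustrate (a hypothetical check, not the exact libsubprocess code): completion logic that compares the EOF count against the number of channels never matches once the count overshoots.

    /* If channels_eof_sent is bumped twice for the same channel, it
     * overshoots channel_count and this equality never becomes true,
     * so the subprocess never reaches its completion callback.
     */
    static bool subprocess_io_complete (struct subprocess *p)
    {
        return (p->channels_eof_sent == p->channel_count);
    }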

In the buffered case, by contrast,

    if (!fbuf_bytes (c->read_buffer)
        && c->read_eof_received
        && !c->eof_sent_to_caller) {
        c->output_cb (c->p, c->name);
        c->eof_sent_to_caller = true;
        c->p->channels_eof_sent++;
    }

we have the eof_sent_to_caller protection flag guarding against a duplicate EOF callback. Lemme fix this up.
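
Presumably the fix is just to mirror that guard in the unbuffered path, something like (sketch; the actual patch may differ):

    /* ignore a duplicate EOF from the server rather than delivering
     * a second EOF callback and double-counting channels_eof_sent */
    if (eof && !c->eof_sent_to_caller) {
        c->read_eof_received = true;
        c->unbuf_data = NULL;
        c->unbuf_len = 0;

        c->output_cb (c->p, c->name);

        c->eof_sent_to_caller = true;
        c->p->channels_eof_sent++;
    }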