I'm trying to run down what's going on here. One thing I've noticed is that, in the hanging test, the watcher for the sdexec process's stdout/stderr channels gets called again after flux_watcher_stop() has been called on it. How that's possible I do not know!
https://github.com/flux-framework/flux-core/blob/master/src/common/libsdexec/channel.c#L72
As a result, the shell receives two eof callbacks on each of stdout and stderr instead of one each. If I take extra steps to suppress action after the watcher has been stopped, the test passes, so this may be the root cause of the test failure, although I don't quite understand how it would prevent the subprocess from cleaning up, or why it only happens with UNBUF.
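For my own notes, the suppression hack was along these lines. This is a minimal sketch: the stopped flag is something added just for the experiment, and the other field names are illustrative, not the exact contents of struct channel in channel.c.

/* Experimental guard: ignore fd watcher callbacks that arrive after
 * flux_watcher_stop().  'stopped' is a hypothetical flag set wherever
 * the channel's watcher is stopped. */
static void channel_output_cb (flux_reactor_t *r,
                               flux_watcher_t *w,
                               int revents,
                               void *arg)
{
    struct channel *ch = arg;

    if (ch->stopped) {
        flux_log (ch->h, LOG_ERR,
                  "%s: XXX watcher called after stop!", ch->name);
        return; /* suppress all action after stop */
    }
    /* ...normal read/EOF handling follows... */
}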
Here's the failing test with some debug added (for my own notes more than anything). The subprocess completion callback in bulk-exec was instrumented and it is NOT called; note, however, that sdexec sent an ENODATA response to the exec request.
expecting success:
test_expect_code_or_killed 1 flux run \
$stress --timeout 60 --vm-keep --vm 1 --vm-bytes 200M
May 31 15:01:06.300022 sched-simple.debug[0]: req: f2wGLchV: spec={0,1,1} duration=0.0
May 31 15:01:06.300088 sched-simple.debug[0]: alloc: f2wGLchV: rank0/core0
May 31 15:01:06.302007 sdexec.debug[0]: watch shell-0-f2wGLchV.service
May 31 15:01:06.302020 sdexec.debug[0]: start shell-0-f2wGLchV.service
May 31 15:01:06.323383 sdexec.debug[0]: shell-0-f2wGLchV.service: unknown.unknown
May 31 15:01:06.323454 sdexec.debug[0]: shell-0-f2wGLchV.service: inactive.dead
May 31 15:01:06.323582 sdexec.debug[0]: shell-0-f2wGLchV.service: activating.start
May 31 15:01:06.324205 sdexec.debug[0]: shell-0-f2wGLchV.service: active.running
May 31 15:01:06.391603 sdexec.debug[0]: shell-0-f2wGLchV.service: deactivating.unknown
stress: info: [1413809] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
0.114s: flux-shell[0]: ERROR: oom: Memory cgroup out of memory: killed 1 task.
May 31 15:01:06.403594 job-exec.debug[0]: XXX 0x7817f00160e0 stderr: '' [EOF]
May 31 15:01:06.403609 job-exec.debug[0]: XXX 0x7817f00160e0 stdout: '' [EOF]
May 31 15:01:06.403721 sdexec.debug[0]: shell-0-f2wGLchV.service: deactivating.unknown
May 31 15:01:06.403781 sdexec.debug[0]: shell-0-f2wGLchV.service: failed.failed
May 31 15:01:06.403786 sdexec.debug[0]: reset-failed shell-0-f2wGLchV.service
May 31 15:01:06.403808 sdexec.err[0]: stderr: XXX watcher called after stop!
May 31 15:01:06.403818 sdexec.err[0]: stdout: XXX watcher called after stop!
May 31 15:01:06.403850 job-exec.debug[0]: XXX 0x7817f00160e0 stderr: '' [EOF]
May 31 15:01:06.403860 job-exec.debug[0]: XXX 0x7817f00160e0 stdout: '' [EOF]
May 31 15:01:06.404176 sdexec.debug[0]: shell-0-f2wGLchV.service: inactive.dead
May 31 15:01:06.404186 sdexec.err[0]: XXX responding to sdexec.exec with No data available: (null)
May 31 15:01:06.404224 sdexec.debug[0]: unwatch shell-0-f2wGLchV.service
OK, sorry for all the noise thinking out loud. It did help me get some clarity. It turns out that sdexec restarts the fd watcher to cover an error case where it might not have been started earlier. If I prevent that, then subprocess is happy. I'll submit a PR to address this.
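The prevention is roughly this shape (a minimal sketch of the error path, not the exact patch; the eof_reached and fd_watcher names are illustrative):

/* Error path: the fd watcher was previously (re)started here
 * unconditionally, to cover the case where it had never been started.
 * Only start it if the channel hasn't already delivered EOF and had
 * its watcher stopped. */
if (!ch->eof_reached)
    flux_watcher_start (ch->fd_watcher);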
Maybe we should also run down why subprocess isn't completing, as it is yet another case of a server implementation doing something that's not explicitly forbidden in the RFC and seems like it should be harmless. Thoughts @chu11?
I think I know what happened. When EOF is sent twice, this unbuffered-path code runs twice:
/* N.B. any data not consumed by the user is lost, so if eof is
 * seen, we send it immediately */
if (eof) {
    c->read_eof_received = true;
    c->unbuf_data = NULL;
    c->unbuf_len = 0;
    c->output_cb (c->p, c->name);
    c->eof_sent_to_caller = true;
    c->p->channels_eof_sent++;
}
so channels_eof_sent is incremented twice, leading to bad comparison checks later on.
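To see why the over-count matters, here's a hypothetical illustration (not the exact libsubprocess code) of the kind of comparison that breaks:

/* Hypothetical completion check: if stdout and stderr each deliver
 * EOF twice, channels_eof_sent reaches 4 while channel_count is 2,
 * so an equality test never fires and the completion callback never
 * runs. */
if (p->channels_eof_sent == p->channel_count)
    subprocess_complete (p);    /* hypothetical helper */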
In the buffered case, by contrast:
if (!fbuf_bytes (c->read_buffer)
    && c->read_eof_received
    && !c->eof_sent_to_caller) {
    c->output_cb (c->p, c->name);
    c->eof_sent_to_caller = true;
    c->p->channels_eof_sent++;
}
we have the eof_sent_to_caller protection flag, so a second EOF is ignored. Lemme fix this up.
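A minimal sketch of the fix, applying the same guard to the unbuffered path (not necessarily the exact patch):

/* Guard against delivering EOF to the caller more than once, as the
 * buffered path already does. */
if (eof && !c->eof_sent_to_caller) {
    c->read_eof_received = true;
    c->unbuf_data = NULL;
    c->unbuf_len = 0;
    c->output_cb (c->p, c->name);
    c->eof_sent_to_caller = true;
    c->p->channels_eof_sent++;
}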
Problem: after #5975 was merged, I noticed that t2410-sdexec-memlimit.t hangs. Reverting the change to job-exec fixes it.
In the hung test, I can access the broker.
Hmm, we're missing the complete event.
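(For reference, one way to see that from inside the hung test, assuming the usual KVS layout where the exec eventlog lives at guest.exec.eventlog and normally gets a complete event when execution finishes:)

flux job eventlog -p guest.exec.eventlog $(flux job last)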