Open chu11 opened 2 weeks ago
See also this libsubprocess unit test
It runs cat
as a subprocess and tries to ship data to and from it.
bool iostress_run_check (flux_t *h,
const char *name,
bool direct,
int stdin_bufsize,
int stdout_bufsize,
int batchcount,
int batchlines,
size_t linesize)
In that invocation,
iostress_run_check (h, "tinystdout", false, 0, 128, 1, 1, 256)
stdin_bufsize
is the default size but stdout_bufsize
is 128 bytes. Then we send a 256 character line. The result is deadlock.
So as for my initial issue
* By default, stdio and channels use an internal buffer of 4 megs.
so that explains the hang at > 4 megs.
I think I figured it out. If I'm right, it's so dumb.
By default subprocesses are line buffered. If we fill up the buffer and there's no newline, no output callbacks will ever happen. Yadda yadda yadda, completion callback won't get triggered.
I'm not really sure the cleanest / best way to deal with it. Will take some thought.
Maybe send data on buffer full even though it won't have a newline?
Edit: don't forget about the unit test referenced above. It's a simple reproducer for this.
Maybe send data on buffer full even though it won't have a newline?
That was my first thought. It would maybe be fine for a lot of users b/c they probably just send the buffer to printf("%s", buf);
but could result in some weird corner cases. Wish we could inform the user that this happened though.
Edit: don't forget about the unit test referenced above. It's a simple reproducer for this.
Yup, that's how I figured this out :-)
IMHO that's likely the proper fix.
FWIW I used that logic in sdexec output buffering. There didn't seem to be any other possible way to go.
https://github.com/flux-framework/flux-core/blob/master/src/common/libsdexec/channel.c#L93
Ugh, when implementing what I think we should implement hit
# LOG: remote.c:460 remote_output_buffered(): Error buffering 128 bytes received from remote subprocess pid 3014323 stdout: No space left on device
The issue is the output cb is not called before a second attempt to put data into the buffer is done. Thus hitting this "we're out of space" error.
This is my debugging.
# LOG: remote.c:432 remote_output_buffered(): output buffered stdout 128
# LOG: remote.c:471 remote_output_buffered(): turn on prep check
# LOG: remote.c:179 remote_out_prep_cb(): prep stdout
# LOG: remote.c:432 remote_output_buffered(): output buffered stdout 128
so read 128 bytes into buffer, turn on prep/check, but for this "round" of the EV the check is skipped. prep turns on idle watcher, then a second 128 bytes is read and we hit the error.
Perhaps an error is the best? Even if we do call the output cb, there's no guarantee the user will read data out of it. Will ponder a bit more.
I would think flushing the buffer immediately would be the answer.
yeah, i'm thinking an "emergency" call to the output callback for the user can be done. They get one attempt to get data out of the buffer. If they don't, we hit the error shown above.
Also requires that flux_subprocess_read_line()
(and friends) also can return a non-line. I really dislike this, thus wishing there was some type of &bool_is_this_really_a_line
in and out parameter. But this error is probably going to be rare enough, we can just live with.
Isn't it the same situation if EOF is reached without a line terminator? User gets an unterminated line?
Seems OK to me.
While working on something else, I noticed
t/shell/lptest 4000000 1000
worked as a job under flux butt/shell/lptest 5000000 1000
did not. The latter would just hang and produce no output.Only later did I realize I messed up. I should have done
t/shell/lptest 1000 4000000
(the first number is the length of a line, the second is the number of lines). So I was inadvertently creating > 4 meg lines.The > 4 meg line probably crossed some internal buffer / boundary somewhere, leading to problems. But no error messages here showing up.