Closed grondo closed 4 days ago
The offending line was
if (streq (data, "enter\n")
in exec.c
It took a while, and some chat with @grondo, to finally make sense of why this appeared. The key is that since job-exec
began using the libsubprocess UNBUF flag, data returned from libsubprocess is no longer NUL-terminated. So that streq()
is comparing against a buffer without guaranteed NUL termination, and it can read past the end of the returned data.
This hack fixed things:
diff --git a/src/modules/job-exec/exec.c b/src/modules/job-exec/exec.c
index a806c5ca8..53f38e430 100644
--- a/src/modules/job-exec/exec.c
+++ b/src/modules/job-exec/exec.c
@@ -139,7 +139,8 @@ static void output_cb (struct bulk_exec *exec,
const char *cmd = flux_cmd_arg (flux_subprocess_get_cmd (p), 0);
if (streq (stream, "stdout")) {
- if (streq (data, "enter\n")
+ if (len == 6
+ && strncmp (data, "enter\n", 6) == 0
&& exec_barrier_enter (exec) < 0) {
jobinfo_fatal_error (job,
errno,
We should audit whether any other barrier-related code does similar comparisons.
We should also add a -N2 test to the t5000 valgrind test to cover this path.
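If the audit turns up more of these, one option might be a small length-aware comparison helper instead of open-coding `len == 6 && strncmp (...)` each time. A minimal sketch, assuming nothing like it already exists in the tree; the name `buf_streq` is hypothetical and not part of flux-core or libsubprocess:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an existing flux-core function):
 * return true if the len-byte buffer 'data', which need not be
 * NUL-terminated, holds exactly the bytes of the NUL-terminated
 * string 's'.
 */
static bool buf_streq (const char *data, size_t len, const char *s)
{
    size_t slen = strlen (s);

    /* Compare lengths first so memcmp() never reads beyond
     * the len bytes that the caller says are valid. */
    if (len != slen)
        return false;
    return slen == 0 || memcmp (data, s, slen) == 0;
}
```

With that, the check in the diff above would read `buf_streq (data, len, "enter\n")`, and the literal length no longer has to be kept in sync by hand.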
I was running something under valgrind and noticed that a simple job now triggers a valgrind error. The job then hangs:
Perhaps this is related to the recent libsubprocess work? @chu11 are you available to take a look?