flux-framework / flux-core

core services for the Flux resource management framework
GNU Lesser General Public License v3.0
159 stars 49 forks source link

job-exec: valgrind error and hang running simple job #6057

Closed grondo closed 4 days ago

grondo commented 4 days ago

I was running something under valgrind and noticed that a simple job now triggers a valgrind error. The job then hangs:

$ flux run -N4 hostname
==1627779== Thread 17 job-exec:
==1627779== Invalid read of size 1
==1627779==    at 0x1F0FE81E: output_cb (exec.c:142)
==1627779==    by 0x1F0FF5E4: exec_output_cb (bulk-exec.c:234)
==1627779==    by 0x1F108465: remote_output_local_unbuf (remote.c:401)
==1627779==    by 0x1F108465: rexec_continuation (remote.c:513)
==1627779==    by 0x4079872: ev_invoke_pending (ev.c:3770)
==1627779==    by 0x407D144: ev_run (ev.c:4190)
==1627779==    by 0x407D144: ev_run (ev.c:4021)
==1627779==    by 0x40521BA: flux_reactor_run (reactor.c:124)
==1627779==    by 0x1F0FC192: mod_main (job-exec.c:1709)
==1627779==    by 0x410B0B: module_thread (module.c:221)
==1627779==    by 0x50681C9: start_thread (in /usr/lib64/libpthread-2.28.so)
==1627779==    by 0x625AE72: clone (in /usr/lib64/libc-2.28.so)
==1627779==  Address 0xa7f9a76 is 0 bytes after a block of size 6 alloc'd
==1627779==    at 0x4C38185: malloc (vg_replace_malloc.c:431)
==1627779==    by 0x1F11A45C: iodecode (ioencode.c:178)
==1627779==    by 0x1F10A4CA: subprocess_rexec_get (client.c:160)
==1627779==    by 0x1F108118: rexec_continuation (remote.c:479)
==1627779==    by 0x4079872: ev_invoke_pending (ev.c:3770)
==1627779==    by 0x407D144: ev_run (ev.c:4190)
==1627779==    by 0x407D144: ev_run (ev.c:4021)
==1627779==    by 0x40521BA: flux_reactor_run (reactor.c:124)
==1627779==    by 0x1F0FC192: mod_main (job-exec.c:1709)
==1627779==    by 0x410B0B: module_thread (module.c:221)
==1627779==    by 0x50681C9: start_thread (in /usr/lib64/libpthread-2.28.so)
==1627779==    by 0x625AE72: clone (in /usr/lib64/libc-2.28.so)
==1627779== 

Perhaps this is related to the recent libsubprocess work? @chu11 are you available to take a look?

chu11 commented 4 days ago

The offending line was

if (streq (data, "enter\n")

in exec.c.. it took awhile and some chat w/ @grondo to finally make sense of why this appeared. The key is that when job-exec began using the libsubprocess UNBUF flag, data returned from libsubprocess is now NOT NUL-terminated. So that streq() is comparing against a string without guaranteed NUL termination. So badness can arise.

this hack fixed things

diff --git a/src/modules/job-exec/exec.c b/src/modules/job-exec/exec.c
index a806c5ca8..53f38e430 100644
--- a/src/modules/job-exec/exec.c
+++ b/src/modules/job-exec/exec.c
@@ -139,7 +139,8 @@ static void output_cb (struct bulk_exec *exec,
     const char *cmd = flux_cmd_arg (flux_subprocess_get_cmd (p), 0);

     if (streq (stream, "stdout")) {
-        if (streq (data, "enter\n")
+        if (len == 6
+            && strncmp (data, "enter\n", 6) == 0
             && exec_barrier_enter (exec) < 0) {
             jobinfo_fatal_error (job,
                                  errno,

Should audit to see if there are other "barrier stuff" that does similar comparisons.

Should also add a -N2 test to the t5000 valgrind test to cover this path.