flux-framework / flux-pmix

flux shell plugin to bootstrap openmpi v5+
GNU Lesser General Public License v3.0

t0006-notify.t fails with segfault in shell plugin #91

Closed garlick closed 11 months ago

garlick commented 11 months ago

Problem: The first test in t0006-notify.t fails with a segfault. The gdb backtrace below seems to indicate a use-after-free of the message in notify_shell_cb().

$ ./t0006-notify.t -v
expecting success: 
    run_timeout 30 flux run \
        ${NOTIFY} --status=69 2>warn.err &&
    grep event-status=69 warn.err

not ok 1 - 1n1p event notify triggers warning on stderr
#   
#       run_timeout 30 flux run \
#           ${NOTIFY} --status=69 2>warn.err &&
#       grep event-status=69 warn.err
#   

expecting success: 
    run_timeout 30 flux run \
        ${NOTIFY} --status=69 --message="lorem ipsum" 2>message.err &&
    grep "lorem ipsum" message.err

0.235s: flux-shell[0]:  WARN: pmix: notify source=f2swcE4f.0 event-status=69 lorem ipsum
ok 2 - 1n1p event notify with message works

# failed 1 among 2 test(s)
1..2
Oct 03 18:09:41.842845 broker.err[0]: rc2.0: sh ./t0006-notify.t  --verbose Exited (rc=1) 2.9s
flux-start: 0 (pid 2273874) exited with rc=1

$ cat trash*/warn.err
flux-job: task(s) Segmentation fault
$

gdb backtrace snippet:

#6  0x000000558935f508 in flux_shell_log (component=0x20033a4498 "pmix", level=4, file=0x20033a4488 "notify.c", line=86, 
    fmt=<optimized out>) at log.c:201
        buf = "notify source=f2Hjxo1H.0 event-status=69 \000\000\000\000\000\000\000\030\354\301\356\177\000\000\000\000\353\301\356\177\000\000\000\000\300\060\000 \000\000\000\360%\016\241U\000\000\000`)\027\241U\000\000\000\060\353\301\356\177\000\000\000\250\363\065\211U\000\000\000\000 ;\211U\000\000\000\350,;\211U", '\000' <repeats 11 times>, "\030\354\301\356\177\000\000\000\000>:\003 \000\000\000h\000\000\000\000\000\000\000\000\353\301\356\177\000\000\000\000\300\060\000 \000\000\000\001", '\000' <repeats 15 times>, "\020>:\003 \000\000\000\000\036m\217\354B\375\240"...
        ap = {__stack = 0x7feec1fae0, __gr_top = 0x7feec1fae0, __vr_top = 0x7feec1fac0, __gr_offs = -24, __vr_offs = -128}
#7  0x00000020033a213c in notify_shell_cb () from /nfshome/garlick/proj/flux-pmix/src/shell/plugins/.libs/pmix.so
No symbol table info available.
#8  0x000000200339ff24 in interthread_recv () from /nfshome/garlick/proj/flux-pmix/src/shell/plugins/.libs/pmix.so
No symbol table info available.
#9  0x0000002000337814 in ev_invoke_pending (loop=0x200038d440 <default_loop_struct>) at ev.c:3770
        p = <optimized out>

Note: this is on the working branch for #90.
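
For illustration only: the source of notify_shell_cb() is not shown in this report, so the following is a minimal sketch of one common way to avoid this class of bug, not the actual flux-pmix code or its eventual fix. It assumes the notification is handed from one thread to another and that the optional message may be absent (as in the failing test, which passes no --message). The names `struct notify_event`, `notify_event_create()`, `notify_event_destroy()`, and `notify_event_format()` are hypothetical. The idea is that the event owns copies of its strings, so the sender can free its buffers right after hand-off, and a missing message is formatted as an empty string rather than dereferenced.

```c
#define _POSIX_C_SOURCE 200809L /* for strdup() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a notification passed between threads.
 * This is NOT the real flux-pmix data structure. */
struct notify_event {
    char *source;   /* owned copy, or NULL */
    int status;
    char *message;  /* owned copy, or NULL if no message was supplied */
};

/* Release an event and the strings it owns.  Safe to call with NULL. */
void notify_event_destroy (struct notify_event *ev)
{
    if (ev) {
        free (ev->source);
        free (ev->message);
        free (ev);
    }
}

/* Create an event holding private copies of 'source' and 'message'.
 * Either string may be NULL.  Returns NULL on allocation failure. */
struct notify_event *notify_event_create (const char *source,
                                          int status,
                                          const char *message)
{
    struct notify_event *ev = calloc (1, sizeof (*ev));
    if (!ev)
        return NULL;
    ev->status = status;
    if (source && !(ev->source = strdup (source)))
        goto error;
    if (message && !(ev->message = strdup (message)))
        goto error;
    return ev;
error:
    notify_event_destroy (ev);
    return NULL;
}

/* Format the event in the same shape as the warning seen in the test
 * output.  A missing source or message is replaced with a placeholder
 * instead of being passed to %s as NULL.  Returns snprintf()'s result. */
int notify_event_format (const struct notify_event *ev,
                         char *buf,
                         size_t size)
{
    assert (ev != NULL && buf != NULL);
    return snprintf (buf,
                     size,
                     "notify source=%s event-status=%d %s",
                     ev->source ? ev->source : "unknown",
                     ev->status,
                     ev->message ? ev->message : "");
}
```

With this ownership rule, the receiving callback can format and log the event after the sender has released its own buffers, and then call `notify_event_destroy()` once it is done. Whether the real defect is a freed message, a NULL message, or something else would need to be confirmed against the plugin source with debug symbols enabled.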