flux-framework / flux-pmix

flux shell plugin to bootstrap openmpi v5+
GNU Lesser General Public License v3.0
el8 segfault in mca_pmdl_ompi5.so #65

Closed by garlick 1 year ago

garlick commented 1 year ago

Problem: many flux-pmix tests fail with an immediate segfault on the EL8 builder. This includes tests like t0002-basic.t test 2, which just runs a non-PMIx, non-MPI program with the pmix.so shell plugin loaded.

This is with Open MPI (ompi) v5.0.0rc2 + OpenPMIx (openpmix) v4.1.1rc2.

(gdb) bt full
#0  0x00007f982f2df9a6 in register_nspace () from /usr/lib64/pmix/mca_pmdl_ompi5.so
No symbol table info available.
#1  0x00007f9834520060 in pmix_pmdl_base_register_nspace (nptr=0x7f9828001010) at base/pmdl_base_stubs.c:174
        active = 0x1ec3d50
        rc = -1366
#2  0x00007f98343dd431 in _register_nspace (sd=-1, args=4, cbdata=0x1ee5720) at server/pmix_server.c:1064
        cd = 0x1ee5720
        nptr = 0x7f9828001010
        tmp = 0x7f98347ea810 <pmix_globals+1456>
        rc = 0
        i = 11
        m = 6
        ninfo = 0
        iptr = 0x0
        all_def = false
        trk = 0x8
        ns = 0x0
        tcd = 0x0
        gds = 0x0
        kv = 0x7f9838362ac9 <read+89>
        proc = {
          nspace = '\000' <repeats 16 times>, "\001\000\000\000\377\377\377\377\360M\356\001\000\000\000\000\000\000\000\000\377\377\377\377\340M\356\001", '\000' <repeats 100 times>, "\340t\006\067\230\177\000\000\340\251\006\067\230\177\000\000", '\377' <repeats 16 times>, '\000' <repeats 33 times>..., rank = 628303104}
        __PRETTY_FUNCTION__ = "_register_nspace"
#3  0x00007f9833d0ea05 in event_process_active_single_queue () from /lib64/libevent_core-2.1.so.6
No symbol table info available.
#4  0x00007f9833d0f3ef in event_base_loop () from /lib64/libevent_core-2.1.so.6
No symbol table info available.
#5  0x00007f983442e12d in progress_engine (obj=0x1e40ed0) at runtime/pmix_progress_threads.c:227
        t = 0x1e40ed0
        trk = 0x1e40de0
#6  0x00007f98383591cf in start_thread () from /lib64/libpthread.so.0
No symbol table info available.
#7  0x00007f9836cdfe73 in clone () from /lib64/libc.so.6
No symbol table info available.