flux-framework / flux-pmix

flux shell plugin to bootstrap openmpi v5+
GNU Lesser General Public License v3.0

TOSS 4 baseline openmpi install segfaults with -opmi=pmix on el cap #99

Closed by garlick 6 months ago

garlick commented 6 months ago

Problem: an openmpi hello world causes the flux shell to segfault when run with -opmi=pmix on el cap, even with a single task.

Launch script:

#!/bin/bash
module purge
module use /opt/toss/modules/modulefiles
module load openmpi-gnu
flux run \
    --env=HWLOC_COMPONENTS=-rsmi \
    -opmi=pmix \
    --env=OMPI_MCA_btl=^openib \
    "$@"

Software versions:

[garlick@elcap217:mpi-test]$ rpm -q pmix flux-pmix flux-core
pmix-3.2.5-0.t4.x86_64
flux-pmix-0.4.0-1.t4.x86_64
flux-core-0.58.0-2.t4.x86_64
[garlick@elcap217:mpi-test]$ ml

Currently Loaded Modules:
  1) openmpi-gnu/4.1

[garlick@elcap217:mpi-test]$ which mpicc
/opt/toss/openmpi/4.1/gnu/bin/mpicc

And the failure:

[garlick@elcap217:mpi-test]$ ../doit ./hello
flux-job: task(s) Segmentation fault

@grondo got this stack trace:

(gdb) where
#0  0x0000155553245775 in __strncmp_avx2 () from /lib64/libc.so.6
#1  0x000015554043b392 in dohash () from /usr/lib64/pmix/mca_gds_hash.so
#2  0x000015554043c2bd in hash_fetch () from /usr/lib64/pmix/mca_gds_hash.so
#3  0x000015554064b286 in _store_job_info (ds_ctx=ds_ctx@entry=0x7484f0, 
    ns_map=ns_map@entry=0x7486e8, proc=proc@entry=0x15553f6115f0)
    at dstore_base.c:2720
#4  0x0000155540654476 in pmix_common_dstor_register_job_info (
    ds_ctx=0x7484f0, pr=0x1555380169e0, reply=<optimized out>)
    at dstore_base.c:2892
#5  0x00001555438ffb31 in server_switchyard () from /lib64/libpmix.so.2
#6  0x0000155543900c26 in pmix_server_message_handler ()
   from /lib64/libpmix.so.2
#7  0x00001555439834b5 in pmix_ptl_base_process_msg () from /lib64/libpmix.so.2
#8  0x000015554364cd85 in event_process_active_single_queue ()
   from /lib64/libevent-2.1.so.6
#9  0x000015554364d787 in event_base_loop () from /lib64/libevent-2.1.so.6
#10 0x000015554391e6a6 in progress_engine () from /lib64/libpmix.so.2
#11 0x000015555465f1ca in start_thread () from /lib64/libpthread.so.0
#12 0x00001555531b7e73 in clone () from /lib64/libc.so.6

This is run in a flux alloc -N16 allocation in the "rabbit" queue.

Note that running a non-MPI program such as /bin/true under -opmi=pmix does not segfault.
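To make the /bin/true versus ./hello comparison repeatable, the launch script above could be wrapped as follows. This is only a sketch: the function names run_pmix and compare and the FLUX override variable are hypothetical, not part of flux or of the original script. The flux options are copied unchanged from the launch script.

```shell
#!/bin/bash
# Sketch only: run_pmix, compare, and the FLUX override are hypothetical names.
# FLUX lets the launcher be swapped out (e.g. for a stub); it defaults to flux.
FLUX=${FLUX:-flux}

# Run one command with the same options as the launch script in this issue.
run_pmix() {
    "$FLUX" run \
        --env=HWLOC_COMPONENTS=-rsmi \
        -opmi=pmix \
        --env=OMPI_MCA_btl=^openib \
        "$@"
}

# Run each given command under -opmi=pmix and print its exit status.
compare() {
    local cmd rc
    for cmd in "$@"; do
        rc=0
        run_pmix "$cmd" || rc=$?
        echo "$cmd: exit status $rc"
    done
}
```

After the module setup from the launch script, calling `compare /bin/true ./hello` inside the allocation would show whether only the MPI program fails.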

garlick commented 6 months ago

The exact test used to reproduce this yesterday is apparently working today.

We don't know what changed. Closing for now.