flux-framework / flux-core

core services for the Flux resource management framework

MPI tests failing in Fedora 39 CI docker image #5694

Open grondo opened 10 months ago

grondo commented 10 months ago

I'm just setting up a Fedora 39 builder for CI, and everything works except for the MPI tests.

Details:

$ (. /etc/os-release && echo $PRETTY_NAME)
Fedora Linux 39 (Container Image)
$ rpm -q mpich gcc
mpich-4.1.2-3.fc39.x86_64
gcc-13.2.1-6.fc39.x86_64

By default, even a singleton MPI hello test fails:

$ ./mpi/hello 
asp:pid8699.hello: Failed to get docker0 (unit 0) cpu set
asp:pid8699.hello: Failed to get ens18 (unit 1) cpu set
asp:pid8699: PSM3 can't open nic unit: -1 (err=23)
Abort(874633103): Fatal error in internal_Init: Other MPI error, error stack:
internal_Init(66)........: MPI_Init(argc=0x7ffe559bc93c, argv=0x7ffe559bc930) failed
MPII_Init_thread(234)....: 
MPID_Init(513)...........: 
MPIDI_OFI_init_local(604): 
create_vni_context(982)..: OFI endpoint open failed (ofi_init.c:982:create_vni_context:Cannot allocate memory)

flux run also fails with the same error.
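
Before changing anything it may be worth confirming which libfabric providers the container actually exposes. The following is only a diagnostic sketch (fi_info and FI_LOG_LEVEL come from libfabric; output is omitted since it was not captured here):

$ fi_info -p psm3                                 # what the psm3 provider thinks it can offer
$ FI_LOG_LEVEL=debug ./mpi/hello 2>&1 | tail -20  # libfabric's own account of why psm3 fails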

Googling turned up https://github.com/open-mpi/ompi/issues/11295#issuecomment-1384539750, which suggests setting PSM3_DEVICES=self,shm and/or PSM3_HAL=loopback.

PSM3_DEVICES=self,shm, PSM3_DEVICES=self, and PSM3_DEVICES=shm all work for the hello test:

$ PSM3_DEVICES=self ./mpi/hello 
0: completed MPI_Init in 0.088s.  There are 1 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 0.001s
$ PSM3_DEVICES=shm ./mpi/hello 
0: completed MPI_Init in 0.091s.  There are 1 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 0.002s
$ PSM3_DEVICES=self,shm ./mpi/hello 
0: completed MPI_Init in 0.113s.  There are 1 tasks
0: completed first barrier in 0.000s
0: completed MPI_Finalize in 0.002s
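
Since the Flux-launched case fails the same way, the variable can simply be set in the submitting environment (a sketch, not run in this container; flux run exports the caller's environment to tasks by default):

$ PSM3_DEVICES=self,shm flux run -n 2 ./mpi/hello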

However, this is not sufficient to pass all the tests in the testsuite. For example:

expecting success: 
    run_timeout 30 flux run -n 2 -N2 $OPTS \
        ${FLUX_BUILD_DIR}/t/mpi/mpich_basic/sendrecv

0.056s: flux-shell[1]: DEBUG: Loading /usr/src/src/shell/initrc.lua
0.058s: flux-shell[1]: DEBUG: pmi-simple: simple wire protocol is enabled
0.058s: flux-shell[1]: DEBUG: cpu-affinity: disabling affinity due to cpu-affinity=off
0.058s: flux-shell[1]: DEBUG: 1: task 1 on cores 0
0.053s: flux-shell[0]: DEBUG: Loading /usr/src/src/shell/initrc.lua
0.054s: flux-shell[0]: DEBUG: pmi-simple: simple wire protocol is enabled
0.056s: flux-shell[0]: DEBUG: output: batch timeout = 0.500s
0.056s: flux-shell[0]: DEBUG: cpu-affinity: disabling affinity due to cpu-affinity=off
0.057s: flux-shell[0]: DEBUG: 0: task_count=2 slot_count=2 cores_per_slot=1 slots_per_node=1
0.057s: flux-shell[0]: DEBUG: 0: task 0 on cores 0
Simple Send/Recv test.
Simple Send/Recv test.
Rank 0: sending 100 bytes messages to process 1.
Rank 0: sending 100k bytes messages to process 1.
Rank 1: receiving messages from process 0.
Rank 1: received message 'Hello process one.'
asp:rank1.sendrecv: Reading from remote process' memory failed. Disabling CMA support
asp:rank1: Assertion failure at prov/psm3/psm3/ptl_am/ptl.c:184: nbytes == req->req_data.recv_msglen
3.603s: flux-shell[1]: DEBUG: task 1 complete status=1
3.614s: flux-shell[1]: DEBUG: exit 1

sendrecv:15221 terminated with signal 6 at PC=7f37f7ea1834 SP=7ffdc8b22b40.  Backtrace:
/lib64/libc.so.6(+0x90834)[0x7f37f7ea1834]
/lib64/libc.so.6(raise+0x1e)[0x7f37f7e4f8ee]
/lib64/libc.so.6(abort+0xdf)[0x7f37f7e378ff]
/lib64/libfabric.so.1(+0x44207)[0x7f37f7670207]
/lib64/libfabric.so.1(+0x5fc67b)[0x7f37f7c2867b]
/lib64/libfabric.so.1(+0x673b36)[0x7f37f7c9fb36]
/lib64/libfabric.so.1(+0x608728)[0x7f37f7c34728]
/lib64/libfabric.so.1(+0x61491f)[0x7f37f7c4091f]
/lib64/libfabric.so.1(+0x614ee3)[0x7f37f7c40ee3]
/lib64/libfabric.so.1(+0x5f9e43)[0x7f37f7c25e43]
/lib64/libfabric.so.1(+0x671460)[0x7f37f7c9d460]
/lib64/libfabric.so.1(+0x5d7633)[0x7f37f7c03633]
/usr/lib64/mpich/lib/libmpi.so.12(+0x41f1c6)[0x7f37f85a51c6]
/usr/lib64/mpich/lib/libmpi.so.12(+0x269fb8)[0x7f37f83effb8]
/usr/lib64/mpich/lib/libmpi.so.12(+0x45bfc6)[0x7f37f85e1fc6]
/usr/lib64/mpich/lib/libmpi.so.12(+0x46887e)[0x7f37f85ee87e]
/usr/lib64/mpich/lib/libmpi.so.12(MPI_Recv+0x2fe)[0x7f37f828505e]
/usr/src/t/mpi/mpich_basic/sendrecv[0x4014cd]
/lib64/libc.so.6(+0x2814a)[0x7f37f7e3914a]
/lib64/libc.so.6(__libc_start_main+0x8b)[0x7f37f7e3920b]
/usr/src/t/mpi/mpich_basic/sendrecv[0x4015d5]

However, using PSM3_HAL=loopback allows all tests to pass.

$ FLUX_TEST_MPI=t PSM3_HAL=loopback ./t3000-mpi-basic.t
Jan 19 18:55:22.499078 broker.err[0]: rc1.0: Setting fake resource.R={"version": 1, "execution": {"R_lite": [{"rank": "0-1", "children": {"core": "0-3"}}], "starttime": 0.0, "expiration": 0.0, "nodelist": ["asp,asp"]}}
ok 1 - show MPI version under test
ok 2 - mpi hello various sizes
ok 3 - mpi hello size=2 concurrent submit of 8 jobs
ok 4 - ANL self works
ok 5 - ANL simple works
ok 6 - ANL sendrecv works on 1 node
ok 7 - ANL sendrecv works on 2 nodes
ok 8 # skip ANL netpipe works (missing LONGTEST)
ok 9 - ANL patterns works 1 node
ok 10 - ANL patterns works 2 nodes
ok 11 # skip ANL adapt works (missing LONGTEST)
# passed all 11 test(s)
1..11
Jan 19 18:55:35.415844 broker.err[0]: cleanup.1: flux-cancel: Matched 0 jobs
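
If we keep the workaround rather than disabling the psm3 provider, one option (only a sketch; the exact CI entry point may differ) would be to export the variable before running the testsuite in the fedora39 builder:

$ export PSM3_HAL=loopback
$ make check     # or run individual sharness tests such as ./t3000-mpi-basic.t, as above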
garlick commented 10 months ago

I'm not sure if this is relevant, but on the Fedora package page for mpich I found this:

This build also include support for using the 'module environment' to select which MPI implementation to use when multiple implementations are installed. If you want MPICH support to be automatically loaded, you need to install the mpich-autoload package.

https://packages.fedoraproject.org/pkgs/mpich/mpich/
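
For completeness, that module environment is what puts the MPICH wrappers and libraries on PATH on Fedora. A sketch of what that looks like (assuming the standard environment-modules setup and the mpi/mpich-x86_64 modulefile shipped by the Fedora mpich package):

$ sudo dnf install -y mpich-autoload   # auto-loads the mpich module at login, or load it by hand:
$ . /etc/profile.d/modules.sh          # path may vary with the modules implementation
$ module load mpi/mpich-x86_64
$ which mpicc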

trws commented 3 months ago

I think we can reasonably close this now that we know this is caused by the upstream inclusion of the PSM3 device interacting badly with virtual network interfaces. Do you agree @grondo?