Closed garlick closed 3 years ago
There are usually 3 fences in a successful mpi_hello. Two occur during MPI_Init()
, the first one with data, the second one not (a barrier). The third fence corresponds to the MPI_Barrier()
call in the hello program, e.g. for 2n2p (labeled "exchange 2"):
[snip]
5.111s: flux-shell[0]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"f3gD1p8o","rank":-2}],"info":[],"data":"CQIbBAkCAgA=","cbfunc":140669182951833,"cbdata":140669037411392}
5.111s: flux-shell[0]: TRACE: starting pmix exchange 2: size 0
5.111s: flux-shell[0]: TRACE: completed pmix exchange 2: size 0 SUCCESS
5.114s: flux-shell[0]: TRACE: 0: C: pmi EOF
5.114s: flux-shell[0]: DEBUG: task 0 complete status=0
5.111s: flux-shell[1]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"f3gD1p8o","rank":-2}],"info":[],"data":"CQIbBAkCAgA=","cbfunc":139765176152473,"cbdata":139765013904448}
5.111s: flux-shell[1]: TRACE: starting pmix exchange 2: size 0
5.112s: flux-shell[1]: TRACE: completed pmix exchange 2: size 0 SUCCESS
5.114s: flux-shell[1]: TRACE: 1: C: pmi EOF
5.114s: flux-shell[1]: DEBUG: task 1 complete status=0
5.122s: flux-shell[0]: DEBUG: exit 0
5.123s: flux-shell[1]: DEBUG: exit 0
In the 2n3p failing case, the upcalls never come for MPI_Barrier()
.
Running with --env=PMIX_MCA_pmix_client_fence_verbose=100
:
expecting success:
run_timeout 60 flux mini run -N2 -n3 \
-ouserrc=$(pwd)/rc.lua \
-overbose=2 \
-ompi=openmpi@5 \
--env=PMIX_MCA_pmix_client_fence_verbose=100 \
${MPI_HELLO}
0.022s: flux-shell[1]: DEBUG: 1: tasks [2] on cores 0-1
0.022s: flux-shell[1]: DEBUG: Loading /opt/flux-core-0.29.0-69/etc/flux/shell/initrc.lua
0.060s: flux-shell[1]: DEBUG: pmix: local_peers = 2
0.060s: flux-shell[1]: DEBUG: pmix: node_map = system76-pc0,system76-pc1
0.060s: flux-shell[1]: DEBUG: pmix: proc_map = 0,1;2
0.081s: flux-shell[1]: TRACE: shell barrier complete
0.081s: flux-shell[1]: TRACE: exited barrier with rc = 0
0.084s: flux-shell[1]: TRACE: shell barrier complete
0.084s: flux-shell[1]: TRACE: exited barrier with rc = 0
0.020s: flux-shell[0]: DEBUG: 0: task_count=3 slot_count=4 cores_per_slot=1 slots_per_node=2
0.020s: flux-shell[0]: DEBUG: 0: tasks [0-1] on cores 0-1
0.020s: flux-shell[0]: DEBUG: Loading /opt/flux-core-0.29.0-69/etc/flux/shell/initrc.lua
0.023s: flux-shell[0]: DEBUG: output: batch timeout = 0.500s
0.046s: flux-shell[0]: DEBUG: pmix: server outsourced to OpenPMIx 4.1.0rc2
0.080s: flux-shell[0]: DEBUG: pmix: local_peers = 0,1
0.080s: flux-shell[0]: DEBUG: pmix: node_map = system76-pc0,system76-pc1
0.080s: flux-shell[0]: DEBUG: pmix: proc_map = 0,1;2
0.081s: flux-shell[0]: TRACE: shell barrier complete
0.081s: flux-shell[0]: TRACE: exited barrier with rc = 0
0.084s: flux-shell[0]: TRACE: shell barrier complete
0.084s: flux-shell[0]: TRACE: exited barrier with rc = 0
0.099s: flux-shell[1]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"f619NSo9","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"CQIbXgkCAgEJAhtWCQIoAgkCHBhidGwudGNwLjUuMAAbQAoAAg0AAAAAAAAAAAAAAAACAAAAGAAAAOgDAAAEAQAArBEAAQAAAAAAAAAAAAAAAAQAAAAQAAAAECcAAAQBAAA=","cbfunc":139689513144729,"cbdata":139689382210496}
0.099s: flux-shell[1]: TRACE: starting pmix exchange 0: size 98
0.123s: flux-shell[0]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"f619NSo9","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"CQIbzwEJAgIBCQIbbQkCKAAJAhwYYnRsLnRjcC41LjAAG0AKAAINAAAAAAAAAAAAAAAAAgAAABgAAADoAwAABAIAAKwRAAEAAAAAAAAAAAAAAAAEAAAAEAAAABAnAAAEAgAACQIcGnBtbC5iYXNlLjIuMAAbBG9iMQAJAhtWCQIoAQkCHBhidGwudGNwLjUuMAAbQAoAAg0AAAAAAAAAAAAAAAACAAAAGAAAAOgDAAAEAwAArBEAAQAAAAAAAAAAAAAAAAQAAAAQAAAAECcAAAQDAAA=","cbfunc":140288944800153,"cbdata":140288798583904}
0.123s: flux-shell[0]: TRACE: completed pmix exchange 0: size 310 SUCCESS
0.123s: flux-shell[0]: TRACE: starting pmix exchange 0: size 212
0.125s: flux-shell[0]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"f619NSo9","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":false}}],"data":"CQIbBAkCAgA=","cbfunc":140288944800153,"cbdata":140288798583904}
0.125s: flux-shell[0]: TRACE: completed pmix exchange 1: size 0 SUCCESS
0.125s: flux-shell[0]: TRACE: starting pmix exchange 1: size 0
0.124s: flux-shell[1]: TRACE: completed pmix exchange 0: size 310 SUCCESS
0.124s: flux-shell[1]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"f619NSo9","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":false}}],"data":"CQIbBAkCAgA=","cbfunc":139689513144729,"cbdata":139689382210496}
0.124s: flux-shell[1]: TRACE: starting pmix exchange 1: size 0
0.125s: flux-shell[1]: TRACE: completed pmix exchange 1: size 0 SUCCESS
[system76-pc:2876007] pmix: fence_nb called
[system76-pc:2876008] pmix: fence_nb called
[system76-pc:2876009] pmix: fence_nb called
[system76-pc:2876008] pmix: fence_nb callback recvd
[system76-pc:2876009] pmix: fence_nb callback recvd
[system76-pc:2876008] client:unpack fence called
[system76-pc:2876009] client:unpack fence called
[system76-pc:2876008] client:unpack fence received status 0
[system76-pc:2876009] client:unpack fence received status 0
[system76-pc:2876007] pmix: fence_nb callback recvd
[system76-pc:2876007] client:unpack fence called
[system76-pc:2876007] client:unpack fence received status 0
[system76-pc:2876007] pmix: fence_nb called
[system76-pc:2876008] pmix: fence_nb called
[system76-pc:2876009] pmix: fence_nb called
[system76-pc:2876008] pmix: fence_nb callback recvd
[system76-pc:2876009] pmix: fence_nb callback recvd
[system76-pc:2876008] client:unpack fence called
[system76-pc:2876009] client:unpack fence called
[system76-pc:2876008] client:unpack fence received status 0
[system76-pc:2876009] client:unpack fence received status 0
f619NSo9: completed MPI_Init in 0.041s. There are 3 tasks
[system76-pc:2876007] pmix: fence_nb callback recvd
[system76-pc:2876007] client:unpack fence called
[system76-pc:2876007] client:unpack fence received status 0
<sleep 5 inserted in test, after MPI_Init()>
[system76-pc:2876009] pmix: fence_nb called
<hang until 60s test timeout>
not ok 7 - 2n3p ompi hello doesnt hang
so it would appear that only one ompi proc makes the PMIx_Fence()
call, and that did not result in an upcall.
I should add: the one proc that makes the PMIx_Fence()
call is rank 1:
fXPkH51: rank 1 is pid 3856230
fXPkH51: completed MPI_Init in 0.021s. There are 3 tasks
[system76-pc:3856228] client:unpack fence called
fXPkH51: rank 0 is pid 3856229
[system76-pc:3856228] client:unpack fence received status 0
fXPkH51: rank 2 is pid 3856228
[system76-pc:3856230] pmix: fence_nb called
So the openpmix server on shell 0 is correct in not making the upcall - it is waiting for rank 0 to make the call too.
Still no clue why ompi is not calling PMIx_Fence()
on all ranks when they are all calling into MPI_Barrier()
.
Here's an attempt with some debugging enabled in ompi's BTL (block transfer layer) stack which seems interesting
expecting success:
run_timeout 60 flux mini run -N2 -n3 \
-ouserrc=$(pwd)/rc.lua \
-overbose=2 \
-ompi=openmpi@5 \
--env=OMPI_MCA_btl_base_verbose=100 \
${MPI_HELLO}
0.024s: flux-shell[1]: DEBUG: 1: tasks [2] on cores 0-1
0.024s: flux-shell[1]: DEBUG: Loading /opt/flux-core-0.29.0-69/etc/flux/shell/initrc.lua
0.033s: flux-shell[1]: DEBUG: pmix: jobid = 20099104768
0.033s: flux-shell[1]: DEBUG: pmix: shell_rank = 1
0.033s: flux-shell[1]: DEBUG: pmix: local_nprocs = 1
0.033s: flux-shell[1]: DEBUG: pmix: total_nprocs = 3
0.041s: flux-shell[1]: DEBUG: pmix: local_peers = 2
0.041s: flux-shell[1]: DEBUG: pmix: node_map = system76-pc0,system76-pc1
0.041s: flux-shell[1]: DEBUG: pmix: proc_map = 0,1;2
0.043s: flux-shell[1]: TRACE: shell barrier complete
0.043s: flux-shell[1]: TRACE: exited barrier with rc = 0
0.046s: flux-shell[1]: TRACE: shell barrier complete
0.046s: flux-shell[1]: TRACE: exited barrier with rc = 0
0.021s: flux-shell[0]: DEBUG: 0: task_count=3 slot_count=4 cores_per_slot=1 slots_per_node=2
0.021s: flux-shell[0]: DEBUG: 0: tasks [0-1] on cores 0-1
0.021s: flux-shell[0]: DEBUG: Loading /opt/flux-core-0.29.0-69/etc/flux/shell/initrc.lua
0.024s: flux-shell[0]: DEBUG: output: batch timeout = 0.500s
0.027s: flux-shell[0]: DEBUG: pmix: jobid = 20099104768
0.027s: flux-shell[0]: DEBUG: pmix: shell_rank = 0
0.027s: flux-shell[0]: DEBUG: pmix: local_nprocs = 2
0.027s: flux-shell[0]: DEBUG: pmix: total_nprocs = 3
0.027s: flux-shell[0]: DEBUG: pmix: server outsourced to OpenPMIx 4.1.1rc1
0.037s: flux-shell[0]: DEBUG: pmix: local_peers = 0,1
0.037s: flux-shell[0]: DEBUG: pmix: node_map = system76-pc0,system76-pc1
0.037s: flux-shell[0]: DEBUG: pmix: proc_map = 0,1;2
0.043s: flux-shell[0]: TRACE: shell barrier complete
0.043s: flux-shell[0]: TRACE: exited barrier with rc = 0
0.046s: flux-shell[0]: TRACE: shell barrier complete
0.046s: flux-shell[0]: TRACE: exited barrier with rc = 0
0.062s: flux-shell[1]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fXd6Ac7","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"CQIbXgkCAgEJAhtWCQIoAgkCHBhidGwudGNwLjUuMAAbQAoAAg0AAAAAAAAAAAAAAAACAAAAGAAAAOgDAAAEAQAArBEAAQAAAAAAAAAAAAAAAAQAAAAQAAAAECcAAAQBAAA=","cbfunc":139628907432199,"cbdata":139628782899312}
0.062s: flux-shell[1]: TRACE: starting pmix exchange 0: size 98
0.065s: flux-shell[1]: TRACE: completed pmix exchange 0: size 310 SUCCESS
0.066s: flux-shell[1]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fXd6Ac7","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":false}}],"data":"CQIbBAkCAgA=","cbfunc":139628907432199,"cbdata":139628782899312}
0.066s: flux-shell[1]: TRACE: starting pmix exchange 1: size 0
0.066s: flux-shell[1]: TRACE: completed pmix exchange 1: size 0 SUCCESS
0.065s: flux-shell[0]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fXd6Ac7","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"CQIbzwEJAgIBCQIbbQkCKAAJAhwYYnRsLnRjcC41LjAAG0AKAAINAAAAAAAAAAAAAAAAAgAAABgAAADoAwAABAMAAKwRAAEAAAAAAAAAAAAAAAAEAAAAEAAAABAnAAAEAwAACQIcGnBtbC5iYXNlLjIuMAAbBG9iMQAJAhtWCQIoAQkCHBhidGwudGNwLjUuMAAbQAoAAg0AAAAAAAAAAAAAAAACAAAAGAAAAOgDAAAEBAAArBEAAQAAAAAAAAAAAAAAAAQAAAAQAAAAECcAAAQEAAA=","cbfunc":140537829677319,"cbdata":140537705360448}
0.065s: flux-shell[0]: TRACE: completed pmix exchange 0: size 310 SUCCESS
0.065s: flux-shell[0]: TRACE: starting pmix exchange 0: size 212
0.066s: flux-shell[0]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fXd6Ac7","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":false}}],"data":"CQIbBAkCAgA=","cbfunc":140537829677319,"cbdata":140537705360448}
0.066s: flux-shell[0]: TRACE: completed pmix exchange 1: size 0 SUCCESS
0.066s: flux-shell[0]: TRACE: starting pmix exchange 1: size 0
[system76-pc:3874864] mca: base: components_register: registering framework btl components
[system76-pc:3874865] mca: base: components_register: registering framework btl components
[system76-pc:3874864] mca: base: components_register: found loaded component self
[system76-pc:3874865] mca: base: components_register: found loaded component self
[system76-pc:3874864] mca: base: components_register: component self register function successful
[system76-pc:3874865] mca: base: components_register: component self register function successful
[system76-pc:3874864] mca: base: components_register: found loaded component sm
[system76-pc:3874865] mca: base: components_register: found loaded component sm
[system76-pc:3874864] mca: base: components_register: component sm register function successful
[system76-pc:3874865] mca: base: components_register: component sm register function successful
[system76-pc:3874864] mca: base: components_register: found loaded component tcp
[system76-pc:3874865] mca: base: components_register: found loaded component tcp
[system76-pc:3874864] mca: base: components_register: component tcp register function successful
[system76-pc:3874865] mca: base: components_register: component tcp register function successful
[system76-pc:3874864] mca: base: components_open: opening btl components
[system76-pc:3874865] mca: base: components_open: opening btl components
[system76-pc:3874864] mca: base: components_open: found loaded component self
[system76-pc:3874865] mca: base: components_open: found loaded component self
[system76-pc:3874864] mca: base: components_open: component self open function successful
[system76-pc:3874865] mca: base: components_open: component self open function successful
[system76-pc:3874864] mca: base: components_open: found loaded component sm
[system76-pc:3874865] mca: base: components_open: found loaded component sm
[system76-pc:3874864] mca: base: components_open: component sm open function successful
[system76-pc:3874865] mca: base: components_open: component sm open function successful
[system76-pc:3874864] mca: base: components_open: found loaded component tcp
[system76-pc:3874865] mca: base: components_open: found loaded component tcp
[system76-pc:3874864] mca: base: components_open: component tcp open function successful
[system76-pc:3874865] mca: base: components_open: component tcp open function successful
[system76-pc:3874864] select: initializing btl component self
[system76-pc:3874865] select: initializing btl component self
[system76-pc:3874864] select: init of component self returned success
[system76-pc:3874865] select: init of component self returned success
[system76-pc:3874864] select: initializing btl component sm
[system76-pc:3874865] select: initializing btl component sm
[system76-pc1][[5524,0],2][btl_sm_component.c:512:mca_btl_sm_component_init] No peers to communicate with. Disabling sm.
[system76-pc:3874865] select: init of component sm returned success
[system76-pc:3874864] select: init of component sm returned failure
[system76-pc:3874865] select: initializing btl component tcp
[system76-pc:3874864] mca: base: close: component sm closed
[system76-pc:3874865] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[system76-pc:3874864] mca: base: close: unloading component sm
[system76-pc:3874865] btl: tcp: Found match: 127.0.0.1 (lo)
[system76-pc:3874864] select: initializing btl component tcp
[system76-pc:3874865] btl:tcp: 0x55b4a038ad90: if enp4s0 kidx 2 cnt 0 addr 10.0.2.13 IPv4 bw 1000 lt 100
[system76-pc:3874864] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[system76-pc:3874865] btl:tcp: 0x55b4a038b920: if docker0 kidx 4 cnt 0 addr 172.17.0.1 IPv4 bw 10000 lt 100
[system76-pc:3874864] btl: tcp: Found match: 127.0.0.1 (lo)
[system76-pc:3874865] btl:tcp: Attempting to bind to AF_INET port 1024
[system76-pc:3874864] btl:tcp: 0x55e8271564f0: if enp4s0 kidx 2 cnt 0 addr 10.0.2.13 IPv4 bw 1000 lt 100
[system76-pc:3874865] btl:tcp: Attempting to bind to AF_INET port 1025
[system76-pc:3874864] btl:tcp: 0x55e827157080: if docker0 kidx 4 cnt 0 addr 172.17.0.1 IPv4 bw 10000 lt 100
[system76-pc:3874865] btl:tcp: Attempting to bind to AF_INET port 1026
[system76-pc:3874864] btl:tcp: Attempting to bind to AF_INET port 1024
[system76-pc:3874865] btl:tcp: Attempting to bind to AF_INET port 1027
[system76-pc:3874864] btl:tcp: Attempting to bind to AF_INET port 1025
[system76-pc:3874865] btl:tcp: Successfully bound to AF_INET port 1027
[system76-pc:3874864] btl:tcp: Successfully bound to AF_INET port 1025
[system76-pc:3874865] btl:tcp: my listening v4 socket is 0.0.0.0:1027
[system76-pc:3874864] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[system76-pc:3874865] btl: tcp: exchange: 0 2 IPv4 10.0.2.13
[system76-pc:3874864] btl: tcp: exchange: 0 2 IPv4 10.0.2.13
[system76-pc:3874865] btl: tcp: exchange: 1 4 IPv4 172.17.0.1
[system76-pc:3874864] btl: tcp: exchange: 1 4 IPv4 172.17.0.1
[system76-pc:3874865] select: init of component tcp returned success
[system76-pc:3874864] select: init of component tcp returned success
[system76-pc:3874866] mca: base: components_register: registering framework btl components
[system76-pc:3874866] mca: base: components_register: found loaded component self
[system76-pc:3874866] mca: base: components_register: component self register function successful
[system76-pc:3874866] mca: base: components_register: found loaded component sm
[system76-pc:3874866] mca: base: components_register: component sm register function successful
[system76-pc:3874866] mca: base: components_register: found loaded component tcp
[system76-pc:3874866] mca: base: components_register: component tcp register function successful
[system76-pc:3874866] mca: base: components_open: opening btl components
[system76-pc:3874866] mca: base: components_open: found loaded component self
[system76-pc:3874866] mca: base: components_open: component self open function successful
[system76-pc:3874866] mca: base: components_open: found loaded component sm
[system76-pc:3874866] mca: base: components_open: component sm open function successful
[system76-pc:3874866] mca: base: components_open: found loaded component tcp
[system76-pc:3874866] mca: base: components_open: component tcp open function successful
[system76-pc:3874866] select: initializing btl component self
[system76-pc:3874866] select: init of component self returned success
[system76-pc:3874866] select: initializing btl component sm
[system76-pc:3874866] select: init of component sm returned success
[system76-pc:3874866] select: initializing btl component tcp
[system76-pc:3874866] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[system76-pc:3874866] btl: tcp: Found match: 127.0.0.1 (lo)
[system76-pc:3874866] btl:tcp: 0x56085a2a2d90: if enp4s0 kidx 2 cnt 0 addr 10.0.2.13 IPv4 bw 1000 lt 100
[system76-pc:3874866] btl:tcp: 0x56085a2a3920: if docker0 kidx 4 cnt 0 addr 172.17.0.1 IPv4 bw 10000 lt 100
[system76-pc:3874866] btl:tcp: Attempting to bind to AF_INET port 1024
[system76-pc:3874866] btl:tcp: Attempting to bind to AF_INET port 1025
[system76-pc:3874866] btl:tcp: Attempting to bind to AF_INET port 1026
[system76-pc:3874866] btl:tcp: Attempting to bind to AF_INET port 1027
[system76-pc:3874866] btl:tcp: Attempting to bind to AF_INET port 1028
[system76-pc:3874866] btl:tcp: Successfully bound to AF_INET port 1028
[system76-pc:3874866] btl:tcp: my listening v4 socket is 0.0.0.0:1028
[system76-pc:3874866] btl: tcp: exchange: 0 2 IPv4 10.0.2.13
[system76-pc:3874866] btl: tcp: exchange: 1 4 IPv4 172.17.0.1
[system76-pc:3874866] select: init of component tcp returned success
[system76-pc:3874865] mca: bml: Using self btl for send to [[5524,0],0] on node system76-pc0
[system76-pc:3874866] mca: bml: Using self btl for send to [[5524,0],1] on node system76-pc0
[system76-pc:3874864] mca: bml: Using self btl for send to [[5524,0],2] on node system76-pc1
[system76-pc:3874865] mca: bml: Using sm btl for send to [[5524,0],1] on node system76-pc0
[system76-pc:3874866] mca: bml: Using sm btl for send to [[5524,0],0] on node system76-pc0
fXd6Ac7: rank 1 is pid 3874866
fXd6Ac7: completed MPI_Init in 0.020s. There are 3 tasks
fXd6Ac7: rank 0 is pid 3874865
fXd6Ac7: rank 2 is pid 3874864
[system76-pc0][[5524,0],0][btl_tcp_proc.c:554:mca_btl_tcp_proc_lookup] adding tcp proc for peer {[[5524,0],2]}
[system76-pc:3874864] mca: bml: Using tcp btl for send to [[5524,0],0] on node system76-pc0
[system76-pc:3874864] mca: bml: Using tcp btl for send to [[5524,0],0] on node system76-pc0
[system76-pc:3874864] btl: tcp: attempting to connect() to [[5524,0],0] address 10.0.2.13 on port 1027
[system76-pc:3874865] btl:tcp: now connected to 10.0.2.13, process [[5524,0],2]
[system76-pc:3874864] btl:tcp: would block, so allowing background progress
[system76-pc:3874864] btl:tcp: connect() to 10.0.2.13:1027 completed (complete_connect), sending connect ACK
[system76-pc:3874865] mca: bml: Using tcp btl for send to [[5524,0],2] on node system76-pc1
[system76-pc:3874865] mca: bml: Using tcp btl for send to [[5524,0],2] on node system76-pc1
[system76-pc:3874865] btl: tcp: attempting to connect() to [[5524,0],2] address 172.17.0.1 on port 1025
[system76-pc:3874865] btl:tcp: would block, so allowing background progress
[system76-pc:3874865] btl:tcp: connect() to 172.17.0.1:1025 completed (complete_connect), sending connect ACK
[system76-pc:3874865] remote peer unexpectedly closed connection while I was waiting for a blocking message
[system76-pc:3874865] CLOSE FD 14 at btl_tcp_endpoint.c:565
[system76-pc:3874864] btl: tcp: Match incoming connection from [[5524,0],0] 10.0.2.13 with locally known IP 172.17.0.1 failed (iface 1/2)!
[system76-pc:3874864] btl:tcp: now connected to 10.0.2.13, process [[5524,0],0]
[system76-pc:3874864] CLOSE FD 14 at btl_tcp_endpoint.c:495
not ok 7 - 2n3p ompi hello doesnt hang
So the three ranks are listening on:
rank 0: [system76-pc:3874865] btl:tcp: my listening v4 socket is 0.0.0.0:1027
rank 1: [system76-pc:3874866] btl:tcp: my listening v4 socket is 0.0.0.0:1028
rank 2: [system76-pc:3874864] btl:tcp: my listening v4 socket is 0.0.0.0:1025
It seems like connections were established over tcp between rank 0 (leader for shell 0) and rank 2 (leader for shell 1).
rank 2 to rank 0: [system76-pc:3874864] btl:tcp: now connected to 10.0.2.13, process [[5524,0],0]
rank 0 to rank 2: [system76-pc:3874865] btl:tcp: now connected to 10.0.2.13, process [[5524,0],2]
Initially I thought maybe the closed connections on 172.17.0.1
(docker0) might be a problem, but looking through the logs it just seems like those connections failed and were discarded in favor of the 10.0.2.13
(enp4s0) ones.
Er, except it works if I exclude docker0
. Hmm.
expecting success:
run_timeout 60 flux mini run -N2 -n3 \
-ouserrc=$(pwd)/rc.lua \
-overbose=2 \
-ompi=openmpi@5 \
--env=OMPI_MCA_btl_base_verbose=100 \
--env=OMPI_MCA_btl_tcp_if_exclude=docker0 \
${MPI_HELLO}
0.020s: flux-shell[0]: DEBUG: 0: task_count=3 slot_count=4 cores_per_slot=1 slots_per_node=2
0.020s: flux-shell[0]: DEBUG: 0: tasks [0-1] on cores 0-1
0.021s: flux-shell[0]: DEBUG: Loading /opt/flux-core-0.29.0-69/etc/flux/shell/initrc.lua
0.024s: flux-shell[0]: DEBUG: output: batch timeout = 0.500s
0.027s: flux-shell[0]: DEBUG: pmix: jobid = 19058917376
0.027s: flux-shell[0]: DEBUG: pmix: shell_rank = 0
0.027s: flux-shell[0]: DEBUG: pmix: local_nprocs = 2
0.027s: flux-shell[0]: DEBUG: pmix: total_nprocs = 3
0.027s: flux-shell[0]: DEBUG: pmix: server outsourced to OpenPMIx 4.1.1rc1
0.038s: flux-shell[0]: DEBUG: pmix: local_peers = 0,1
0.038s: flux-shell[0]: DEBUG: pmix: node_map = system76-pc0,system76-pc1
0.038s: flux-shell[0]: DEBUG: pmix: proc_map = 0,1;2
0.058s: flux-shell[0]: TRACE: shell barrier complete
0.058s: flux-shell[0]: TRACE: exited barrier with rc = 0
0.062s: flux-shell[0]: TRACE: shell barrier complete
0.062s: flux-shell[0]: TRACE: exited barrier with rc = 0
0.023s: flux-shell[1]: DEBUG: 1: tasks [2] on cores 0-1
0.024s: flux-shell[1]: DEBUG: Loading /opt/flux-core-0.29.0-69/etc/flux/shell/initrc.lua
0.048s: flux-shell[1]: DEBUG: pmix: jobid = 19058917376
0.048s: flux-shell[1]: DEBUG: pmix: shell_rank = 1
0.048s: flux-shell[1]: DEBUG: pmix: local_nprocs = 1
0.048s: flux-shell[1]: DEBUG: pmix: total_nprocs = 3
0.056s: flux-shell[1]: DEBUG: pmix: local_peers = 2
0.056s: flux-shell[1]: DEBUG: pmix: node_map = system76-pc0,system76-pc1
0.056s: flux-shell[1]: DEBUG: pmix: proc_map = 0,1;2
0.058s: flux-shell[1]: TRACE: shell barrier complete
0.058s: flux-shell[1]: TRACE: exited barrier with rc = 0
0.065s: flux-shell[1]: TRACE: shell barrier complete
0.066s: flux-shell[1]: TRACE: exited barrier with rc = 0
0.077s: flux-shell[1]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fW3Aw8j","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"CQIbXgkCAgEJAhtWCQIoAgkCHBhidGwudGNwLjUuMAAbQH8AAAEAAAAAAAAAAAAAAAABAAAACAAAAGQAAAAEAAAACgACDQAAAAAAAAAAAAAAAAIAAAAYAAAA6AMAAAQAAAA=","cbfunc":140014027995399,"cbdata":140013853560944}
0.078s: flux-shell[1]: TRACE: starting pmix exchange 0: size 98
0.081s: flux-shell[1]: TRACE: completed pmix exchange 0: size 310 SUCCESS
0.081s: flux-shell[1]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fW3Aw8j","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":false}}],"data":"CQIbBAkCAgA=","cbfunc":140014027995399,"cbdata":140013853560944}
0.081s: flux-shell[1]: TRACE: starting pmix exchange 1: size 0
0.082s: flux-shell[1]: TRACE: completed pmix exchange 1: size 0 SUCCESS
0.080s: flux-shell[0]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fW3Aw8j","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"CQIbzwEJAgIBCQIbVgkCKAEJAhwYYnRsLnRjcC41LjAAG0B/AAABAAAAAAAAAAAAAAAAAQAAAAgAAABkAAAABAIAAAoAAg0AAAAAAAAAAAAAAAACAAAAGAAAAOgDAAAEAgAACQIbbQkCKAAJAhwYYnRsLnRjcC41LjAAG0B/AAABAAAAAAAAAAAAAAAAAQAAAAgAAABkAAAABAMAAAoAAg0AAAAAAAAAAAAAAAACAAAAGAAAAOgDAAAEAwAACQIcGnBtbC5iYXNlLjIuMAAbBG9iMQA=","cbfunc":139657351440647,"cbdata":139657237064768}
0.080s: flux-shell[0]: TRACE: completed pmix exchange 0: size 310 SUCCESS
0.081s: flux-shell[0]: TRACE: starting pmix exchange 0: size 212
0.081s: flux-shell[0]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fW3Aw8j","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":false}}],"data":"CQIbBAkCAgA=","cbfunc":139657351440647,"cbdata":139657237064768}
0.082s: flux-shell[0]: TRACE: completed pmix exchange 1: size 0 SUCCESS
0.082s: flux-shell[0]: TRACE: starting pmix exchange 1: size 0
0.090s: flux-shell[0]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fW3Aw8j","rank":-2}],"info":[],"data":"CQIbBAkCAgA=","cbfunc":139657351440647,"cbdata":139657237064768}
0.090s: flux-shell[0]: TRACE: starting pmix exchange 2: size 0
0.090s: flux-shell[0]: TRACE: completed pmix exchange 2: size 0 SUCCESS
[system76-pc:3884805] mca: base: components_register: registering framework btl components
[system76-pc:3884805] mca: base: components_register: found loaded component self
[system76-pc:3884805] mca: base: components_register: component self register function successful
[system76-pc:3884805] mca: base: components_register: found loaded component sm
[system76-pc:3884805] mca: base: components_register: component sm register function successful
[system76-pc:3884805] mca: base: components_register: found loaded component tcp
[system76-pc:3884805] mca: base: components_register: component tcp register function successful
[system76-pc:3884805] mca: base: components_open: opening btl components
[system76-pc:3884805] mca: base: components_open: found loaded component self
[system76-pc:3884805] mca: base: components_open: component self open function successful
[system76-pc:3884805] mca: base: components_open: found loaded component sm
[system76-pc:3884805] mca: base: components_open: component sm open function successful
[system76-pc:3884805] mca: base: components_open: found loaded component tcp
[system76-pc:3884805] mca: base: components_open: component tcp open function successful
[system76-pc:3884805] select: initializing btl component self
[system76-pc:3884805] select: init of component self returned success
[system76-pc:3884805] select: initializing btl component sm
[system76-pc1][[10287,0],2][btl_sm_component.c:512:mca_btl_sm_component_init] No peers to communicate with. Disabling sm.
[system76-pc:3884805] select: init of component sm returned failure
[system76-pc:3884805] mca: base: close: component sm closed
[system76-pc:3884805] mca: base: close: unloading component sm
[system76-pc:3884805] select: initializing btl component tcp
[system76-pc:3884805] btl:tcp: 0x55f5e20c0840: if lo kidx 1 cnt 0 addr 127.0.0.1 IPv4 bw 100 lt 100
[system76-pc:3884805] btl:tcp: 0x55f5e20c1350: if enp4s0 kidx 2 cnt 0 addr 10.0.2.13 IPv4 bw 1000 lt 100
[system76-pc:3884805] btl:tcp: Attempting to bind to AF_INET port 1024
[system76-pc:3884805] btl:tcp: Successfully bound to AF_INET port 1024
[system76-pc:3884805] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[system76-pc:3884805] btl: tcp: exchange: 0 1 IPv4 127.0.0.1
[system76-pc:3884805] btl: tcp: exchange: 1 2 IPv4 10.0.2.13
[system76-pc:3884805] select: init of component tcp returned success
[system76-pc:3884807] mca: base: components_register: registering framework btl components
[system76-pc:3884807] mca: base: components_register: found loaded component self
[system76-pc:3884807] mca: base: components_register: component self register function successful
[system76-pc:3884807] mca: base: components_register: found loaded component sm
[system76-pc:3884807] mca: base: components_register: component sm register function successful
[system76-pc:3884807] mca: base: components_register: found loaded component tcp
[system76-pc:3884807] mca: base: components_register: component tcp register function successful
[system76-pc:3884807] mca: base: components_open: opening btl components
[system76-pc:3884807] mca: base: components_open: found loaded component self
[system76-pc:3884807] mca: base: components_open: component self open function successful
[system76-pc:3884807] mca: base: components_open: found loaded component sm
[system76-pc:3884807] mca: base: components_open: component sm open function successful
[system76-pc:3884807] mca: base: components_open: found loaded component tcp
[system76-pc:3884807] mca: base: components_open: component tcp open function successful
[system76-pc:3884807] select: initializing btl component self
[system76-pc:3884806] mca: base: components_register: registering framework btl components
[system76-pc:3884807] select: init of component self returned success
[system76-pc:3884806] mca: base: components_register: found loaded component self
[system76-pc:3884807] select: initializing btl component sm
[system76-pc:3884806] mca: base: components_register: component self register function successful
[system76-pc:3884806] mca: base: components_register: found loaded component sm
[system76-pc:3884807] select: init of component sm returned success
[system76-pc:3884806] mca: base: components_register: component sm register function successful
[system76-pc:3884807] select: initializing btl component tcp
[system76-pc:3884806] mca: base: components_register: found loaded component tcp
[system76-pc:3884806] mca: base: components_register: component tcp register function successful
[system76-pc:3884807] btl:tcp: 0x5630284dc130: if lo kidx 1 cnt 0 addr 127.0.0.1 IPv4 bw 100 lt 100
[system76-pc:3884806] mca: base: components_open: opening btl components
[system76-pc:3884807] btl:tcp: 0x5630284dcc40: if enp4s0 kidx 2 cnt 0 addr 10.0.2.13 IPv4 bw 1000 lt 100
[system76-pc:3884806] mca: base: components_open: found loaded component self
[system76-pc:3884807] btl:tcp: Attempting to bind to AF_INET port 1024
[system76-pc:3884806] mca: base: components_open: component self open function successful
[system76-pc:3884807] btl:tcp: Attempting to bind to AF_INET port 1025
[system76-pc:3884806] mca: base: components_open: found loaded component sm
[system76-pc:3884807] btl:tcp: Attempting to bind to AF_INET port 1026
[system76-pc:3884806] mca: base: components_open: component sm open function successful
[system76-pc:3884807] btl:tcp: Successfully bound to AF_INET port 1026
[system76-pc:3884806] mca: base: components_open: found loaded component tcp
[system76-pc:3884807] btl:tcp: my listening v4 socket is 0.0.0.0:1026
[system76-pc:3884806] mca: base: components_open: component tcp open function successful
[system76-pc:3884807] btl: tcp: exchange: 0 1 IPv4 127.0.0.1
[system76-pc:3884806] select: initializing btl component self
[system76-pc:3884807] btl: tcp: exchange: 1 2 IPv4 10.0.2.13
[system76-pc:3884806] select: init of component self returned success
[system76-pc:3884807] select: init of component tcp returned success
[system76-pc:3884806] select: initializing btl component sm
[system76-pc:3884806] select: init of component sm returned success
[system76-pc:3884806] select: initializing btl component tcp
[system76-pc:3884806] btl:tcp: 0x56018b55c0f0: if lo kidx 1 cnt 0 addr 127.0.0.1 IPv4 bw 100 lt 100
[system76-pc:3884806] btl:tcp: 0x56018b55cc00: if enp4s0 kidx 2 cnt 0 addr 10.0.2.13 IPv4 bw 1000 lt 100
[system76-pc:3884806] btl:tcp: Attempting to bind to AF_INET port 1024
[system76-pc:3884806] btl:tcp: Attempting to bind to AF_INET port 1025
[system76-pc:3884806] btl:tcp: Attempting to bind to AF_INET port 1026
[system76-pc:3884806] btl:tcp: Attempting to bind to AF_INET port 1027
[system76-pc:3884806] btl:tcp: Successfully bound to AF_INET port 1027
[system76-pc:3884806] btl:tcp: my listening v4 socket is 0.0.0.0:1027
[system76-pc:3884806] btl: tcp: exchange: 0 1 IPv4 127.0.0.1
[system76-pc:3884806] btl: tcp: exchange: 1 2 IPv4 10.0.2.13
[system76-pc:3884806] select: init of component tcp returned success
[system76-pc:3884806] mca: bml: Using self btl for send to [[10287,0],0] on node system76-pc0
[system76-pc:3884807] mca: bml: Using self btl for send to [[10287,0],1] on node system76-pc0
[system76-pc:3884807] mca: bml: Using sm btl for send to [[10287,0],0] on node system76-pc0
[system76-pc:3884806] mca: bml: Using sm btl for send to [[10287,0],1] on node system76-pc0
[system76-pc:3884805] mca: bml: Using self btl for send to [[10287,0],2] on node system76-pc1
fW3Aw8j: rank 1 is pid 3884807
fW3Aw8j: completed MPI_Init in 0.020s. There are 3 tasks
fW3Aw8j: rank 0 is pid 3884806
fW3Aw8j: rank 2 is pid 3884805
[system76-pc0][[10287,0],0][btl_tcp_proc.c:554:mca_btl_tcp_proc_lookup] adding tcp proc for peer {[[10287,0],2]}
[system76-pc:3884805] mca: bml: Using tcp btl for send to [[10287,0],0] on node system76-pc0
[system76-pc:3884805] mca: bml: Using tcp btl for send to [[10287,0],0] on node system76-pc0
[system76-pc:3884805] btl: tcp: attempting to connect() to [[10287,0],0] address 127.0.0.1 on port 1027
[system76-pc:3884805] btl:tcp: would block, so allowing background progress
[system76-pc:3884805] btl:tcp: connect() to 127.0.0.1:1027 completed (complete_connect), sending connect ACK
[system76-pc:3884806] btl:tcp: now connected to 127.0.0.1, process [[10287,0],2]
[system76-pc:3884806] mca: bml: Using tcp btl for send to [[10287,0],2] on node system76-pc1
[system76-pc:3884806] mca: bml: Using tcp btl for send to [[10287,0],2] on node system76-pc1
[system76-pc:3884806] btl: tcp: attempting to connect() to [[10287,0],2] address 10.0.2.13 on port 1024
[system76-pc:3884806] btl:tcp: would block, so allowing background progress
[system76-pc:3884806] btl:tcp: connect() to 10.0.2.13:1024 completed (complete_connect), sending connect ACK
fW3Aw8j: completed first barrier in 0.008s
[system76-pc:3884805] btl: tcp: Match incoming connection from [[10287,0],0] 10.0.2.13 with locally known IP 127.0.0.1 failed (iface 0/2)!
[system76-pc:3884805] btl:tcp: now connected to 10.0.2.13, process [[10287,0],0]
[system76-pc:3884806] CLOSE FD 14 at btl_tcp_endpoint.c:565
[system76-pc:3884806] CLOSE FD 13 at btl_tcp_endpoint.c:565
[system76-pc:3884807] mca: base: close: component self closed
[system76-pc:3884807] mca: base: close: unloading component self
[system76-pc:3884807] mca: base: close: component sm closed
[system76-pc:3884807] mca: base: close: unloading component sm
[system76-pc:3884806] mca: base: close: component self closed
[system76-pc:3884805] CLOSE FD 14 at btl_tcp_endpoint.c:565
[system76-pc:3884807] CLOSE FD 12 at btl_tcp_component.c:452
[system76-pc:3884806] mca: base: close: unloading component self
[system76-pc:3884805] CLOSE FD 13 at btl_tcp_endpoint.c:565
[system76-pc:3884807] mca: base: close: component tcp closed
[system76-pc:3884806] mca: base: close: component sm closed
[system76-pc:3884807] mca: base: close: unloading component tcp
[system76-pc:3884806] mca: base: close: unloading component sm
[system76-pc:3884806] CLOSE FD 12 at btl_tcp_component.c:452
[system76-pc:3884806] mca: base: close: component tcp closed
[system76-pc:3884805] mca: base: close: component self closed
[system76-pc:3884806] mca: base: close: unloading component tcp
[system76-pc:3884805] mca: base: close: unloading component self
[system76-pc:3884805] CLOSE FD 12 at btl_tcp_component.c:452
[system76-pc:3884805] mca: base: close: component tcp closed
[system76-pc:3884805] mca: base: close: unloading component tcp
fW3Aw8j: completed MPI_Finalize in 0.001s
0.090s: flux-shell[1]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fW3Aw8j","rank":-2}],"info":[],"data":"CQIbBAkCAgA=","cbfunc":140014027995399,"cbdata":140013853560944}
0.090s: flux-shell[1]: TRACE: starting pmix exchange 2: size 0
0.090s: flux-shell[1]: TRACE: completed pmix exchange 2: size 0 SUCCESS
0.091s: flux-shell[1]: TRACE: 2: C: pmi EOF
0.091s: flux-shell[1]: DEBUG: task 2 complete status=0
0.101s: flux-shell[1]: DEBUG: exit 0
0.091s: flux-shell[0]: TRACE: 0: C: pmi EOF
0.091s: flux-shell[0]: DEBUG: task 0 complete status=0
0.091s: flux-shell[0]: TRACE: 1: C: pmi EOF
0.091s: flux-shell[0]: DEBUG: task 1 complete status=0
0.103s: flux-shell[0]: DEBUG: exit 0
ok 7 - 2n3p ompi hello doesnt hang
--env=OMPI_MCA_btl_tcp_if_include=lo
also works. I wonder if that should be set in the environment for all ompi tests here.
This doesn't seem to be pmix related unless I'm really missing something.
This appears to be easily reproducible: