flux-framework / flux-pmix

flux shell plugin to bootstrap openmpi v5+
GNU Lesser General Public License v3.0

ompi hangs on 2n3p hello world #27

Closed by garlick 3 years ago

garlick commented 3 years ago

This appears to be easily reproducible:

expecting success: 
    run_timeout 10 flux mini run -N2 -n3 \
        -ouserrc=$(pwd)/rc.lua \
        -overbose=2 \
        -ompi=openmpi@5 \
        ${MPI_HELLO}

0.022s: flux-shell[1]: DEBUG: 1: tasks [2] on cores 0-1
0.023s: flux-shell[1]: DEBUG: Loading /opt/flux-core-0.29.0-69/etc/flux/shell/initrc.lua
0.056s: flux-shell[1]: DEBUG: pmix: local_peers = 2
0.056s: flux-shell[1]: DEBUG: pmix: node_map = system76-pc0,system76-pc1
0.056s: flux-shell[1]: DEBUG: pmix: proc_map = 0,1;2
0.058s: flux-shell[1]: TRACE: shell barrier complete
0.058s: flux-shell[1]: TRACE: exited barrier with rc = 0
0.061s: flux-shell[1]: TRACE: shell barrier complete
0.061s: flux-shell[1]: TRACE: exited barrier with rc = 0
0.020s: flux-shell[0]: DEBUG: 0: task_count=3 slot_count=4 cores_per_slot=1 slots_per_node=2
0.020s: flux-shell[0]: DEBUG: 0: tasks [0-1] on cores 0-1
0.020s: flux-shell[0]: DEBUG: Loading /opt/flux-core-0.29.0-69/etc/flux/shell/initrc.lua
0.024s: flux-shell[0]: DEBUG: output: batch timeout = 0.500s
0.044s: flux-shell[0]: DEBUG: pmix: server outsourced to OpenPMIx 4.1.0rc2
0.054s: flux-shell[0]: DEBUG: pmix: local_peers = 0,1
0.054s: flux-shell[0]: DEBUG: pmix: node_map = system76-pc0,system76-pc1
0.054s: flux-shell[0]: DEBUG: pmix: proc_map = 0,1;2
0.058s: flux-shell[0]: TRACE: shell barrier complete
0.058s: flux-shell[0]: TRACE: exited barrier with rc = 0
0.061s: flux-shell[0]: TRACE: shell barrier complete
0.061s: flux-shell[0]: TRACE: exited barrier with rc = 0
0.076s: flux-shell[1]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"f63ByTLj","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"CQIbXgkCAgEJAhtWCQIoAgkCHBhidGwudGNwLjUuMAAbQAoAAg0AAAAAAAAAAAAAAAACAAAAGAAAAOgDAAAEAAAArBEAAQAAAAAAAAAAAAAAAAQAAAAQAAAAECcAAAQAAAA=","cbfunc":139857592263065,"cbdata":139857422805952}
0.076s: flux-shell[1]: TRACE: starting pmix exchange 0: size 98
0.112s: flux-shell[0]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"f63ByTLj","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"CQIbzwEJAgIBCQIbVgkCKAEJAhwYYnRsLnRjcC41LjAAG0AKAAINAAAAAAAAAAAAAAAAAgAAABgAAADoAwAABAIAAKwRAAEAAAAAAAAAAAAAAAAEAAAAEAAAABAnAAAEAgAACQIbbQkCKAAJAhwYYnRsLnRjcC41LjAAG0AKAAINAAAAAAAAAAAAAAAAAgAAABgAAADoAwAABAMAAKwRAAEAAAAAAAAAAAAAAAAEAAAAEAAAABAnAAAEAwAACQIcGnBtbC5iYXNlLjIuMAAbBG9iMQA=","cbfunc":139977727943065,"cbdata":139977614781808}
0.113s: flux-shell[0]: TRACE: completed pmix exchange 0: size 310 SUCCESS
0.113s: flux-shell[0]: TRACE: starting pmix exchange 0: size 212
0.114s: flux-shell[0]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"f63ByTLj","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":false}}],"data":"CQIbBAkCAgA=","cbfunc":139977727943065,"cbdata":139977614781808}
0.114s: flux-shell[0]: TRACE: completed pmix exchange 1: size 0 SUCCESS
0.114s: flux-shell[0]: TRACE: starting pmix exchange 1: size 0
0.113s: flux-shell[1]: TRACE: completed pmix exchange 0: size 310 SUCCESS
0.113s: flux-shell[1]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"f63ByTLj","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":false}}],"data":"CQIbBAkCAgA=","cbfunc":139857592263065,"cbdata":139857422805952}
0.114s: flux-shell[1]: TRACE: starting pmix exchange 1: size 0
0.114s: flux-shell[1]: TRACE: completed pmix exchange 1: size 0 SUCCESS
f63ByTLj: completed MPI_Init in 0.054s.  There are 3 tasks
not ok 7 - 2n3p ompi hello doesnt hang
garlick commented 3 years ago

There are usually 3 fences in a successful mpi_hello. Two occur during MPI_Init(): the first carries data, the second does not (a barrier). The third fence corresponds to the MPI_Barrier() call in the hello program, e.g. for 2n2p (labeled "exchange 2"):

[snip]
5.111s: flux-shell[0]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"f3gD1p8o","rank":-2}],"info":[],"data":"CQIbBAkCAgA=","cbfunc":140669182951833,"cbdata":140669037411392}
5.111s: flux-shell[0]: TRACE: starting pmix exchange 2: size 0
5.111s: flux-shell[0]: TRACE: completed pmix exchange 2: size 0 SUCCESS
5.114s: flux-shell[0]: TRACE: 0: C: pmi EOF
5.114s: flux-shell[0]: DEBUG: task 0 complete status=0
5.111s: flux-shell[1]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"f3gD1p8o","rank":-2}],"info":[],"data":"CQIbBAkCAgA=","cbfunc":139765176152473,"cbdata":139765013904448}
5.111s: flux-shell[1]: TRACE: starting pmix exchange 2: size 0
5.112s: flux-shell[1]: TRACE: completed pmix exchange 2: size 0 SUCCESS
5.114s: flux-shell[1]: TRACE: 1: C: pmi EOF
5.114s: flux-shell[1]: DEBUG: task 1 complete status=0
5.122s: flux-shell[0]: DEBUG: exit 0
5.123s: flux-shell[1]: DEBUG: exit 0

In the 2n3p failing case, the upcalls never come for MPI_Barrier().
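A quick way to see which fences made it through on each shell is to scan the `-overbose=2` trace output for the "starting/completed pmix exchange" lines. This is a hypothetical helper, not part of flux-pmix; the line format is taken from the logs above.

```python
import re

# Summarize which pmix exchanges each flux-shell started and completed,
# from "-overbose=2" trace output.
PAT = re.compile(
    r"flux-shell\[(\d+)\]: TRACE: (starting|completed) pmix exchange (\d+)"
)

def fence_summary(log_lines):
    """Return {shell_rank: {"starting": set, "completed": set}}."""
    shells = {}
    for line in log_lines:
        m = PAT.search(line)
        if not m:
            continue
        rank, what, seq = int(m.group(1)), m.group(2), int(m.group(3))
        s = shells.setdefault(rank, {"starting": set(), "completed": set()})
        s[what].add(seq)
    return shells

# Sample lines from the failing 2n3p run above: exchange 2 (the
# MPI_Barrier fence) never appears.
log = [
    '0.113s: flux-shell[0]: TRACE: completed pmix exchange 0: size 310 SUCCESS',
    '0.113s: flux-shell[0]: TRACE: starting pmix exchange 0: size 212',
    '0.114s: flux-shell[0]: TRACE: completed pmix exchange 1: size 0 SUCCESS',
    '0.114s: flux-shell[0]: TRACE: starting pmix exchange 1: size 0',
]
summary = fence_summary(log)
print(sorted(summary[0]["completed"]))  # exchange 2 is missing
```

In the passing 2n2p run, the same scan would show exchange 2 completing on both shells.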

garlick commented 3 years ago

Running with --env=PMIX_MCA_pmix_client_fence_verbose=100:

expecting success: 
    run_timeout 60 flux mini run -N2 -n3 \
        -ouserrc=$(pwd)/rc.lua \
        -overbose=2 \
        -ompi=openmpi@5 \
        --env=PMIX_MCA_pmix_client_fence_verbose=100 \
        ${MPI_HELLO}

0.022s: flux-shell[1]: DEBUG: 1: tasks [2] on cores 0-1
0.022s: flux-shell[1]: DEBUG: Loading /opt/flux-core-0.29.0-69/etc/flux/shell/initrc.lua
0.060s: flux-shell[1]: DEBUG: pmix: local_peers = 2
0.060s: flux-shell[1]: DEBUG: pmix: node_map = system76-pc0,system76-pc1
0.060s: flux-shell[1]: DEBUG: pmix: proc_map = 0,1;2
0.081s: flux-shell[1]: TRACE: shell barrier complete
0.081s: flux-shell[1]: TRACE: exited barrier with rc = 0
0.084s: flux-shell[1]: TRACE: shell barrier complete
0.084s: flux-shell[1]: TRACE: exited barrier with rc = 0
0.020s: flux-shell[0]: DEBUG: 0: task_count=3 slot_count=4 cores_per_slot=1 slots_per_node=2
0.020s: flux-shell[0]: DEBUG: 0: tasks [0-1] on cores 0-1
0.020s: flux-shell[0]: DEBUG: Loading /opt/flux-core-0.29.0-69/etc/flux/shell/initrc.lua
0.023s: flux-shell[0]: DEBUG: output: batch timeout = 0.500s
0.046s: flux-shell[0]: DEBUG: pmix: server outsourced to OpenPMIx 4.1.0rc2
0.080s: flux-shell[0]: DEBUG: pmix: local_peers = 0,1
0.080s: flux-shell[0]: DEBUG: pmix: node_map = system76-pc0,system76-pc1
0.080s: flux-shell[0]: DEBUG: pmix: proc_map = 0,1;2
0.081s: flux-shell[0]: TRACE: shell barrier complete
0.081s: flux-shell[0]: TRACE: exited barrier with rc = 0
0.084s: flux-shell[0]: TRACE: shell barrier complete
0.084s: flux-shell[0]: TRACE: exited barrier with rc = 0
0.099s: flux-shell[1]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"f619NSo9","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"CQIbXgkCAgEJAhtWCQIoAgkCHBhidGwudGNwLjUuMAAbQAoAAg0AAAAAAAAAAAAAAAACAAAAGAAAAOgDAAAEAQAArBEAAQAAAAAAAAAAAAAAAAQAAAAQAAAAECcAAAQBAAA=","cbfunc":139689513144729,"cbdata":139689382210496}
0.099s: flux-shell[1]: TRACE: starting pmix exchange 0: size 98
0.123s: flux-shell[0]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"f619NSo9","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"CQIbzwEJAgIBCQIbbQkCKAAJAhwYYnRsLnRjcC41LjAAG0AKAAINAAAAAAAAAAAAAAAAAgAAABgAAADoAwAABAIAAKwRAAEAAAAAAAAAAAAAAAAEAAAAEAAAABAnAAAEAgAACQIcGnBtbC5iYXNlLjIuMAAbBG9iMQAJAhtWCQIoAQkCHBhidGwudGNwLjUuMAAbQAoAAg0AAAAAAAAAAAAAAAACAAAAGAAAAOgDAAAEAwAArBEAAQAAAAAAAAAAAAAAAAQAAAAQAAAAECcAAAQDAAA=","cbfunc":140288944800153,"cbdata":140288798583904}
0.123s: flux-shell[0]: TRACE: completed pmix exchange 0: size 310 SUCCESS
0.123s: flux-shell[0]: TRACE: starting pmix exchange 0: size 212
0.125s: flux-shell[0]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"f619NSo9","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":false}}],"data":"CQIbBAkCAgA=","cbfunc":140288944800153,"cbdata":140288798583904}
0.125s: flux-shell[0]: TRACE: completed pmix exchange 1: size 0 SUCCESS
0.125s: flux-shell[0]: TRACE: starting pmix exchange 1: size 0
0.124s: flux-shell[1]: TRACE: completed pmix exchange 0: size 310 SUCCESS
0.124s: flux-shell[1]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"f619NSo9","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":false}}],"data":"CQIbBAkCAgA=","cbfunc":139689513144729,"cbdata":139689382210496}
0.124s: flux-shell[1]: TRACE: starting pmix exchange 1: size 0
0.125s: flux-shell[1]: TRACE: completed pmix exchange 1: size 0 SUCCESS
[system76-pc:2876007] pmix: fence_nb called
[system76-pc:2876008] pmix: fence_nb called
[system76-pc:2876009] pmix: fence_nb called
[system76-pc:2876008] pmix: fence_nb callback recvd
[system76-pc:2876009] pmix: fence_nb callback recvd
[system76-pc:2876008] client:unpack fence called
[system76-pc:2876009] client:unpack fence called
[system76-pc:2876008] client:unpack fence received status 0
[system76-pc:2876009] client:unpack fence received status 0
[system76-pc:2876007] pmix: fence_nb callback recvd
[system76-pc:2876007] client:unpack fence called
[system76-pc:2876007] client:unpack fence received status 0
[system76-pc:2876007] pmix: fence_nb called
[system76-pc:2876008] pmix: fence_nb called
[system76-pc:2876009] pmix: fence_nb called
[system76-pc:2876008] pmix: fence_nb callback recvd
[system76-pc:2876009] pmix: fence_nb callback recvd
[system76-pc:2876008] client:unpack fence called
[system76-pc:2876009] client:unpack fence called
[system76-pc:2876008] client:unpack fence received status 0
[system76-pc:2876009] client:unpack fence received status 0
f619NSo9: completed MPI_Init in 0.041s.  There are 3 tasks
[system76-pc:2876007] pmix: fence_nb callback recvd
[system76-pc:2876007] client:unpack fence called
[system76-pc:2876007] client:unpack fence received status 0

<sleep 5 inserted in test, after MPI_Init()>

[system76-pc:2876009] pmix: fence_nb called

<hang until 60s test timeout>

not ok 7 - 2n3p ompi hello doesnt hang

So it would appear that only one ompi proc makes the PMIx_Fence() call, and that alone did not result in an upcall.

garlick commented 3 years ago

I should add: the one proc that makes the PMIx_Fence() call is rank 1:

fXPkH51: rank 1 is pid 3856230
fXPkH51: completed MPI_Init in 0.021s.  There are 3 tasks
[system76-pc:3856228] client:unpack fence called
fXPkH51: rank 0 is pid 3856229
[system76-pc:3856228] client:unpack fence received status 0
fXPkH51: rank 2 is pid 3856228
[system76-pc:3856230] pmix: fence_nb called

So the openpmix server on shell 0 is correct in not making the upcall - it is waiting for rank 0 to make the call too.

Still no clue why ompi is not calling PMIx_Fence() on all ranks when they are all calling into MPI_Barrier().
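The server-side behavior can be sketched as follows. This is an illustrative toy, not OpenPMIx source: a PMIx server only makes the fence upcall once every local participant has called PMIx_Fence(), which is why shell 0's server stays silent above after only rank 1 fences.

```python
# Toy model of per-shell fence accounting (illustration only).
class LocalFenceTracker:
    def __init__(self, local_ranks):
        # Ranks on this shell that have not yet called PMIx_Fence().
        self.waiting = set(local_ranks)

    def fence(self, rank):
        """Record a fence call; return True once the upcall may proceed."""
        self.waiting.discard(rank)
        return not self.waiting

# proc_map = 0,1;2: shell 0 hosts ranks 0-1, shell 1 hosts rank 2.
shell0 = LocalFenceTracker({0, 1})
shell1 = LocalFenceTracker({2})

print(shell1.fence(2))  # rank 2 fenced: shell 1 could upcall
print(shell0.fence(1))  # rank 1 fenced: shell 0 still waits on rank 0
```

Under this model the logs are consistent: the hang is in ompi's failure to reach PMIx_Fence() on ranks 0 and 2, not in the flux-pmix plugin's fence handling.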

garlick commented 3 years ago

Here's an attempt with some debugging enabled in ompi's BTL (byte transfer layer) stack, which seems interesting:

expecting success: 
    run_timeout 60 flux mini run -N2 -n3 \
        -ouserrc=$(pwd)/rc.lua \
        -overbose=2 \
        -ompi=openmpi@5 \
        --env=OMPI_MCA_btl_base_verbose=100 \
        ${MPI_HELLO}

0.024s: flux-shell[1]: DEBUG: 1: tasks [2] on cores 0-1
0.024s: flux-shell[1]: DEBUG: Loading /opt/flux-core-0.29.0-69/etc/flux/shell/initrc.lua
0.033s: flux-shell[1]: DEBUG: pmix: jobid = 20099104768
0.033s: flux-shell[1]: DEBUG: pmix: shell_rank = 1
0.033s: flux-shell[1]: DEBUG: pmix: local_nprocs = 1
0.033s: flux-shell[1]: DEBUG: pmix: total_nprocs = 3
0.041s: flux-shell[1]: DEBUG: pmix: local_peers = 2
0.041s: flux-shell[1]: DEBUG: pmix: node_map = system76-pc0,system76-pc1
0.041s: flux-shell[1]: DEBUG: pmix: proc_map = 0,1;2
0.043s: flux-shell[1]: TRACE: shell barrier complete
0.043s: flux-shell[1]: TRACE: exited barrier with rc = 0
0.046s: flux-shell[1]: TRACE: shell barrier complete
0.046s: flux-shell[1]: TRACE: exited barrier with rc = 0
0.021s: flux-shell[0]: DEBUG: 0: task_count=3 slot_count=4 cores_per_slot=1 slots_per_node=2
0.021s: flux-shell[0]: DEBUG: 0: tasks [0-1] on cores 0-1
0.021s: flux-shell[0]: DEBUG: Loading /opt/flux-core-0.29.0-69/etc/flux/shell/initrc.lua
0.024s: flux-shell[0]: DEBUG: output: batch timeout = 0.500s
0.027s: flux-shell[0]: DEBUG: pmix: jobid = 20099104768
0.027s: flux-shell[0]: DEBUG: pmix: shell_rank = 0
0.027s: flux-shell[0]: DEBUG: pmix: local_nprocs = 2
0.027s: flux-shell[0]: DEBUG: pmix: total_nprocs = 3
0.027s: flux-shell[0]: DEBUG: pmix: server outsourced to OpenPMIx 4.1.1rc1
0.037s: flux-shell[0]: DEBUG: pmix: local_peers = 0,1
0.037s: flux-shell[0]: DEBUG: pmix: node_map = system76-pc0,system76-pc1
0.037s: flux-shell[0]: DEBUG: pmix: proc_map = 0,1;2
0.043s: flux-shell[0]: TRACE: shell barrier complete
0.043s: flux-shell[0]: TRACE: exited barrier with rc = 0
0.046s: flux-shell[0]: TRACE: shell barrier complete
0.046s: flux-shell[0]: TRACE: exited barrier with rc = 0
0.062s: flux-shell[1]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fXd6Ac7","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"CQIbXgkCAgEJAhtWCQIoAgkCHBhidGwudGNwLjUuMAAbQAoAAg0AAAAAAAAAAAAAAAACAAAAGAAAAOgDAAAEAQAArBEAAQAAAAAAAAAAAAAAAAQAAAAQAAAAECcAAAQBAAA=","cbfunc":139628907432199,"cbdata":139628782899312}
0.062s: flux-shell[1]: TRACE: starting pmix exchange 0: size 98
0.065s: flux-shell[1]: TRACE: completed pmix exchange 0: size 310 SUCCESS
0.066s: flux-shell[1]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fXd6Ac7","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":false}}],"data":"CQIbBAkCAgA=","cbfunc":139628907432199,"cbdata":139628782899312}
0.066s: flux-shell[1]: TRACE: starting pmix exchange 1: size 0
0.066s: flux-shell[1]: TRACE: completed pmix exchange 1: size 0 SUCCESS
0.065s: flux-shell[0]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fXd6Ac7","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"CQIbzwEJAgIBCQIbbQkCKAAJAhwYYnRsLnRjcC41LjAAG0AKAAINAAAAAAAAAAAAAAAAAgAAABgAAADoAwAABAMAAKwRAAEAAAAAAAAAAAAAAAAEAAAAEAAAABAnAAAEAwAACQIcGnBtbC5iYXNlLjIuMAAbBG9iMQAJAhtWCQIoAQkCHBhidGwudGNwLjUuMAAbQAoAAg0AAAAAAAAAAAAAAAACAAAAGAAAAOgDAAAEBAAArBEAAQAAAAAAAAAAAAAAAAQAAAAQAAAAECcAAAQEAAA=","cbfunc":140537829677319,"cbdata":140537705360448}
0.065s: flux-shell[0]: TRACE: completed pmix exchange 0: size 310 SUCCESS
0.065s: flux-shell[0]: TRACE: starting pmix exchange 0: size 212
0.066s: flux-shell[0]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fXd6Ac7","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":false}}],"data":"CQIbBAkCAgA=","cbfunc":140537829677319,"cbdata":140537705360448}
0.066s: flux-shell[0]: TRACE: completed pmix exchange 1: size 0 SUCCESS
0.066s: flux-shell[0]: TRACE: starting pmix exchange 1: size 0
[system76-pc:3874864] mca: base: components_register: registering framework btl components
[system76-pc:3874865] mca: base: components_register: registering framework btl components
[system76-pc:3874864] mca: base: components_register: found loaded component self
[system76-pc:3874865] mca: base: components_register: found loaded component self
[system76-pc:3874864] mca: base: components_register: component self register function successful
[system76-pc:3874865] mca: base: components_register: component self register function successful
[system76-pc:3874864] mca: base: components_register: found loaded component sm
[system76-pc:3874865] mca: base: components_register: found loaded component sm
[system76-pc:3874864] mca: base: components_register: component sm register function successful
[system76-pc:3874865] mca: base: components_register: component sm register function successful
[system76-pc:3874864] mca: base: components_register: found loaded component tcp
[system76-pc:3874865] mca: base: components_register: found loaded component tcp
[system76-pc:3874864] mca: base: components_register: component tcp register function successful
[system76-pc:3874865] mca: base: components_register: component tcp register function successful
[system76-pc:3874864] mca: base: components_open: opening btl components
[system76-pc:3874865] mca: base: components_open: opening btl components
[system76-pc:3874864] mca: base: components_open: found loaded component self
[system76-pc:3874865] mca: base: components_open: found loaded component self
[system76-pc:3874864] mca: base: components_open: component self open function successful
[system76-pc:3874865] mca: base: components_open: component self open function successful
[system76-pc:3874864] mca: base: components_open: found loaded component sm
[system76-pc:3874865] mca: base: components_open: found loaded component sm
[system76-pc:3874864] mca: base: components_open: component sm open function successful
[system76-pc:3874865] mca: base: components_open: component sm open function successful
[system76-pc:3874864] mca: base: components_open: found loaded component tcp
[system76-pc:3874865] mca: base: components_open: found loaded component tcp
[system76-pc:3874864] mca: base: components_open: component tcp open function successful
[system76-pc:3874865] mca: base: components_open: component tcp open function successful
[system76-pc:3874864] select: initializing btl component self
[system76-pc:3874865] select: initializing btl component self
[system76-pc:3874864] select: init of component self returned success
[system76-pc:3874865] select: init of component self returned success
[system76-pc:3874864] select: initializing btl component sm
[system76-pc:3874865] select: initializing btl component sm
[system76-pc1][[5524,0],2][btl_sm_component.c:512:mca_btl_sm_component_init] No peers to communicate with. Disabling sm.
[system76-pc:3874865] select: init of component sm returned success
[system76-pc:3874864] select: init of component sm returned failure
[system76-pc:3874865] select: initializing btl component tcp
[system76-pc:3874864] mca: base: close: component sm closed
[system76-pc:3874865] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[system76-pc:3874864] mca: base: close: unloading component sm
[system76-pc:3874865] btl: tcp: Found match: 127.0.0.1 (lo)
[system76-pc:3874864] select: initializing btl component tcp
[system76-pc:3874865] btl:tcp: 0x55b4a038ad90: if enp4s0 kidx 2 cnt 0 addr 10.0.2.13 IPv4 bw 1000 lt 100
[system76-pc:3874864] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[system76-pc:3874865] btl:tcp: 0x55b4a038b920: if docker0 kidx 4 cnt 0 addr 172.17.0.1 IPv4 bw 10000 lt 100
[system76-pc:3874864] btl: tcp: Found match: 127.0.0.1 (lo)
[system76-pc:3874865] btl:tcp: Attempting to bind to AF_INET port 1024
[system76-pc:3874864] btl:tcp: 0x55e8271564f0: if enp4s0 kidx 2 cnt 0 addr 10.0.2.13 IPv4 bw 1000 lt 100
[system76-pc:3874865] btl:tcp: Attempting to bind to AF_INET port 1025
[system76-pc:3874864] btl:tcp: 0x55e827157080: if docker0 kidx 4 cnt 0 addr 172.17.0.1 IPv4 bw 10000 lt 100
[system76-pc:3874865] btl:tcp: Attempting to bind to AF_INET port 1026
[system76-pc:3874864] btl:tcp: Attempting to bind to AF_INET port 1024
[system76-pc:3874865] btl:tcp: Attempting to bind to AF_INET port 1027
[system76-pc:3874864] btl:tcp: Attempting to bind to AF_INET port 1025
[system76-pc:3874865] btl:tcp: Successfully bound to AF_INET port 1027
[system76-pc:3874864] btl:tcp: Successfully bound to AF_INET port 1025
[system76-pc:3874865] btl:tcp: my listening v4 socket is 0.0.0.0:1027
[system76-pc:3874864] btl:tcp: my listening v4 socket is 0.0.0.0:1025
[system76-pc:3874865] btl: tcp: exchange: 0 2 IPv4 10.0.2.13
[system76-pc:3874864] btl: tcp: exchange: 0 2 IPv4 10.0.2.13
[system76-pc:3874865] btl: tcp: exchange: 1 4 IPv4 172.17.0.1
[system76-pc:3874864] btl: tcp: exchange: 1 4 IPv4 172.17.0.1
[system76-pc:3874865] select: init of component tcp returned success
[system76-pc:3874864] select: init of component tcp returned success
[system76-pc:3874866] mca: base: components_register: registering framework btl components
[system76-pc:3874866] mca: base: components_register: found loaded component self
[system76-pc:3874866] mca: base: components_register: component self register function successful
[system76-pc:3874866] mca: base: components_register: found loaded component sm
[system76-pc:3874866] mca: base: components_register: component sm register function successful
[system76-pc:3874866] mca: base: components_register: found loaded component tcp
[system76-pc:3874866] mca: base: components_register: component tcp register function successful
[system76-pc:3874866] mca: base: components_open: opening btl components
[system76-pc:3874866] mca: base: components_open: found loaded component self
[system76-pc:3874866] mca: base: components_open: component self open function successful
[system76-pc:3874866] mca: base: components_open: found loaded component sm
[system76-pc:3874866] mca: base: components_open: component sm open function successful
[system76-pc:3874866] mca: base: components_open: found loaded component tcp
[system76-pc:3874866] mca: base: components_open: component tcp open function successful
[system76-pc:3874866] select: initializing btl component self
[system76-pc:3874866] select: init of component self returned success
[system76-pc:3874866] select: initializing btl component sm
[system76-pc:3874866] select: init of component sm returned success
[system76-pc:3874866] select: initializing btl component tcp
[system76-pc:3874866] btl: tcp: Searching for exclude address+prefix: 127.0.0.1 / 8
[system76-pc:3874866] btl: tcp: Found match: 127.0.0.1 (lo)
[system76-pc:3874866] btl:tcp: 0x56085a2a2d90: if enp4s0 kidx 2 cnt 0 addr 10.0.2.13 IPv4 bw 1000 lt 100
[system76-pc:3874866] btl:tcp: 0x56085a2a3920: if docker0 kidx 4 cnt 0 addr 172.17.0.1 IPv4 bw 10000 lt 100
[system76-pc:3874866] btl:tcp: Attempting to bind to AF_INET port 1024
[system76-pc:3874866] btl:tcp: Attempting to bind to AF_INET port 1025
[system76-pc:3874866] btl:tcp: Attempting to bind to AF_INET port 1026
[system76-pc:3874866] btl:tcp: Attempting to bind to AF_INET port 1027
[system76-pc:3874866] btl:tcp: Attempting to bind to AF_INET port 1028
[system76-pc:3874866] btl:tcp: Successfully bound to AF_INET port 1028
[system76-pc:3874866] btl:tcp: my listening v4 socket is 0.0.0.0:1028
[system76-pc:3874866] btl: tcp: exchange: 0 2 IPv4 10.0.2.13
[system76-pc:3874866] btl: tcp: exchange: 1 4 IPv4 172.17.0.1
[system76-pc:3874866] select: init of component tcp returned success
[system76-pc:3874865] mca: bml: Using self btl for send to [[5524,0],0] on node system76-pc0
[system76-pc:3874866] mca: bml: Using self btl for send to [[5524,0],1] on node system76-pc0
[system76-pc:3874864] mca: bml: Using self btl for send to [[5524,0],2] on node system76-pc1
[system76-pc:3874865] mca: bml: Using sm btl for send to [[5524,0],1] on node system76-pc0
[system76-pc:3874866] mca: bml: Using sm btl for send to [[5524,0],0] on node system76-pc0
fXd6Ac7: rank 1 is pid 3874866
fXd6Ac7: completed MPI_Init in 0.020s.  There are 3 tasks
fXd6Ac7: rank 0 is pid 3874865
fXd6Ac7: rank 2 is pid 3874864
[system76-pc0][[5524,0],0][btl_tcp_proc.c:554:mca_btl_tcp_proc_lookup] adding tcp proc for peer {[[5524,0],2]}
[system76-pc:3874864] mca: bml: Using tcp btl for send to [[5524,0],0] on node system76-pc0
[system76-pc:3874864] mca: bml: Using tcp btl for send to [[5524,0],0] on node system76-pc0
[system76-pc:3874864] btl: tcp: attempting to connect() to [[5524,0],0] address 10.0.2.13 on port 1027
[system76-pc:3874865] btl:tcp: now connected to 10.0.2.13, process [[5524,0],2]
[system76-pc:3874864] btl:tcp: would block, so allowing background progress
[system76-pc:3874864] btl:tcp: connect() to 10.0.2.13:1027 completed (complete_connect), sending connect ACK
[system76-pc:3874865] mca: bml: Using tcp btl for send to [[5524,0],2] on node system76-pc1
[system76-pc:3874865] mca: bml: Using tcp btl for send to [[5524,0],2] on node system76-pc1
[system76-pc:3874865] btl: tcp: attempting to connect() to [[5524,0],2] address 172.17.0.1 on port 1025
[system76-pc:3874865] btl:tcp: would block, so allowing background progress
[system76-pc:3874865] btl:tcp: connect() to 172.17.0.1:1025 completed (complete_connect), sending connect ACK
[system76-pc:3874865] remote peer unexpectedly closed connection while I was waiting for a blocking message
[system76-pc:3874865] CLOSE FD 14 at btl_tcp_endpoint.c:565
[system76-pc:3874864] btl: tcp: Match incoming connection from [[5524,0],0] 10.0.2.13 with locally known IP 172.17.0.1 failed (iface 1/2)!
[system76-pc:3874864] btl:tcp: now connected to 10.0.2.13, process [[5524,0],0]
[system76-pc:3874864] CLOSE FD 14 at btl_tcp_endpoint.c:495
not ok 7 - 2n3p ompi hello doesnt hang

So the three ranks are listening on:

rank 0: [system76-pc:3874865] btl:tcp: my listening v4 socket is 0.0.0.0:1027
rank 1: [system76-pc:3874866] btl:tcp: my listening v4 socket is 0.0.0.0:1028
rank 2: [system76-pc:3874864] btl:tcp: my listening v4 socket is 0.0.0.0:1025

It seems like connections were established over tcp between rank 0 (leader for shell 0) and rank 2 (leader for shell 1).

rank 2 to rank 0: [system76-pc:3874864] btl:tcp: now connected to 10.0.2.13, process [[5524,0],0]
rank 0 to rank 2: [system76-pc:3874865] btl:tcp: now connected to 10.0.2.13, process [[5524,0],2]

Initially I thought the closed connections on 172.17.0.1 (docker0) might be a problem, but from the logs it appears those connections simply failed and were discarded in favor of the 10.0.2.13 (enp4s0) ones.
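One trap with docker0 is that every docker host configures the same bridge address, so a peer's advertised 172.17.0.1 passes any subnet-based reachability check yet actually points at the connecting host's own bridge. A rough sketch of that kind of candidate selection (illustration only, not Open MPI's actual matching code, using the addresses from the logs above):

```python
import ipaddress

def reachable_candidates(peer_addrs, local_ifaces):
    """Keep peer addresses that fall inside some local interface subnet."""
    out = []
    for addr in peer_addrs:
        ip = ipaddress.ip_address(addr)
        for net in local_ifaces:
            if ip in ipaddress.ip_network(net, strict=False):
                out.append(addr)
                break
    return out

peer = ["10.0.2.13", "172.17.0.1"]           # peer's advertised addresses
local = ["10.0.2.13/24", "172.17.0.1/16"]    # this rank's interfaces
print(reachable_candidates(peer, local))
# 172.17.0.1 passes the subnet test, but connecting to it reaches the
# local docker bridge, not the peer -- hence the immediately closed
# connections seen in the log.
```

That would explain why the failed docker0 connections show up at all, even if they are ultimately discarded.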

garlick commented 3 years ago

Er, except it works if I exclude docker0. Hmm.

expecting success: 
    run_timeout 60 flux mini run -N2 -n3 \
        -ouserrc=$(pwd)/rc.lua \
        -overbose=2 \
        -ompi=openmpi@5 \
        --env=OMPI_MCA_btl_base_verbose=100 \
        --env=OMPI_MCA_btl_tcp_if_exclude=docker0 \
        ${MPI_HELLO}

0.020s: flux-shell[0]: DEBUG: 0: task_count=3 slot_count=4 cores_per_slot=1 slots_per_node=2
0.020s: flux-shell[0]: DEBUG: 0: tasks [0-1] on cores 0-1
0.021s: flux-shell[0]: DEBUG: Loading /opt/flux-core-0.29.0-69/etc/flux/shell/initrc.lua
0.024s: flux-shell[0]: DEBUG: output: batch timeout = 0.500s
0.027s: flux-shell[0]: DEBUG: pmix: jobid = 19058917376
0.027s: flux-shell[0]: DEBUG: pmix: shell_rank = 0
0.027s: flux-shell[0]: DEBUG: pmix: local_nprocs = 2
0.027s: flux-shell[0]: DEBUG: pmix: total_nprocs = 3
0.027s: flux-shell[0]: DEBUG: pmix: server outsourced to OpenPMIx 4.1.1rc1
0.038s: flux-shell[0]: DEBUG: pmix: local_peers = 0,1
0.038s: flux-shell[0]: DEBUG: pmix: node_map = system76-pc0,system76-pc1
0.038s: flux-shell[0]: DEBUG: pmix: proc_map = 0,1;2
0.058s: flux-shell[0]: TRACE: shell barrier complete
0.058s: flux-shell[0]: TRACE: exited barrier with rc = 0
0.062s: flux-shell[0]: TRACE: shell barrier complete
0.062s: flux-shell[0]: TRACE: exited barrier with rc = 0
0.023s: flux-shell[1]: DEBUG: 1: tasks [2] on cores 0-1
0.024s: flux-shell[1]: DEBUG: Loading /opt/flux-core-0.29.0-69/etc/flux/shell/initrc.lua
0.048s: flux-shell[1]: DEBUG: pmix: jobid = 19058917376
0.048s: flux-shell[1]: DEBUG: pmix: shell_rank = 1
0.048s: flux-shell[1]: DEBUG: pmix: local_nprocs = 1
0.048s: flux-shell[1]: DEBUG: pmix: total_nprocs = 3
0.056s: flux-shell[1]: DEBUG: pmix: local_peers = 2
0.056s: flux-shell[1]: DEBUG: pmix: node_map = system76-pc0,system76-pc1
0.056s: flux-shell[1]: DEBUG: pmix: proc_map = 0,1;2
0.058s: flux-shell[1]: TRACE: shell barrier complete
0.058s: flux-shell[1]: TRACE: exited barrier with rc = 0
0.065s: flux-shell[1]: TRACE: shell barrier complete
0.066s: flux-shell[1]: TRACE: exited barrier with rc = 0
0.077s: flux-shell[1]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fW3Aw8j","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"CQIbXgkCAgEJAhtWCQIoAgkCHBhidGwudGNwLjUuMAAbQH8AAAEAAAAAAAAAAAAAAAABAAAACAAAAGQAAAAEAAAACgACDQAAAAAAAAAAAAAAAAIAAAAYAAAA6AMAAAQAAAA=","cbfunc":140014027995399,"cbdata":140013853560944}
0.078s: flux-shell[1]: TRACE: starting pmix exchange 0: size 98
0.081s: flux-shell[1]: TRACE: completed pmix exchange 0: size 310 SUCCESS
0.081s: flux-shell[1]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fW3Aw8j","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":false}}],"data":"CQIbBAkCAgA=","cbfunc":140014027995399,"cbdata":140013853560944}
0.081s: flux-shell[1]: TRACE: starting pmix exchange 1: size 0
0.082s: flux-shell[1]: TRACE: completed pmix exchange 1: size 0 SUCCESS
0.080s: flux-shell[0]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fW3Aw8j","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":true}}],"data":"CQIbzwEJAgIBCQIbVgkCKAEJAhwYYnRsLnRjcC41LjAAG0B/AAABAAAAAAAAAAAAAAAAAQAAAAgAAABkAAAABAIAAAoAAg0AAAAAAAAAAAAAAAACAAAAGAAAAOgDAAAEAgAACQIbbQkCKAAJAhwYYnRsLnRjcC41LjAAG0B/AAABAAAAAAAAAAAAAAAAAQAAAAgAAABkAAAABAMAAAoAAg0AAAAAAAAAAAAAAAACAAAAGAAAAOgDAAAEAwAACQIcGnBtbC5iYXNlLjIuMAAbBG9iMQA=","cbfunc":139657351440647,"cbdata":139657237064768}
0.080s: flux-shell[0]: TRACE: completed pmix exchange 0: size 310 SUCCESS
0.081s: flux-shell[0]: TRACE: starting pmix exchange 0: size 212
0.081s: flux-shell[0]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fW3Aw8j","rank":-2}],"info":[{"key":"pmix.collect","flags":0,"value":{"type":1,"data":false}}],"data":"CQIbBAkCAgA=","cbfunc":139657351440647,"cbdata":139657237064768}
0.082s: flux-shell[0]: TRACE: completed pmix exchange 1: size 0 SUCCESS
0.082s: flux-shell[0]: TRACE: starting pmix exchange 1: size 0
0.090s: flux-shell[0]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fW3Aw8j","rank":-2}],"info":[],"data":"CQIbBAkCAgA=","cbfunc":139657351440647,"cbdata":139657237064768}
0.090s: flux-shell[0]: TRACE: starting pmix exchange 2: size 0
0.090s: flux-shell[0]: TRACE: completed pmix exchange 2: size 0 SUCCESS
[system76-pc:3884805] mca: base: components_register: registering framework btl components
[system76-pc:3884805] mca: base: components_register: found loaded component self
[system76-pc:3884805] mca: base: components_register: component self register function successful
[system76-pc:3884805] mca: base: components_register: found loaded component sm
[system76-pc:3884805] mca: base: components_register: component sm register function successful
[system76-pc:3884805] mca: base: components_register: found loaded component tcp
[system76-pc:3884805] mca: base: components_register: component tcp register function successful
[system76-pc:3884805] mca: base: components_open: opening btl components
[system76-pc:3884805] mca: base: components_open: found loaded component self
[system76-pc:3884805] mca: base: components_open: component self open function successful
[system76-pc:3884805] mca: base: components_open: found loaded component sm
[system76-pc:3884805] mca: base: components_open: component sm open function successful
[system76-pc:3884805] mca: base: components_open: found loaded component tcp
[system76-pc:3884805] mca: base: components_open: component tcp open function successful
[system76-pc:3884805] select: initializing btl component self
[system76-pc:3884805] select: init of component self returned success
[system76-pc:3884805] select: initializing btl component sm
[system76-pc1][[10287,0],2][btl_sm_component.c:512:mca_btl_sm_component_init] No peers to communicate with. Disabling sm.
[system76-pc:3884805] select: init of component sm returned failure
[system76-pc:3884805] mca: base: close: component sm closed
[system76-pc:3884805] mca: base: close: unloading component sm
[system76-pc:3884805] select: initializing btl component tcp
[system76-pc:3884805] btl:tcp: 0x55f5e20c0840: if lo kidx 1 cnt 0 addr 127.0.0.1 IPv4 bw 100 lt 100
[system76-pc:3884805] btl:tcp: 0x55f5e20c1350: if enp4s0 kidx 2 cnt 0 addr 10.0.2.13 IPv4 bw 1000 lt 100
[system76-pc:3884805] btl:tcp: Attempting to bind to AF_INET port 1024
[system76-pc:3884805] btl:tcp: Successfully bound to AF_INET port 1024
[system76-pc:3884805] btl:tcp: my listening v4 socket is 0.0.0.0:1024
[system76-pc:3884805] btl: tcp: exchange: 0 1 IPv4 127.0.0.1
[system76-pc:3884805] btl: tcp: exchange: 1 2 IPv4 10.0.2.13
[system76-pc:3884805] select: init of component tcp returned success
[system76-pc:3884807] mca: base: components_register: registering framework btl components
[system76-pc:3884807] mca: base: components_register: found loaded component self
[system76-pc:3884807] mca: base: components_register: component self register function successful
[system76-pc:3884807] mca: base: components_register: found loaded component sm
[system76-pc:3884807] mca: base: components_register: component sm register function successful
[system76-pc:3884807] mca: base: components_register: found loaded component tcp
[system76-pc:3884807] mca: base: components_register: component tcp register function successful
[system76-pc:3884807] mca: base: components_open: opening btl components
[system76-pc:3884807] mca: base: components_open: found loaded component self
[system76-pc:3884807] mca: base: components_open: component self open function successful
[system76-pc:3884807] mca: base: components_open: found loaded component sm
[system76-pc:3884807] mca: base: components_open: component sm open function successful
[system76-pc:3884807] mca: base: components_open: found loaded component tcp
[system76-pc:3884807] mca: base: components_open: component tcp open function successful
[system76-pc:3884807] select: initializing btl component self
[system76-pc:3884806] mca: base: components_register: registering framework btl components
[system76-pc:3884807] select: init of component self returned success
[system76-pc:3884806] mca: base: components_register: found loaded component self
[system76-pc:3884807] select: initializing btl component sm
[system76-pc:3884806] mca: base: components_register: component self register function successful
[system76-pc:3884806] mca: base: components_register: found loaded component sm
[system76-pc:3884807] select: init of component sm returned success
[system76-pc:3884806] mca: base: components_register: component sm register function successful
[system76-pc:3884807] select: initializing btl component tcp
[system76-pc:3884806] mca: base: components_register: found loaded component tcp
[system76-pc:3884806] mca: base: components_register: component tcp register function successful
[system76-pc:3884807] btl:tcp: 0x5630284dc130: if lo kidx 1 cnt 0 addr 127.0.0.1 IPv4 bw 100 lt 100
[system76-pc:3884806] mca: base: components_open: opening btl components
[system76-pc:3884807] btl:tcp: 0x5630284dcc40: if enp4s0 kidx 2 cnt 0 addr 10.0.2.13 IPv4 bw 1000 lt 100
[system76-pc:3884806] mca: base: components_open: found loaded component self
[system76-pc:3884807] btl:tcp: Attempting to bind to AF_INET port 1024
[system76-pc:3884806] mca: base: components_open: component self open function successful
[system76-pc:3884807] btl:tcp: Attempting to bind to AF_INET port 1025
[system76-pc:3884806] mca: base: components_open: found loaded component sm
[system76-pc:3884807] btl:tcp: Attempting to bind to AF_INET port 1026
[system76-pc:3884806] mca: base: components_open: component sm open function successful
[system76-pc:3884807] btl:tcp: Successfully bound to AF_INET port 1026
[system76-pc:3884806] mca: base: components_open: found loaded component tcp
[system76-pc:3884807] btl:tcp: my listening v4 socket is 0.0.0.0:1026
[system76-pc:3884806] mca: base: components_open: component tcp open function successful
[system76-pc:3884807] btl: tcp: exchange: 0 1 IPv4 127.0.0.1
[system76-pc:3884806] select: initializing btl component self
[system76-pc:3884807] btl: tcp: exchange: 1 2 IPv4 10.0.2.13
[system76-pc:3884806] select: init of component self returned success
[system76-pc:3884807] select: init of component tcp returned success
[system76-pc:3884806] select: initializing btl component sm
[system76-pc:3884806] select: init of component sm returned success
[system76-pc:3884806] select: initializing btl component tcp
[system76-pc:3884806] btl:tcp: 0x56018b55c0f0: if lo kidx 1 cnt 0 addr 127.0.0.1 IPv4 bw 100 lt 100
[system76-pc:3884806] btl:tcp: 0x56018b55cc00: if enp4s0 kidx 2 cnt 0 addr 10.0.2.13 IPv4 bw 1000 lt 100
[system76-pc:3884806] btl:tcp: Attempting to bind to AF_INET port 1024
[system76-pc:3884806] btl:tcp: Attempting to bind to AF_INET port 1025
[system76-pc:3884806] btl:tcp: Attempting to bind to AF_INET port 1026
[system76-pc:3884806] btl:tcp: Attempting to bind to AF_INET port 1027
[system76-pc:3884806] btl:tcp: Successfully bound to AF_INET port 1027
[system76-pc:3884806] btl:tcp: my listening v4 socket is 0.0.0.0:1027
[system76-pc:3884806] btl: tcp: exchange: 0 1 IPv4 127.0.0.1
[system76-pc:3884806] btl: tcp: exchange: 1 2 IPv4 10.0.2.13
[system76-pc:3884806] select: init of component tcp returned success
[system76-pc:3884806] mca: bml: Using self btl for send to [[10287,0],0] on node system76-pc0
[system76-pc:3884807] mca: bml: Using self btl for send to [[10287,0],1] on node system76-pc0
[system76-pc:3884807] mca: bml: Using sm btl for send to [[10287,0],0] on node system76-pc0
[system76-pc:3884806] mca: bml: Using sm btl for send to [[10287,0],1] on node system76-pc0
[system76-pc:3884805] mca: bml: Using self btl for send to [[10287,0],2] on node system76-pc1
fW3Aw8j: rank 1 is pid 3884807
fW3Aw8j: completed MPI_Init in 0.020s.  There are 3 tasks
fW3Aw8j: rank 0 is pid 3884806
fW3Aw8j: rank 2 is pid 3884805
[system76-pc0][[10287,0],0][btl_tcp_proc.c:554:mca_btl_tcp_proc_lookup] adding tcp proc for peer {[[10287,0],2]}
[system76-pc:3884805] mca: bml: Using tcp btl for send to [[10287,0],0] on node system76-pc0
[system76-pc:3884805] mca: bml: Using tcp btl for send to [[10287,0],0] on node system76-pc0
[system76-pc:3884805] btl: tcp: attempting to connect() to [[10287,0],0] address 127.0.0.1 on port 1027
[system76-pc:3884805] btl:tcp: would block, so allowing background progress
[system76-pc:3884805] btl:tcp: connect() to 127.0.0.1:1027 completed (complete_connect), sending connect ACK
[system76-pc:3884806] btl:tcp: now connected to 127.0.0.1, process [[10287,0],2]
[system76-pc:3884806] mca: bml: Using tcp btl for send to [[10287,0],2] on node system76-pc1
[system76-pc:3884806] mca: bml: Using tcp btl for send to [[10287,0],2] on node system76-pc1
[system76-pc:3884806] btl: tcp: attempting to connect() to [[10287,0],2] address 10.0.2.13 on port 1024
[system76-pc:3884806] btl:tcp: would block, so allowing background progress
[system76-pc:3884806] btl:tcp: connect() to 10.0.2.13:1024 completed (complete_connect), sending connect ACK
fW3Aw8j: completed first barrier in 0.008s
[system76-pc:3884805] btl: tcp: Match incoming connection from [[10287,0],0] 10.0.2.13 with locally known IP 127.0.0.1 failed (iface 0/2)!
[system76-pc:3884805] btl:tcp: now connected to 10.0.2.13, process [[10287,0],0]
[system76-pc:3884806] CLOSE FD 14 at btl_tcp_endpoint.c:565
[system76-pc:3884806] CLOSE FD 13 at btl_tcp_endpoint.c:565
[system76-pc:3884807] mca: base: close: component self closed
[system76-pc:3884807] mca: base: close: unloading component self
[system76-pc:3884807] mca: base: close: component sm closed
[system76-pc:3884807] mca: base: close: unloading component sm
[system76-pc:3884806] mca: base: close: component self closed
[system76-pc:3884805] CLOSE FD 14 at btl_tcp_endpoint.c:565
[system76-pc:3884807] CLOSE FD 12 at btl_tcp_component.c:452
[system76-pc:3884806] mca: base: close: unloading component self
[system76-pc:3884805] CLOSE FD 13 at btl_tcp_endpoint.c:565
[system76-pc:3884807] mca: base: close: component tcp closed
[system76-pc:3884806] mca: base: close: component sm closed
[system76-pc:3884807] mca: base: close: unloading component tcp
[system76-pc:3884806] mca: base: close: unloading component sm
[system76-pc:3884806] CLOSE FD 12 at btl_tcp_component.c:452
[system76-pc:3884806] mca: base: close: component tcp closed
[system76-pc:3884805] mca: base: close: component self closed
[system76-pc:3884806] mca: base: close: unloading component tcp
[system76-pc:3884805] mca: base: close: unloading component self
[system76-pc:3884805] CLOSE FD 12 at btl_tcp_component.c:452
[system76-pc:3884805] mca: base: close: component tcp closed
[system76-pc:3884805] mca: base: close: unloading component tcp
fW3Aw8j: completed MPI_Finalize in 0.001s
0.090s: flux-shell[1]: TRACE: pmix server fence_upcall {"procs":[{"nspace":"fW3Aw8j","rank":-2}],"info":[],"data":"CQIbBAkCAgA=","cbfunc":140014027995399,"cbdata":140013853560944}
0.090s: flux-shell[1]: TRACE: starting pmix exchange 2: size 0
0.090s: flux-shell[1]: TRACE: completed pmix exchange 2: size 0 SUCCESS
0.091s: flux-shell[1]: TRACE: 2: C: pmi EOF
0.091s: flux-shell[1]: DEBUG: task 2 complete status=0
0.101s: flux-shell[1]: DEBUG: exit 0
0.091s: flux-shell[0]: TRACE: 0: C: pmi EOF
0.091s: flux-shell[0]: DEBUG: task 0 complete status=0
0.091s: flux-shell[0]: TRACE: 1: C: pmi EOF
0.091s: flux-shell[0]: DEBUG: task 1 complete status=0
0.103s: flux-shell[0]: DEBUG: exit 0
ok 7 - 2n3p ompi hello doesnt hang
garlick commented 3 years ago

`--env=OMPI_MCA_btl_tcp_if_include=lo` also works. I wonder if that should be set in the environment for all of the ompi tests here.

This doesn't seem to be pmix related unless I'm really missing something.