flux-framework / flux-pmix

flux shell plugin to bootstrap openmpi v5+

multiple test failures with recent flux-core (and ompi 4.0.x) #55

Status: closed by grondo 2 years ago

grondo commented 2 years ago

Under docker-run-checks --build-arg=OMPI_BRANCH=4.0.x (due to #54), several tests fail during make check:

SKIP: t2001-osu-benchmarks.t - skipping OSU micro benchmarks due to missing LONGTEST prereq
SKIP: t2002-mpibench.t - skipping mpibench due to missing LONGTEST prereq
PASS: t0007-dmodex.t 1 - 2n2p fetch a nonexistent key on another rank triggers direct_modex
PASS: t0000-sharness.t 1 - sourcing sharness succeeds
PASS: t0000-sharness.t 2 - success is reported like this
XFAIL: t0000-sharness.t 3 - pretend we have a known breakage # TODO known breakage
PASS: t0000-sharness.t 4 - pretend we have a fully passing test suite
PASS: t0000-sharness.t 5 - pretend we have a partially passing test suite
PASS: t0000-sharness.t 6 - pretend we have a known breakage
PASS: t0000-sharness.t 7 - pretend we have fixed a known breakage
PASS: t0000-sharness.t 8 - pretend we have fixed one of two known breakages (run in sub sharness)
PASS: t0000-sharness.t 9 - pretend we have a pass, fail, and known breakage
PASS: t0000-sharness.t 10 - pretend we have a mix of all possible results
PASS: t0000-sharness.t 11 - test runs if prerequisite is satisfied
SKIP: t0000-sharness.t 12 # SKIP unmet prerequisite causes test to be skipped (missing DONTHAVEIT)
PASS: t0000-sharness.t 13 - test runs if prerequisites are satisfied
SKIP: t0000-sharness.t 14 # SKIP unmet prerequisites causes test to be skipped (missing DONTHAVEIT of HAVEIT,DONTHAVEIT)
SKIP: t0000-sharness.t 15 # SKIP unmet prerequisites causes test to be skipped (missing DONTHAVEIT of DONTHAVEIT,HAVEIT)
PASS: t0000-sharness.t 16 - tests clean up after themselves
PASS: t0000-sharness.t 17 - tests clean up even on failures
PASS: t0000-sharness.t 18 - cleanup functions tun at the end of the test
PASS: t0000-sharness.t 19 - We detect broken && chains
PASS: t0000-sharness.t 20 - tests can be run from an alternate directory
PASS: t0000-sharness.t 21 - SHARNESS_ORIG_TERM propagated to sub-sharness
SKIP: t0000-sharness.t 22 # SKIP sub-sharness still has color (missing PERL_AND_TTY,COLOR of COLOR,PERL_AND_TTY)
PASS: t0000-sharness.t 23 - EXPENSIVE prereq not activated by default
PASS: t0000-sharness.t 24 - EXPENSIVE prereq is activated by --long
PASS: t0000-sharness.t 25 - loading sharness extensions works
PASS: t0000-sharness.t 26 - empty sharness.d directory does not cause failure
SKIP: t0000-sharness.t 27 # SKIP Interactive tests work (missing INTERACTIVE)
PASS: t0006-notify.t 1 - 1n1p event notify triggers warning on stderr
PASS: t0006-notify.t 2 - 1n1p event notify with message works
PASS: t0005-abort.t 1 - 1n1p abort on rank 0 works
PASS: t0005-abort.t 2 - stderr contains abort message and exit code
PASS: t0005-abort.t 3 - 1n2p abort on rank 1 works
PASS: t0005-abort.t 4 - 2n2p abort on rank 1 works
PASS: t0005-abort.t 5 - 1n1p abort on rank 0 works with no message
PASS: t0004-bizcard.t 1 - print pmix library version
FAIL: t0004-bizcard.t 2 - 1n2p bizcard exchange works
FAIL: t0004-bizcard.t 3 - 2n2p bizcard exchange works
FAIL: t0004-bizcard.t 4 - 2n3p bizcard exchange works
FAIL: t0004-bizcard.t 5 - 2n4p bizcard exchange works
ERROR: t0004-bizcard.t - exited with status 1
PASS: t1000-ompi-basic.t 1 - capture the job environment
PASS: t1000-ompi-basic.t 2 - verify deprecated flux pmix/schizo plugins are not requested
PASS: t1000-ompi-basic.t 3 - sanity check pmix environment
PASS: t1000-ompi-basic.t 4 - 1n2p ompi hello
PASS: t1000-ompi-basic.t 5 - 2n2p ompi hello
PASS: t1000-ompi-basic.t 6 - 2n3p ompi hello doesnt hang
PASS: t1000-ompi-basic.t 7 - 2n4p ompi hello reports no system call errors
PASS: t1000-ompi-basic.t 8 - 1n2p ompi pingpong works
PASS: t1000-ompi-basic.t 9 - 2n2p ompi pingpong works
PASS: t0002-basic.t 1 - print pmix library version
PASS: t0002-basic.t 2 - capture environment with plugin loaded
PASS: t0002-basic.t 3 - PMIX_SERVER_URI* variables all have the same value
PASS: t0002-basic.t 4 - server is listening on localhost
PASS: t0002-basic.t 5 - PMIX_SERVER_TMPDIR == FLUX_JOB_TMPDIR
PASS: t0002-basic.t 6 - 2n4p pmix.job.size is set correctly
PASS: t0002-basic.t 7 - 2n4p pmix.univ.size is set correctly
PASS: t0002-basic.t 8 - 2n3p pmix.local.size is set correctly
PASS: t0002-basic.t 9 - 2n4p pmix.local.size is set correctly
PASS: t0002-basic.t 10 - pmix.tmpdir is set
PASS: t0002-basic.t 11 - pmix.job.napps is set to 1
PASS: t0002-basic.t 12 - pmix.nsdir is NOT set
PASS: t0002-basic.t 13 - 2n4p pmix.hname is set
PASS: t0002-basic.t 14 - 2n3p pmix.lpeers is set correctly
PASS: t0002-basic.t 15 - 2n4p pmix.lpeers is set correctly
PASS: t0002-basic.t 16 - 2n4p pmix.nlist is set
PASS: t0002-basic.t 17 - 2n4p pmix.num.nodes is set correctly
PASS: t0002-basic.t 18 - 2n3p pmix.nodeid is set correctly
PASS: t0002-basic.t 19 - 2n4p pmix.nodeid is set correctly
PASS: t0002-basic.t 20 - 2n3p pmix.lrank is set correctly
PASS: t0002-basic.t 21 - 2n4p pmix.lrank is set correctly
FAIL: t0002-basic.t 22 - 2n3p pmix.srv.rank is set correctly
FAIL: t0002-basic.t 23 - 2n4p pmix.srv.rank is set correctly
PASS: t0002-basic.t 24 - 2n4p pmix.appnum is set correctly
PASS: t0002-basic.t 25 - 2n4p pmix.job.napps is set correctly
PASS: t0002-basic.t 26 - 2n3p pmix.nrank is set correctly
PASS: t0002-basic.t 27 - 2n4p pmix.nrank is set correctly
PASS: t0002-basic.t 28 - 1n1p pmix.tdir.rmclean is true
PASS: t0002-basic.t 29 - 2n4p pmix.max.size is set correctly
PASS: t0002-basic.t 30 - 2n4p pmix.jobid is set
ERROR: t0002-basic.t - exited with status 1
PASS: t0003-barrier.t 1 - print pmix library version
PASS: t0003-barrier.t 2 - 1n2p barrier works
PASS: t0003-barrier.t 3 - 1n2p barrier tolerates pmix.timeout=2
PASS: t0003-barrier.t 4 - 1n2p barrier tolerates pmix.collect=false
SKIP: t0003-barrier.t 5 # SKIP 1n2p barrier rejects required unknown option (missing XFAIL)
PASS: t0003-barrier.t 6 - 1n2p barrier tolerates pmix.collect.gen=false
FAIL: t0003-barrier.t 7 - 1n2p barrier with procs subset works
PASS: t0003-barrier.t 8 - 2n2p barrier works
PASS: t0003-barrier.t 9 - 2n2p barrier tolerates optional pmix.timeout=2
PASS: t0003-barrier.t 10 - 2n2p barrier tolerates optional pmix.collect=false
PASS: t0003-barrier.t 11 - 2n2p barrier tolerates optional pmix.collect.gen=false
SKIP: t0003-barrier.t 12 # SKIP 2n2p barrier rejects required pmix.collect.gen=false (missing XFAIL)
PASS: t0003-barrier.t 13 - 2n2p barrier tolerates optional unknown attr
SKIP: t0003-barrier.t 14 # SKIP 2n2p barrier with procs subset fails (missing XFAIL)
PASS: t0003-barrier.t 15 - 2n2p barrier with procs explictly set fails
PASS: t0003-barrier.t 16 - 2n3p barrier works
PASS: t0003-barrier.t 17 - 2n4p barrier works
ERROR: t0003-barrier.t - exited with status 1
============================================================================
Testsuite summary for flux-pmix 0.1.0-18-g2e42c39
============================================================================
# TOTAL: 101
# PASS:  80
# SKIP:  10
# XFAIL: 1
# FAIL:  7
# XPASS: 0
# ERROR: 3
============================================================================
See t/test-suite.log
============================================================================
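
With this much output the failures are easy to miss. A minimal sketch for pulling them out, assuming the automake-style `RESULT: test-name` line format shown above; the function name `failed_tests` is hypothetical and is not part of flux-pmix:

```shell
# failed_tests: read test harness output on stdin and print only the
# lines that report a failed test case or an errored test script.
# The pattern is anchored at line start, so "XFAIL:" lines and the
# "# FAIL:" summary line are not matched.
failed_tests() {
    grep -E '^(FAIL|ERROR):'
}

# Example (assumes the harness output was saved to a file first):
#   failed_tests <check-output.txt
```

Applied to the output above, this leaves the seven FAIL lines and the three ERROR lines, spread over t0004-bizcard.t, t0002-basic.t, and t0003-barrier.t.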

For example, in t0002-basic.t the pmix.srv.rank tests fail:

expecting success: 
    cat >2n3p.pmix.srv.rank.exp <<-EOT &&
    0: 0
    1: 0
    2: 1
    EOT
    run_timeout 30 flux mini run -N2 -n3 \
        ${GETKEY} --label-io pmix.srv.rank \
            | sort -n >2n3p.pmix.srv.rank.out &&
    test_cmp 2n3p.pmix.srv.rank.exp 2n3p.pmix.srv.rank.out

f3uhoGvs.0: PMIx_Get pmix.srv.rank: NOT-FOUND
f3uhoGvs.2: PMIx_Get pmix.srv.rank: NOT-FOUND
f3uhoGvs.1: PMIx_Get pmix.srv.rank: NOT-FOUND
flux-job: task(s) exited with exit code 1
--- 2n3p.pmix.srv.rank.exp  2022-01-22 19:38:00.909996713 +0000
+++ 2n3p.pmix.srv.rank.out  2022-01-22 19:38:00.909996713 +0000
@@ -1,3 +0,0 @@
-0: 0
-1: 0
-2: 1
not ok 22 - 2n3p pmix.srv.rank is set correctly
#   
#       cat >2n3p.pmix.srv.rank.exp <<-EOT &&
#       0: 0
#       1: 0
#       2: 1
#       EOT
#       run_timeout 30 flux mini run -N2 -n3 \
#           ${GETKEY} --label-io pmix.srv.rank \
#               | sort -n >2n3p.pmix.srv.rank.out &&
#       test_cmp 2n3p.pmix.srv.rank.exp 2n3p.pmix.srv.rank.out
#   

expecting success: 
    cat >pmix.srv.rank.exp <<-EOT &&
    0: 0
    1: 0
    2: 1
    3: 1
    EOT
    run_timeout 30 flux mini run -N2 -n4 \
        ${GETKEY} --label-io pmix.srv.rank \
            | sort -n >pmix.srv.rank.out &&
    test_cmp pmix.srv.rank.exp pmix.srv.rank.out

f4437EoH.0: PMIx_Get pmix.srv.rank: NOT-FOUND
f4437EoH.3: PMIx_Get pmix.srv.rank: NOT-FOUND
f4437EoH.2: PMIx_Get pmix.srv.rank: NOT-FOUND
f4437EoH.1: PMIx_Get pmix.srv.rank: NOT-FOUND
flux-job: task(s) exited with exit code 1
--- pmix.srv.rank.exp   2022-01-22 19:38:01.233999299 +0000
+++ pmix.srv.rank.out   2022-01-22 19:38:01.233999299 +0000
@@ -1,4 +0,0 @@
-0: 0
-1: 0
-2: 1
-3: 1
not ok 23 - 2n4p pmix.srv.rank is set correctly
#   
#       cat >pmix.srv.rank.exp <<-EOT &&
#       0: 0
#       1: 0
#       2: 1
#       3: 1
#       EOT
#       run_timeout 30 flux mini run -N2 -n4 \
#           ${GETKEY} --label-io pmix.srv.rank \
#               | sort -n >pmix.srv.rank.out &&
#       test_cmp pmix.srv.rank.exp pmix.srv.rank.out
#   

Here is the same test run by hand with verbose shell output, in case it helps:

grondo@asp:/usr/src/t$ flux mini run -o verbose=2 -N2 -n3 src/getkey --label-io pmix.srv.rank
0.030s: flux-shell[1]: DEBUG: 1: tasks [2] on cores 0-1
0.031s: flux-shell[1]: DEBUG: Loading /etc/flux/shell/initrc.lua
0.031s: flux-shell[1]: TRACE: Sucessfully loaded flux.shell module
0.031s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/initrc.lua
0.032s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/intel_mpi.lua
0.032s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/mvapich.lua
0.033s: flux-shell[1]: TRACE: trying to load /etc/flux/shell/lua.d/openmpi.lua
0.033s: flux-shell[1]: TRACE: trying to load /usr/src/t/etc/rc.lua
0.041s: flux-shell[1]: DEBUG: pmix: jobid = 183157991145472
0.041s: flux-shell[1]: DEBUG: pmix: shell_rank = 1
0.041s: flux-shell[1]: DEBUG: pmix: local_nprocs = 1
0.041s: flux-shell[1]: DEBUG: pmix: total_nprocs = 3
0.050s: flux-shell[1]: DEBUG: pmix: local_peers = 2
0.050s: flux-shell[1]: DEBUG: pmix: node_map = asp0,asp1
0.050s: flux-shell[1]: DEBUG: pmix: proc_map = 0,1;2
0.053s: flux-shell[1]: TRACE: shell barrier complete
0.053s: flux-shell[1]: TRACE: exited barrier with rc = 0
0.072s: flux-shell[1]: TRACE: shell barrier complete
0.072s: flux-shell[1]: TRACE: exited barrier with rc = 0
0.025s: flux-shell[0]: DEBUG: 0: task_count=3 slot_count=4 cores_per_slot=1 slots_per_node=2
0.025s: flux-shell[0]: DEBUG: 0: tasks [0-1] on cores 0-1
0.027s: flux-shell[0]: DEBUG: Loading /etc/flux/shell/initrc.lua
0.027s: flux-shell[0]: TRACE: Sucessfully loaded flux.shell module
0.027s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/initrc.lua
0.027s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/intel_mpi.lua
0.027s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/mvapich.lua
0.029s: flux-shell[0]: TRACE: trying to load /etc/flux/shell/lua.d/openmpi.lua
0.029s: flux-shell[0]: TRACE: trying to load /usr/src/t/etc/rc.lua
0.032s: flux-shell[0]: DEBUG: output: batch timeout = 0.500s
0.040s: flux-shell[0]: DEBUG: pmix: jobid = 183157991145472
0.040s: flux-shell[0]: DEBUG: pmix: shell_rank = 0
0.041s: flux-shell[0]: DEBUG: pmix: local_nprocs = 2
0.041s: flux-shell[0]: DEBUG: pmix: total_nprocs = 3
0.041s: flux-shell[0]: DEBUG: pmix: server outsourced to 3.1.5
0.049s: flux-shell[0]: DEBUG: pmix: local_peers = 0,1
0.049s: flux-shell[0]: DEBUG: pmix: node_map = asp0,asp1
0.049s: flux-shell[0]: DEBUG: pmix: proc_map = 0,1;2
0.052s: flux-shell[0]: TRACE: shell barrier complete
0.053s: flux-shell[0]: TRACE: exited barrier with rc = 0
0.072s: flux-shell[0]: TRACE: shell barrier complete
0.072s: flux-shell[0]: TRACE: exited barrier with rc = 0
ƒ2RxFVUyxs.0: PMIx_Get pmix.srv.rank: NOT-FOUND
ƒ2RxFVUyxs.2: PMIx_Get pmix.srv.rank: NOT-FOUND
ƒ2RxFVUyxs.1: PMIx_Get pmix.srv.rank: NOT-FOUND
0.077s: flux-shell[1]: TRACE: pmi: 2: C: pmi EOF
0.078s: flux-shell[1]: DEBUG: task 2 complete status=1
0.078s: flux-shell[0]: TRACE: pmi: 0: C: pmi EOF
0.078s: flux-shell[0]: DEBUG: task 0 complete status=1
0.082s: flux-shell[0]: TRACE: pmi: 1: C: pmi EOF
0.085s: flux-shell[0]: DEBUG: task 1 complete status=1
0.089s: flux-shell[1]: DEBUG: exit 1
0.091s: flux-shell[0]: DEBUG: exit 1
flux-job: task(s) exited with exit code 1
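
For reference, the values the test expects appear to follow directly from the `proc_map = 0,1;2` line in the debug output: each task's expected pmix.srv.rank matches the rank of the shell that hosts it. A rough sketch of that mapping, assuming `;` separates nodes and `,` separates the task ranks on one node, as in the log above. The function name `expected_srv_rank` is hypothetical, and range syntax such as `0-1` is not handled:

```shell
# expected_srv_rank PROC_MAP
# Print one "TASK_RANK: SHELL_RANK" line per task for a proc_map string
# such as "0,1;2". Node positions are counted from 0.
expected_srv_rank() {
    printf '%s\n' "$1" | awk -F';' '{
        for (node = 1; node <= NF; node++) {
            n = split($node, ranks, ",")
            for (i = 1; i <= n; i++)
                printf "%s: %d\n", ranks[i], node - 1
        }
    }'
}

# Example: expected_srv_rank '0,1;2'
```

For `0,1;2` this prints the same three lines as 2n3p.pmix.srv.rank.exp. The failing runs print NOT-FOUND for every rank, so the key is absent altogether rather than set to a wrong value.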

garlick commented 2 years ago

This is working for me now (focal image, ompi v4.0.x branch), after #56 was merged.