flux-framework / flux-pmix

flux shell plugin to bootstrap openmpi v5+
GNU Lesser General Public License v3.0

openmpi 4.1.2-2ubuntu1 fails with missing munge component #74

Open garlick opened 1 year ago

garlick commented 1 year ago

Problem: on Ubuntu 22.04.1 LTS, make check fails in flux-pmix when it is built against an external openpmix-4.2.2 (default configure options) and openmpi-4.1.2-2ubuntu1 is installed:

expecting success: 
    run_timeout 30 flux mini run -overbose=2 -N1 -n2 \
        ${MPI_HELLO} >hello_1n2p.out &&
    grep "There are 2 tasks" hello_1n2p.out

0.027s: flux-shell[0]: DEBUG: Loading /opt/flux-core-v0.46.1-54/etc/flux/shell/initrc.lua
0.027s: flux-shell[0]: TRACE: Sucessfully loaded flux.shell module
0.027s: flux-shell[0]: TRACE: trying to load /opt/flux-core-v0.46.1-54/etc/flux/shell/initrc.lua
0.027s: flux-shell[0]: TRACE: trying to load /opt/flux-core-v0.46.1-54/etc/flux/shell/lua.d/intel_mpi.lua
0.027s: flux-shell[0]: TRACE: trying to load /opt/flux-core-v0.46.1-54/etc/flux/shell/lua.d/mvapich.lua
0.028s: flux-shell[0]: TRACE: trying to load /opt/flux-core-v0.46.1-54/etc/flux/shell/lua.d/openmpi.lua
0.028s: flux-shell[0]: TRACE: trying to load /home/garlick/proj/flux-pmix/t/etc/rc.lua
0.029s: flux-shell[0]: DEBUG: output: batch timeout = 0.500s
0.030s: flux-shell[0]: DEBUG: pmix: jobid = 13690208256
0.030s: flux-shell[0]: DEBUG: pmix: shell_rank = 0
0.030s: flux-shell[0]: DEBUG: pmix: local_nprocs = 2
0.030s: flux-shell[0]: DEBUG: pmix: total_nprocs = 2
0.030s: flux-shell[0]: DEBUG: pmix: server outsourced to OpenPMIx 4.2.2rc2
0.052s: flux-shell[0]: DEBUG: pmix: local_peers = 0,1
0.052s: flux-shell[0]: DEBUG: pmix: node_map = system76-pc
0.052s: flux-shell[0]: DEBUG: pmix: proc_map = 0,1
0.052s: flux-shell[0]: DEBUG: 0: task_count=2 slot_count=2 cores_per_slot=1 slots_per_node=2
0.052s: flux-shell[0]: DEBUG: 0: tasks [0-1] on cores 0-1
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
PMIx stopped checking at the first component that it did not find.

Host:      system76-pc
Framework: psec
Component: munge
--------------------------------------------------------------------------

--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
PMIx stopped checking at the first component that it did not find.

Host:      system76-pc
Framework: psec
Component: munge
--------------------------------------------------------------------------

[system76-pc:159601] PMIX ERROR: PACK-MISMATCH in file ../../../src/client/pmix_client.c at line 832
[system76-pc:159601] OPAL ERROR: Pack data mismatch in file ext3x_client.c at line 112
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[system76-pc:159601] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
PMIx stopped checking at the first component that it did not find.

Host:      system76-pc
Framework: psec
Component: munge
--------------------------------------------------------------------------

--------------------------------------------------------------------------
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
PMIx stopped checking at the first component that it did not find.

Host:      system76-pc
Framework: psec
Component: munge
--------------------------------------------------------------------------

[system76-pc:159602] PMIX ERROR: PACK-MISMATCH in file ../../../src/client/pmix_client.c at line 832
[system76-pc:159602] OPAL ERROR: Pack data mismatch in file ext3x_client.c at line 112
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[system76-pc:159602] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
0.061s: flux-shell[0]: TRACE: pmi: 0: C: pmi EOF
0.061s: flux-shell[0]: DEBUG: task 0 complete status=1
0.061s: flux-shell[0]: TRACE: pmi: 1: C: pmi EOF
0.061s: flux-shell[0]: DEBUG: task 1 complete status=1
0.071s: flux-shell[0]: DEBUG: exit 1

Neither openmpi's built-in libpmix nor the side-installed openpmix-4.2.2 used to build flux-pmix has a psec_munge plugin installed as a separate DSO. However, rebuilding openpmix-4.2.2 with --without-munge does resolve the problem.
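For anyone wanting to repeat the DSO check on their own system, here is a minimal sketch. It assumes a plugin DSO, if present, would have psec_munge somewhere in its file name; that naming and the find_psec_munge helper are assumptions for illustration, not something taken from the PMIx sources.

```python
from pathlib import Path
import sys


def find_psec_munge(prefix):
    """Return sorted paths of files under prefix whose name contains 'psec_munge'.

    Assumption: a separately installed psec/munge plugin would carry
    'psec_munge' in its file name. Adjust the pattern if your build differs.
    """
    root = Path(prefix)
    if not root.is_dir():
        return []
    return sorted(str(p) for p in root.rglob("*psec_munge*") if p.is_file())


if __name__ == "__main__":
    # Pass one or more install prefixes to search on the command line.
    for prefix in sys.argv[1:]:
        hits = find_psec_munge(prefix)
        if hits:
            print(prefix + ": " + ", ".join(hits))
        else:
            print(prefix + ": no psec_munge file found")
```

An empty result only shows that no separate DSO exists under that prefix. It does not rule out munge support compiled directly into libpmix.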

Based on the pack error, it appears that the requirement for munge is not negotiated between client and server. Instead, it changes the wire protocol, so mismatched configurations cannot interoperate. See also https://bugs.schedmd.com/show_bug.cgi?id=12396
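To illustrate why an un-negotiated option would surface as a pack error, here is a toy model. It is not PMIx's actual wire format: the message layout, pack_connect, unpack_connect, and PackMismatch are all hypothetical. The point is only that when one side appends a credential field that the other side does not expect (or the reverse), the receiver's unpack fails instead of the two sides agreeing on a common format.

```python
import struct
from typing import Optional


class PackMismatch(Exception):
    """Raised when a buffer does not match the layout the receiver expects."""


def pack_connect(rank: int, cred: Optional[bytes]) -> bytes:
    # Toy layout: 4-byte rank, then, only for a credential-enabled build,
    # a 2-byte length followed by the credential bytes.
    msg = struct.pack("!I", rank)
    if cred is not None:
        msg += struct.pack("!H", len(cred)) + cred
    return msg


def unpack_connect(buf: bytes, expects_cred: bool) -> int:
    if len(buf) < 4:
        raise PackMismatch("message too short for rank")
    (rank,) = struct.unpack_from("!I", buf, 0)
    offset = 4
    if expects_cred:
        if len(buf) < offset + 2:
            raise PackMismatch("credential expected but message ended")
        (length,) = struct.unpack_from("!H", buf, offset)
        offset += 2 + length
        if len(buf) < offset:
            raise PackMismatch("truncated credential")
    if offset != len(buf):
        raise PackMismatch("unexpected trailing data")
    return rank


if __name__ == "__main__":
    # Matching configurations interoperate.
    print(unpack_connect(pack_connect(1, b"token"), expects_cred=True))
    # Mismatched configurations fail at unpack time.
    try:
        unpack_connect(pack_connect(1, b"token"), expects_cred=False)
    except PackMismatch as exc:
        print("mismatch:", exc)
```

A negotiated design would instead exchange a capability list first and fall back to a format both sides support.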