lastephey opened 11 months ago
Well no, I take it back. It's set in shifter:
stephey@nid001005:/mscratch/sd/s/stephey/openmpi> srun -n 2 --mpi=pmix shifter --module=none --image=stephey/openmpi:test ./print.sh
PMI_FD:
11
PMI_FD=11
PMIX_HOSTNAME=nid001005
PMIX_SECURITY_MODE=munge,native
PMIX_GDS_MODULE=ds21,ds12,hash
PMIX_SYSTEM_TMPDIR=/tmp
PMIX_SERVER_URI41=pmix-server.727117;tcp4://127.0.0.1:47093
PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_NON_DESC
PMIX_DSTORE_21_BASE_PATH=/var/spool/slurmd/pmix.488143.5//pmix_dstor_ds21_727117
PMIX_VERSION=4.2.3
PMIX_MCA_psec=native
PMI_SHARED_SECRET=5384568399703742230
SLURM_PMIX_MAPPING_SERV=(vector,(0,1,2))
PMIX_RANK=0
PMIX_SERVER_URI2=pmix-server.727117;tcp4://127.0.0.1:47093
PMIX_SERVER_URI3=pmix-server.727117;tcp4://127.0.0.1:47093
PMIX_SERVER_URI4=pmix-server.727117;tcp4://127.0.0.1:47093
SLURM_PMIXP_ABORT_AGENT_PORT=37411
PMIX_SERVER_URI21=pmix-server.727117;tcp4://127.0.0.1:47093
PMIX_DSTORE_ESH_BASE_PATH=/var/spool/slurmd/pmix.488143.5//pmix_dstor_ds12_727117
PMIX_SERVER_TMPDIR=/var/spool/slurmd/pmix.488143.5/
PMIX_NAMESPACE=slurm.pmix.488143.5
contents of /proc/self/fd
0
1
11
2
255
I think one difference is that PMIx doesn't set PMI_FD. If I set it myself to 11, I can see the duped fd as we expect:
stephey@nid001005:/mscratch/sd/s/stephey/openmpi> echo $PMI_FD
11
stephey@nid001005:/mscratch/sd/s/stephey/openmpi> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi -v $(pwd):/work -w /work openmpi:test ./print.sh
PMI_FD:
3
PMI_FD=3
PMIX_HOSTNAME=nid001005
PMIX_SECURITY_MODE=munge,native
PMIX_GDS_MODULE=ds21,ds12,hash
PMIX_SYSTEM_TMPDIR=/tmp
PMIX_SERVER_URI41=pmix-server.752313;tcp4://127.0.0.1:56337
PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_NON_DESC
PMIX_DSTORE_21_BASE_PATH=/var/spool/slurmd/pmix.488152.9//pmix_dstor_ds21_752313
PMIX_VERSION=4.2.3
PMIX_MCA_psec=native
PMI_SHARED_SECRET=2513919418766517941
SLURM_PMIX_MAPPING_SERV=(vector,(0,1,2))
PMIX_RANK=0
PMIX_SERVER_URI2=pmix-server.752313;tcp4://127.0.0.1:56337
PMIX_SERVER_URI3=pmix-server.752313;tcp4://127.0.0.1:56337
PMIX_SERVER_URI4=pmix-server.752313;tcp4://127.0.0.1:56337
SLURM_PMIXP_ABORT_AGENT_PORT=39687
PMIX_SERVER_URI21=pmix-server.752313;tcp4://127.0.0.1:56337
PMIX_DSTORE_ESH_BASE_PATH=/var/spool/slurmd/pmix.488152.9//pmix_dstor_ds12_752313
PMIX_SERVER_TMPDIR=/var/spool/slurmd/pmix.488152.9/
PMIX_NAMESPACE=slurm.pmix.488152.9
contents of /proc/self/fd
0
1
2
255
3
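For reference, the PMI2 wire-up relies on the client inheriting an already-open socket at the fd number named in PMI_FD. A minimal sketch of that inheritance mechanism (hypothetical: it uses a pipe instead of the real slurmstepd socket, and fd 11 is just the number seen above, not anything slurmstepd guarantees):

```python
import os
import subprocess
import sys

# Parent: pin the read end of a pipe at a known fd number, the way
# slurmstepd leaves its PMI2 socket open for the launched task.
TARGET_FD = 11  # hypothetical; slurmstepd picks its own number
r, w = os.pipe()
if r != TARGET_FD:
    os.dup2(r, TARGET_FD)  # dup2 leaves the new fd inheritable by default
    os.close(r)
else:
    os.set_inheritable(TARGET_FD, True)

# Child: a PMI2 client reads PMI_FD and uses that fd directly.
child = ("import os; fd = int(os.environ['PMI_FD']); "
         "print(fd, os.path.exists('/proc/self/fd/%d' % fd))")
result = subprocess.run(
    [sys.executable, "-c", child],
    env=dict(os.environ, PMI_FD=str(TARGET_FD)),
    close_fds=False,  # let the pinned fd survive into the child
    capture_output=True, text=True,
)
print(result.stdout.strip())  # fd number, and whether it survived exec
os.close(w)
os.close(TARGET_FD)
```

This is why the fd has to show up in the container's /proc/self/fd: if the container runtime closes or renumbers inherited fds, the PMI2 client can't find its socket.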
But actually trying to run an MPI test hangs. We'll have to strace it.
stephey@nid001005:/mscratch/sd/s/stephey/openmpi> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi -v $(pwd):/work -w /work openmpi:test python3 -m mpi4py.bench helloworld
I've crashed my node trying to strace this. Some of the messages that Chris saw while trying to debug:
muller:nid001036:/mscratch/sd/c/csamuel # gdb -batch -ex 'bt' /usr/sbin/slurmstepd ./core.*5 2>&1
warning: Can't open file /var/spool/slurmd/pmix.488158.7/pmix_dstor_ds12_49469/dstore_sm.lock during file-backed mapping note processing
warning: Can't open file /var/spool/slurmd/pmix.488158.7/pmix_dstor_ds12_49469/initial-pmix_shared-segment-0 during file-backed mapping note processing
warning: Can't open file /run/nscd/dbDwHYeo (deleted) during file-backed mapping note processing
warning: Can't open file /var/spool/slurmd/pmix.488158.7/pmix_dstor_ds21_49469/smlockseg-slurm.pmix.488158.7 during file-backed mapping note processing
warning: Can't open file /var/spool/slurmd/pmix.488158.7/pmix_dstor_ds21_49469/initial-pmix_shared-segment-0 during file-backed mapping note processing
and
2023-11-09 21:00:56 nersc-nodeepilog|488155|Detecting stray cgroups - checking for core file
2023-11-09 21:00:56 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.707718.nid001005
2023-11-09 21:00:57 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.709249.nid001005
2023-11-09 21:00:57 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.710537.nid001005
2023-11-09 21:00:57 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.715111.nid001005
2023-11-09 21:00:58 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.721345.nid001005
2023-11-09 21:00:58 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.728382.nid001005
2023-11-09 21:00:58 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.746324.nid001005
2023-11-09 21:00:58 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.753349.nid001005
2023-11-09 21:00:59 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.761527.nid001005
2023-11-09 21:00:59 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.769198.nid001005
2023-11-09 21:00:59 nersc-nodeepilog|488155|Detecting stray cgroups - done capturing info
Update: with --userns=keep-id and shared-run=False, I did finally get this to work!
stephey@nid001005:~> export PMIX_MCA_psec=native
stephey@nid001005:~> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi -v $(pwd):/work -w /work --userns=keep-id openmpi:test python3 -m mpi4py.bench helloworld
Hello, World! I am process 0 of 2 on nid001005.
Hello, World! I am process 1 of 2 on nid001005.
stephey@nid001005:~>
muller:nid001005:/etc/podman_hpc/modules.d # cat openmpi.yaml
name: openmpi
cli_arg: openmpi
help: Enable OpenMPI Support
env: ENABLE_OPENMPI
shared_run: False
additional_args:
- -e SLURM_*
- -e SLURMD_*
- -e PALS_*
- -e PMI_*
- -e PMIX_*
- --ipc=host
- --network=host
- --pid=host
- --privileged
bind:
- /dev/xpmem:/dev/xpmem
- /dev/shm:/dev/shm
- /dev/ss0:/dev/ss0
- /dev/cxi*:/dev/
- /var/spool/slurmd:/var/spool/slurmd
- /run/munge:/run/munge
- /run/nscd:/run/nscd
- /etc/libibverbs.d:/etc/libibverbs.d
muller:nid001005:/etc/podman_hpc/modules.d #
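The -e SLURM_* style entries in additional_args are glob patterns over the host environment rather than literal flags. A hypothetical sketch of how such entries could expand into concrete --env arguments (the function name and logic are illustrative, not podman-hpc's actual implementation):

```python
import fnmatch
import os

def expand_env_args(patterns, environ=None):
    """Expand '-e PREFIX_*' style module entries into concrete
    '--env NAME=VALUE' arguments; other flags pass through unchanged.
    (Illustrative logic only, not podman-hpc's real implementation.)"""
    environ = os.environ if environ is None else environ
    args = []
    for entry in patterns:
        if entry.startswith("-e "):
            pattern = entry[3:].strip()
            for name in sorted(environ):
                if fnmatch.fnmatch(name, pattern):
                    args += ["--env", f"{name}={environ[name]}"]
        else:
            args.append(entry)  # e.g. --ipc=host passes through as-is
    return args

fake_env = {"PMI_FD": "11", "PMIX_RANK": "0", "HOME": "/home/stephey"}
print(expand_env_args(["-e PMI_*", "-e PMIX_*", "--ipc=host"], fake_env))
```

Forwarding the PMI_*/PMIX_* variables this way is what lets the client inside the container find the server set up by slurmstepd on the host.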
Added some initial support in https://github.com/NERSC/podman-hpc/pull/96
Will still need to figure out what is going wrong with shared-run, so I'll leave this open.
With shared-run=True, this is the error:
stephey@nid001005:~> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi -v $(pwd):/work -w /work openmpi:test python3 -m mpi4py.bench helloworld
[nid001005][[15063,0],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[nid001005][[15063,0],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[nid001005][[15063,0],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[nid001005][[15063,0],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 488168.3 ON nid001005 CANCELLED AT 2023-11-09T23:13:46 ***
srun: error: nid001005: tasks 0-1: Killed
srun: Terminating StepId=488168.3
stephey@nid001005:~>
Testing with Howard Pritchard. He suggested mpirun:
stephey@nid001003:~> export PMIX_MCA_psec=native
stephey@nid001003:~> mpirun -np 2 podman-hpc run --rm --openmpi-pmix -v $(pwd):/work -w /work openmpi:test python3 -m mpi4py.bench helloworld
Hello, World! I am process 0 of 2 on nid001003.
Hello, World! I am process 1 of 2 on nid001003.
stephey@nid001003:~>
stephey@nid001003:~> export PMIX_MCA_pmix_server_base_verbose=100
stephey@nid001003:~> export PMIX_MCA_pmix_client_base_verbose=100
stephey@nid001003:~> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi-pmix -v $(pwd):/work -w /work openmpi:test python3 -m mpi4py.bench helloworld
[nid001003:949863] pmix: init called
[nid001003:949840] pmix: init called
[nid001003:949863] [slurm.pmix.491379.3:1] NO DEBUGGER WAITING
(gdb) bt
#0 0x00007febefbe771e in read () from target:/lib64/libpthread.so.0
#1 0x0000000000416ec5 in read (__nbytes=1, __buf=0x7fffa4e9b8cf, __fd=<optimized out>)
at /usr/include/bits/unistd.h:44
#2 _shepherd_spawn (got_alloc=false, srun_job_list=0x0, job=0x4c9150) at srun_job.c:2325
#3 create_srun_job (p_job=p_job@entry=0x425ca8 <job>, got_alloc=got_alloc@entry=0x7fffa4e9b973,
slurm_started=slurm_started@entry=false, handle_signals=handle_signals@entry=true)
at srun_job.c:1545
#4 0x0000000000411a8b in srun (ac=17, av=0x7fffa4e9bba8) at srun.c:195
#5 0x0000000000417469 in main (argc=<optimized out>, argv=<optimized out>) at srun.wrapper.c:17
(gdb) exit
A debugging session is active.
ps -l -u stephey
gdb -p 949646
Compare to the case with --userns=keep-id, which works:
stephey@nid001003:~> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi-pmix -v $(pwd):/work -w /work openmpi:test python3 -m mpi4py.bench helloworld
[nid001003:949452] pmix: init called
[nid001003:949420] pmix: init called
[nid001003:949452] [slurm.pmix.491379.2:0] NO DEBUGGER WAITING
[nid001003:949452] pmix: executing put for key smsc.cma.5.0 type 27
[nid001003:949452] pmix: executing put for key btl.sm.5.0 type 27
[nid001003:949452] pmix: executing put for key btl.tcp.5.0 type 27
[nid001003:949452] pmix: executing put for key pml.base.2.0 type 27
[nid001003:949452] pmix:client wait_cbfunc received
[nid001003:949420] [slurm.pmix.491379.2:1] NO DEBUGGER WAITING
[nid001003:949420] pmix: executing put for key smsc.cma.5.0 type 27
[nid001003:949420] pmix: executing put for key btl.sm.5.0 type 27
[nid001003:949420] pmix: executing put for key btl.tcp.5.0 type 27
[nid001003:949420] pmix:client wait_cbfunc received
Hello, World! I am process 0 of 2 on nid001003.
Hello, World! I am process 1 of 2 on nid001003.
[nid001003:949452] slurm.pmix.491379.2:0 pmix:client finalize called
[nid001003:949452] slurm.pmix.491379.2:0 pmix:client sending finalize sync to server
[nid001003:949452] pmix:client finwait_cbfunc received
[nid001003:949452] slurm.pmix.491379.2:0 pmix:client finalize sync received
[nid001003:949420] slurm.pmix.491379.2:1 pmix:client finalize called
[nid001003:949420] slurm.pmix.491379.2:1 pmix:client sending finalize sync to server
[nid001003:949420] pmix:client finwait_cbfunc received
[nid001003:949420] slurm.pmix.491379.2:1 pmix:client finalize sync received
stephey@nid001005:~> export PMIX_MCA_pmix_server_base_verbose=100
stephey@nid001005:~> export PMIX_MCA_pmix_client_base_verbose=100
stephey@nid001005:~> export PMIX_MCA_psec=native
stephey@nid001005:~> srun --mpi=pmix -n 1 podman-hpc run --rm --openmpi-pmix -v $(pwd):/work -w /work openmpi:test python3 -m mpi4py.bench helloworld
[nid001005:1089348] pmix: init called
[nid001005:1089348] [slurm.pmix.491415.0:0] NO DEBUGGER WAITING
[nid001005:1089348] pmix: executing put for key smsc.cma.5.0 type 27
[nid001005:1089348] pmix: executing put for key btl.tcp.5.0 type 27
[nid001005:1089348] pmix: executing put for key pml.base.2.0 type 27
[nid001005:1089348] pmix:client wait_cbfunc received
Hello, World! I am process 0 of 1 on nid001005.
[nid001005:1089348] slurm.pmix.491415.0:0 pmix:client finalize called
[nid001005:1089348] slurm.pmix.491415.0:0 pmix:client sending finalize sync to server
[nid001005:1089348] pmix:client finwait_cbfunc received
[nid001005:1089348] slurm.pmix.491415.0:0 pmix:client finalize sync received
stephey@nid001005:~> mpirun -np 1 podman-hpc run --rm --openmpi-pmix -v $(pwd):/work -w /work openmpi:test python3 -m mpi4py.bench helloworld
[nid001005:1090670] pmix:server register resources
[nid001005:1090670] pmix:server register client prterun-nid001005-1090670@1:0
[nid001005:1090670] pmix:server _register_client for nspace prterun-nid001005-1090670@1 rank 0 NON-NULL object
[nid001005:1090670] pmix:server _register_nspace prterun-nid001005-1090670@1
[nid001005:1090670] pmix:server setup_fork for nspace prterun-nid001005-1090670@1 rank 0
[nid001005:1090670] pmix:server deregister nspace prterun-nid001005-1090670@1
[nid001005:1090670] pmix:server _deregister_nspace prterun-nid001005-1090670@1
[nid001005:1090670] pmix:server deregister client prterun-nid001005-1090670@1:0
[nid001005:1090670] pmix:server _deregister_client for nspace prterun-nid001005-1090670@1 rank 0
stephey@nid001005:~>
Rebuild the image with 4.1.6 with both pmi2 and pmix, and make sure we can toggle between them.
Focus on srun, not mpirun.
OpenMPI 4 ships with PMIx 3, so we need to override it with an external PMIx 4 if possible.
I think we've achieved this with the following recipe:
FROM ubuntu:jammy
WORKDIR /opt
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y \
build-essential \
ca-certificates \
automake \
autoconf \
wget \
libpmi2-0-dev \
libpmix-dev \
--no-install-recommends \
&& rm -rf /var/lib/apt/lists/*
ARG openmpi_version=4.1.6
RUN wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-$openmpi_version.tar.gz \
&& tar xf openmpi-$openmpi_version.tar.gz \
&& cd openmpi-$openmpi_version \
&& CFLAGS=-I/usr/include/slurm ./configure \
--prefix=/opt/openmpi --with-slurm \
--with-pmi=/usr/include/slurm --with-pmi-libdir=/usr/lib/x86_64-linux-gnu \
--with-pmix=/usr/lib/x86_64-linux-gnu/pmix2 \
&& make -j 32 \
&& make install \
&& cd .. \
&& rm -rf openmpi-$openmpi_version.tar.gz openmpi-$openmpi_version
RUN /sbin/ldconfig
ENV PATH=/opt/openmpi/bin:$PATH
but I'll have to test.
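One way to check what actually got compiled in is ompi_info inside the image: in OpenMPI 4 the Slurm PMI-1/PMI-2 glue shows up as the s1/s2 components of the pmix framework, and external PMIx as ext3x. A hypothetical sketch that parses ompi_info-style lines (the sample text is illustrative; the real output of this image may differ):

```python
import re

# Illustrative `ompi_info | grep "MCA pmix"` lines; the real output of
# the image build may differ in versions and component set.
sample = """\
MCA pmix: ext3x (MCA v2.1.0, API v2.0.0, Component v4.1.6)
MCA pmix: s1 (MCA v2.1.0, API v2.0.0, Component v4.1.6)
MCA pmix: s2 (MCA v2.1.0, API v2.0.0, Component v4.1.6)
"""

def pmix_components(text):
    """Return the pmix-framework component names found in ompi_info output."""
    return re.findall(r"MCA pmix: (\w+)", text)

comps = pmix_components(sample)
# s2 -> Slurm PMI-2 glue, ext3x -> external PMIx 3.x client API
print("pmi2 toggle available:", "s2" in comps)
print("external pmix available:", "ext3x" in comps)
```

If s2 is missing, the --mpi=pmi2 path can't work; if ext3x is missing, --mpi=pmix can't.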
This recipe did produce a working image build.
Now testing with it: pmi2 appears to work, but pmix fails.
pmi2:
stephey@nid200022:~> srun -n 2 --mpi=pmi2 podman-hpc run --rm --openmpi-pmi2 openmpi:pmix python3 -m mpi4py.bench helloworld
Hello, World! I am process 0 of 2 on nid200022.
Hello, World! I am process 1 of 2 on nid200022.
stephey@nid200022:~>
pmix with --userns=keep-id works, but with a user-namespace complaint:
stephey@nid200022:~> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi-pmix openmpi:pmix python3 -m mpi4py.bench helloworld
--------------------------------------------------------------------------
WARNING: The default btl_vader_single_copy_mechanism CMA is
not available due to different user namespaces.
The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.
Local host: nid200022
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: The default btl_vader_single_copy_mechanism CMA is
not available due to different user namespaces.
The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.
Local host: nid200022
--------------------------------------------------------------------------
Hello, World! I am process 0 of 2 on nid200022.
Hello, World! I am process 1 of 2 on nid200022.
pmix without --userns=keep-id fails:
stephey@nid200022:~> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi-pmix openmpi:pmix python3 -m mpi4py.bench helloworld
[nid200022:526535] OPAL ERROR: Unreachable in file ext3x_client.c at line 111
[nid200022:526561] OPAL ERROR: Unreachable in file ext3x_client.c at line 111
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[nid200022:526561] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
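The "OPAL ERROR: Unreachable" case is consistent with the remapped user namespace losing access to the PMIx server's rendezvous files under the bind-mounted /var/spool/slurmd paths. A hypothetical diagnostic one could run inside the container (the function name and report values are made up for illustration):

```python
import os

def check_pmix_paths(environ=None):
    """Report whether the PMIx rendezvous/dstore paths named in the
    environment are visible and readable (diagnostic sketch only)."""
    environ = os.environ if environ is None else environ
    keys = ("PMIX_SERVER_TMPDIR",
            "PMIX_DSTORE_21_BASE_PATH",
            "PMIX_DSTORE_ESH_BASE_PATH")
    report = {}
    for key in keys:
        path = environ.get(key)
        if path is None:
            report[key] = "unset"
        elif not os.path.exists(path):
            report[key] = "missing"     # bind mount absent in container?
        elif not os.access(path, os.R_OK | os.X_OK):
            report[key] = "unreadable"  # uid remapped without keep-id?
        else:
            report[key] = "ok"
    return report

print(check_pmix_paths())
```

Run without --userns=keep-id, one would expect the dstore paths to come back unreadable (or missing), matching the gdb "Can't open file .../pmix_dstor_..." warnings seen earlier.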
Our recent work should support pmi2 with OpenMPI 4 and older. Of course we still need to support newer PMIx and OpenMPI 5.
It looks like PMIx doesn't use file descriptors the way PMI2 did. Here's some diagnostic output: