NERSC / podman-hpc


support pmix, openmpi5 #97

Open lastephey opened 11 months ago

lastephey commented 11 months ago

Our recent work will help support pmi2/openmpi4 and older.

Of course we also need to support the newer PMIx and OpenMPI 5.

It looks like PMIx doesn't use file descriptors the way PMI2 did. Here's some diagnostic output:

stephey@nid001005:/mscratch/sd/s/stephey/openmpi> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi -v $(pwd):/work -w /work openmpi:test ./print.sh 
PMI_FD:

PMIX_HOSTNAME=nid001005
PMIX_SECURITY_MODE=munge,native
PMIX_GDS_MODULE=ds21,ds12,hash
PMIX_SYSTEM_TMPDIR=/tmp
PMIX_SERVER_URI41=pmix-server.713845;tcp4://127.0.0.1:37563
PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_NON_DESC
PMIX_DSTORE_21_BASE_PATH=/var/spool/slurmd/pmix.488141.1//pmix_dstor_ds21_713845
PMIX_VERSION=4.2.3
PMIX_MCA_psec=native
PMI_SHARED_SECRET=18069639051618979834
SLURM_PMIX_MAPPING_SERV=(vector,(0,1,2))
PMIX_RANK=0
PMIX_SERVER_URI2=pmix-server.713845;tcp4://127.0.0.1:37563
PMIX_SERVER_URI3=pmix-server.713845;tcp4://127.0.0.1:37563
PMIX_SERVER_URI4=pmix-server.713845;tcp4://127.0.0.1:37563
SLURM_PMIXP_ABORT_AGENT_PORT=32813
PMIX_SERVER_URI21=pmix-server.713845;tcp4://127.0.0.1:37563
PMIX_DSTORE_ESH_BASE_PATH=/var/spool/slurmd/pmix.488141.1//pmix_dstor_ds12_713845
PMIX_SERVER_TMPDIR=/var/spool/slurmd/pmix.488141.1/
PMIX_NAMESPACE=slurm.pmix.488141.1
contents of /proc/self/fd
0
1
2
255
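
(print.sh isn't included in this thread; presumably it is something like the following minimal sketch, which dumps PMI_FD, the PMI-related environment, and the open file descriptors.)

#!/bin/bash
# Hypothetical reconstruction of print.sh (the real script is not shown in
# this issue): print PMI_FD, any PMI/PMIx-related environment variables,
# and the open file descriptors of this process.
echo "PMI_FD:"
echo "$PMI_FD"
env | grep PMI
echo "contents of /proc/self/fd"
ls -1 /proc/self/fd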
lastephey commented 11 months ago

Well no, I take it back. It's set in shifter:

stephey@nid001005:/mscratch/sd/s/stephey/openmpi> srun -n 2 --mpi=pmix shifter --module=none --image=stephey/openmpi:test ./print.sh
PMI_FD:
11
PMI_FD=11
PMIX_HOSTNAME=nid001005
PMIX_SECURITY_MODE=munge,native
PMIX_GDS_MODULE=ds21,ds12,hash
PMIX_SYSTEM_TMPDIR=/tmp
PMIX_SERVER_URI41=pmix-server.727117;tcp4://127.0.0.1:47093
PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_NON_DESC
PMIX_DSTORE_21_BASE_PATH=/var/spool/slurmd/pmix.488143.5//pmix_dstor_ds21_727117
PMIX_VERSION=4.2.3
PMIX_MCA_psec=native
PMI_SHARED_SECRET=5384568399703742230
SLURM_PMIX_MAPPING_SERV=(vector,(0,1,2))
PMIX_RANK=0
PMIX_SERVER_URI2=pmix-server.727117;tcp4://127.0.0.1:47093
PMIX_SERVER_URI3=pmix-server.727117;tcp4://127.0.0.1:47093
PMIX_SERVER_URI4=pmix-server.727117;tcp4://127.0.0.1:47093
SLURM_PMIXP_ABORT_AGENT_PORT=37411
PMIX_SERVER_URI21=pmix-server.727117;tcp4://127.0.0.1:47093
PMIX_DSTORE_ESH_BASE_PATH=/var/spool/slurmd/pmix.488143.5//pmix_dstor_ds12_727117
PMIX_SERVER_TMPDIR=/var/spool/slurmd/pmix.488143.5/
PMIX_NAMESPACE=slurm.pmix.488143.5
contents of /proc/self/fd
0
1
11
2
255
lastephey commented 11 months ago

I think one difference is that PMIx doesn't set PMI_FD. If I set it myself to 11, I can see the duped FD as we expect:

stephey@nid001005:/mscratch/sd/s/stephey/openmpi> echo $PMI_FD
11
stephey@nid001005:/mscratch/sd/s/stephey/openmpi> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi -v $(pwd):/work -w /work openmpi:test ./print.sh
PMI_FD:
3
PMI_FD=3
PMIX_HOSTNAME=nid001005
PMIX_SECURITY_MODE=munge,native
PMIX_GDS_MODULE=ds21,ds12,hash
PMIX_SYSTEM_TMPDIR=/tmp
PMIX_SERVER_URI41=pmix-server.752313;tcp4://127.0.0.1:56337
PMIX_BFROP_BUFFER_TYPE=PMIX_BFROP_BUFFER_NON_DESC
PMIX_DSTORE_21_BASE_PATH=/var/spool/slurmd/pmix.488152.9//pmix_dstor_ds21_752313
PMIX_VERSION=4.2.3
PMIX_MCA_psec=native
PMI_SHARED_SECRET=2513919418766517941
SLURM_PMIX_MAPPING_SERV=(vector,(0,1,2))
PMIX_RANK=0
PMIX_SERVER_URI2=pmix-server.752313;tcp4://127.0.0.1:56337
PMIX_SERVER_URI3=pmix-server.752313;tcp4://127.0.0.1:56337
PMIX_SERVER_URI4=pmix-server.752313;tcp4://127.0.0.1:56337
SLURM_PMIXP_ABORT_AGENT_PORT=39687
PMIX_SERVER_URI21=pmix-server.752313;tcp4://127.0.0.1:56337
PMIX_DSTORE_ESH_BASE_PATH=/var/spool/slurmd/pmix.488152.9//pmix_dstor_ds12_752313
PMIX_SERVER_TMPDIR=/var/spool/slurmd/pmix.488152.9/
PMIX_NAMESPACE=slurm.pmix.488152.9
contents of /proc/self/fd
0
1
2
255
3
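
(This matches the idea that the runtime dups whatever descriptor PMI_FD names into the container and rewrites the variable; a minimal sketch of that general mechanism, not podman-hpc's actual implementation, would be:)

# Sketch only, NOT podman-hpc's actual code: if PMI_FD is set, duplicate
# that descriptor to the next free fd inside the container (3 here) and
# point PMI_FD at the copy before exec'ing the application. This is why
# PMI_FD=11 outside shows up as PMI_FD=3 with fd 3 open inside.
if [ -n "$PMI_FD" ]; then
    exec 3<&"$PMI_FD"
    export PMI_FD=3
fi
exec "$@"
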

But actually trying to run an MPI test hangs. We'll have to strace it.

stephey@nid001005:/mscratch/sd/s/stephey/openmpi> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi -v $(pwd):/work -w /work openmpi:test python3 -m mpi4py.bench helloworld
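
(For the record, one standard way to inspect a hang like this, assuming the stuck PIDs can be located from a second shell on the node, is along these lines; the PID placeholder is illustrative:)

# From a second shell on the node: find the stuck srun/podman processes
ps -u $USER -o pid,ppid,stat,cmd | grep -E 'srun|podman'
# Attach to one of them, following children, with timestamps on each syscall
strace -f -tt -o /tmp/pmix-hang.trace -p <pid>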
lastephey commented 11 months ago

I've crashed my node trying to strace this. Here are some of the messages that Chris saw while trying to debug:

muller:nid001036:/mscratch/sd/c/csamuel # gdb -batch -ex 'bt' /usr/sbin/slurmstepd ./core.*5 2>&1
warning: Can't open file /var/spool/slurmd/pmix.488158.7/pmix_dstor_ds12_49469/dstore_sm.lock during file-backed mapping note processing
warning: Can't open file /var/spool/slurmd/pmix.488158.7/pmix_dstor_ds12_49469/initial-pmix_shared-segment-0 during file-backed mapping note processing
warning: Can't open file /run/nscd/dbDwHYeo (deleted) during file-backed mapping note processing
warning: Can't open file /var/spool/slurmd/pmix.488158.7/pmix_dstor_ds21_49469/smlockseg-slurm.pmix.488158.7 during file-backed mapping note processing
warning: Can't open file /var/spool/slurmd/pmix.488158.7/pmix_dstor_ds21_49469/initial-pmix_shared-segment-0 during file-backed mapping note processing

and

2023-11-09 21:00:56 nersc-nodeepilog|488155|Detecting stray cgroups - checking for core file
2023-11-09 21:00:56 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.707718.nid001005
2023-11-09 21:00:57 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.709249.nid001005
2023-11-09 21:00:57 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.710537.nid001005
2023-11-09 21:00:57 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.715111.nid001005
2023-11-09 21:00:58 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.721345.nid001005
2023-11-09 21:00:58 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.728382.nid001005
2023-11-09 21:00:58 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.746324.nid001005
2023-11-09 21:00:58 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.753349.nid001005
2023-11-09 21:00:59 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.761527.nid001005
2023-11-09 21:00:59 nersc-nodeepilog|488155|Detecting stray cgroups - backtrace for core file /var/spool/slurmd/core.slurmstepd.769198.nid001005
2023-11-09 21:00:59 nersc-nodeepilog|488155|Detecting stray cgroups - done capturing info
lastephey commented 11 months ago

Update: with --userns=keep-id and shared-run=False I did finally get this to work!

stephey@nid001005:~> export PMIX_MCA_psec=native
stephey@nid001005:~> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi -v $(pwd):/work -w /work --userns=keep-id openmpi:test python3 -m mpi4py.bench helloworld
Hello, World! I am process 0 of 2 on nid001005.
Hello, World! I am process 1 of 2 on nid001005.
stephey@nid001005:~> 
muller:nid001005:/etc/podman_hpc/modules.d # cat openmpi.yaml 
name: openmpi
cli_arg: openmpi
help: Enable OpenMPI Support
env: ENABLE_OPENMPI
shared_run: False
additional_args:
  - -e SLURM_*
  - -e SLURMD_*
  - -e PALS_*
  - -e PMI_*
  - -e PMIX_*
  - --ipc=host
  - --network=host
  - --pid=host
  - --privileged
bind:
  - /dev/xpmem:/dev/xpmem
  - /dev/shm:/dev/shm
  - /dev/ss0:/dev/ss0
  - /dev/cxi*:/dev/
  - /var/spool/slurmd:/var/spool/slurmd
  - /run/munge:/run/munge
  - /run/nscd:/run/nscd
  - /etc/libibverbs.d:/etc/libibverbs.d
muller:nid001005:/etc/podman_hpc/modules.d # 
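
(Later comments use an --openmpi-pmix flag whose module file isn't shown here; presumably it is a variant of the openmpi.yaml above with --userns=keep-id baked in, along these lines. The name and env key below are guesses, not the module actually shipped.)

name: openmpi-pmix
cli_arg: openmpi-pmix
help: Enable OpenMPI Support (PMIx)
env: ENABLE_OPENMPI_PMIX
shared_run: False
additional_args:
  - -e SLURM_*
  - -e SLURMD_*
  - -e PALS_*
  - -e PMI_*
  - -e PMIX_*
  - --ipc=host
  - --network=host
  - --pid=host
  - --privileged
  - --userns=keep-id
bind:
  - /dev/xpmem:/dev/xpmem
  - /dev/shm:/dev/shm
  - /var/spool/slurmd:/var/spool/slurmd
  - /run/munge:/run/munge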
lastephey commented 11 months ago

Added some initial support in https://github.com/NERSC/podman-hpc/pull/96.

We'll still need to figure out what is going wrong with shared-run, so I'll leave this open.

With shared-run=True this is the error:

stephey@nid001005:~> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi -v $(pwd):/work -w /work openmpi:test python3 -m mpi4py.bench helloworld
[nid001005][[15063,0],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[nid001005][[15063,0],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[nid001005][[15063,0],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
[nid001005][[15063,0],0][btl_tcp_proc.c:400:mca_btl_tcp_proc_create] opal_modex_recv: failed with return value=-46
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
slurmstepd: error: *** STEP 488168.3 ON nid001005 CANCELLED AT 2023-11-09T23:13:46 ***
srun: error: nid001005: tasks 0-1: Killed
srun: Terminating StepId=488168.3
stephey@nid001005:~> 
lastephey commented 10 months ago

Testing with Howard Pritchard. He suggested mpirun:

stephey@nid001003:~> export PMIX_MCA_psec=native
stephey@nid001003:~> mpirun -np 2 podman-hpc run --rm --openmpi-pmix -v $(pwd):/work -w /work openmpi:test python3 -m mpi4py.bench helloworld
Hello, World! I am process 0 of 2 on nid001003.
Hello, World! I am process 1 of 2 on nid001003.
stephey@nid001003:~> 
lastephey commented 10 months ago
stephey@nid001003:~> export PMIX_MCA_pmix_server_base_verbose=100
stephey@nid001003:~> export PMIX_MCA_pmix_client_base_verbose=100
stephey@nid001003:~> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi-pmix -v $(pwd):/work -w /work openmpi:test python3 -m mpi4py.bench helloworld
[nid001003:949863] pmix: init called
[nid001003:949840] pmix: init called
[nid001003:949863] [slurm.pmix.491379.3:1] NO DEBUGGER WAITING

Attaching gdb to the hung srun process (found with ps and attached with gdb -p, as shown below) shows it blocked in read():

(gdb) bt
#0  0x00007febefbe771e in read () from target:/lib64/libpthread.so.0
#1  0x0000000000416ec5 in read (__nbytes=1, __buf=0x7fffa4e9b8cf, __fd=<optimized out>)
    at /usr/include/bits/unistd.h:44
#2  _shepherd_spawn (got_alloc=false, srun_job_list=0x0, job=0x4c9150) at srun_job.c:2325
#3  create_srun_job (p_job=p_job@entry=0x425ca8 <job>, got_alloc=got_alloc@entry=0x7fffa4e9b973, 
    slurm_started=slurm_started@entry=false, handle_signals=handle_signals@entry=true)
    at srun_job.c:1545
#4  0x0000000000411a8b in srun (ac=17, av=0x7fffa4e9bba8) at srun.c:195
#5  0x0000000000417469 in main (argc=<optimized out>, argv=<optimized out>) at srun.wrapper.c:17
(gdb) exit
A debugging session is active.
ps -l -u stephey 
gdb -p 949646

Compare to the case with --userns=keep-id, which works:

stephey@nid001003:~> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi-pmix -v $(pwd):/work -w /work openmpi:test python3 -m mpi4py.bench helloworld
[nid001003:949452] pmix: init called
[nid001003:949420] pmix: init called
[nid001003:949452] [slurm.pmix.491379.2:0] NO DEBUGGER WAITING
[nid001003:949452] pmix: executing put for key smsc.cma.5.0 type 27
[nid001003:949452] pmix: executing put for key btl.sm.5.0 type 27
[nid001003:949452] pmix: executing put for key btl.tcp.5.0 type 27
[nid001003:949452] pmix: executing put for key pml.base.2.0 type 27
[nid001003:949452] pmix:client wait_cbfunc received
[nid001003:949420] [slurm.pmix.491379.2:1] NO DEBUGGER WAITING
[nid001003:949420] pmix: executing put for key smsc.cma.5.0 type 27
[nid001003:949420] pmix: executing put for key btl.sm.5.0 type 27
[nid001003:949420] pmix: executing put for key btl.tcp.5.0 type 27
[nid001003:949420] pmix:client wait_cbfunc received
Hello, World! I am process 0 of 2 on nid001003.
Hello, World! I am process 1 of 2 on nid001003.
[nid001003:949452] slurm.pmix.491379.2:0 pmix:client finalize called
[nid001003:949452] slurm.pmix.491379.2:0 pmix:client sending finalize sync to server
[nid001003:949452] pmix:client finwait_cbfunc received
[nid001003:949452] slurm.pmix.491379.2:0 pmix:client finalize sync received
[nid001003:949420] slurm.pmix.491379.2:1 pmix:client finalize called
[nid001003:949420] slurm.pmix.491379.2:1 pmix:client sending finalize sync to server
[nid001003:949420] pmix:client finwait_cbfunc received
[nid001003:949420] slurm.pmix.491379.2:1 pmix:client finalize sync received
lastephey commented 10 months ago
stephey@nid001005:~> export PMIX_MCA_pmix_server_base_verbose=100
stephey@nid001005:~> export PMIX_MCA_pmix_client_base_verbose=100
stephey@nid001005:~> export PMIX_MCA_psec=native
stephey@nid001005:~> srun --mpi=pmix -n 1 podman-hpc run --rm --openmpi-pmix -v $(pwd):/work -w /work openmpi:test python3 -m mpi4py.bench helloworld
[nid001005:1089348] pmix: init called
[nid001005:1089348] [slurm.pmix.491415.0:0] NO DEBUGGER WAITING
[nid001005:1089348] pmix: executing put for key smsc.cma.5.0 type 27
[nid001005:1089348] pmix: executing put for key btl.tcp.5.0 type 27
[nid001005:1089348] pmix: executing put for key pml.base.2.0 type 27
[nid001005:1089348] pmix:client wait_cbfunc received
Hello, World! I am process 0 of 1 on nid001005.
[nid001005:1089348] slurm.pmix.491415.0:0 pmix:client finalize called
[nid001005:1089348] slurm.pmix.491415.0:0 pmix:client sending finalize sync to server
[nid001005:1089348] pmix:client finwait_cbfunc received
[nid001005:1089348] slurm.pmix.491415.0:0 pmix:client finalize sync received
stephey@nid001005:~> mpirun -np 1 podman-hpc run --rm --openmpi-pmix -v $(pwd):/work -w /work openmpi:test python3 -m mpi4py.bench helloworld
[nid001005:1090670] pmix:server register resources
[nid001005:1090670] pmix:server register client prterun-nid001005-1090670@1:0
[nid001005:1090670] pmix:server _register_client for nspace prterun-nid001005-1090670@1 rank 0 NON-NULL object
[nid001005:1090670] pmix:server _register_nspace prterun-nid001005-1090670@1
[nid001005:1090670] pmix:server setup_fork for nspace prterun-nid001005-1090670@1 rank 0
[nid001005:1090670] pmix:server deregister nspace prterun-nid001005-1090670@1
[nid001005:1090670] pmix:server _deregister_nspace prterun-nid001005-1090670@1
[nid001005:1090670] pmix:server deregister client prterun-nid001005-1090670@1:0
[nid001005:1090670] pmix:server _deregister_client for nspace prterun-nid001005-1090670@1 rank 0
stephey@nid001005:~> 
lastephey commented 10 months ago

Rebuild the image with OpenMPI 4.1.6 with both pmi2 and pmix support; make sure we can toggle between them.

Focus on srun, not mpirun.

OpenMPI 4 will ship with PMIx 3; we need to override with an external PMIx 4 if possible.

lastephey commented 10 months ago

I think we've achieved this with the following recipe:

FROM ubuntu:jammy
WORKDIR /opt

ENV DEBIAN_FRONTEND noninteractive

# Build tools plus the Slurm PMI2 and PMIx development packages
RUN apt-get update && apt-get install -y \
        build-essential \
        ca-certificates \
        automake \
        autoconf \
        wget \
        libpmi2-0-dev \
        libpmix-dev \
    --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

ARG openmpi_version=4.1.6

# Build Open MPI 4.1.6 against Slurm, its PMI2 library, and the system (external) PMIx
RUN wget https://download.open-mpi.org/release/open-mpi/v4.1/openmpi-$openmpi_version.tar.gz \
    && tar xf openmpi-$openmpi_version.tar.gz \
    && cd openmpi-$openmpi_version \
    && CFLAGS=-I/usr/include/slurm ./configure \
       --prefix=/opt/openmpi --with-slurm \
       --with-pmi=/usr/include/slurm --with-pmi-libdir=/usr/lib/x86_64-linux-gnu \
       --with-pmix=external --with-pmix=/usr/lib/x86_64-linux-gnu/pmix2 \
    && make -j 32 \
    && make install \
    && cd .. \
    && rm -rf openmpi-$openmpi_version.tar.gz openmpi-$openmpi_version

# Refresh the shared library cache
RUN /sbin/ldconfig

ENV PATH=/opt/openmpi/bin:$PATH

but I'll have to test.
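
(A quick sanity check on the resulting image, before involving Slurm at all, is to ask ompi_info which PMI/PMIx components got built in; the image tag here is just illustrative:)

# List the PMI/PMIx-related components compiled into the Open MPI build
podman-hpc run --rm openmpi:pmix ompi_info | grep -i -E 'pmix|pmi'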

lastephey commented 10 months ago

This recipe did produce a working image build.

Now testing with it: pmi2 appears to work, and pmix works with --userns=keep-id (with a user namespace warning) but fails without it.

pmi2:

stephey@nid200022:~> srun -n 2 --mpi=pmi2 podman-hpc run --rm --openmpi-pmi2 openmpi:pmix python3 -m mpi4py.bench helloworld
Hello, World! I am process 0 of 2 on nid200022.
Hello, World! I am process 1 of 2 on nid200022.
stephey@nid200022:~> 

pmix with --userns=keep-id works, but with a user namespace complaint:

stephey@nid200022:~> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi-pmix openmpi:pmix python3 -m mpi4py.bench helloworld
--------------------------------------------------------------------------
WARNING: The default btl_vader_single_copy_mechanism CMA is
not available due to different user namespaces.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

  Local host: nid200022
--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: The default btl_vader_single_copy_mechanism CMA is
not available due to different user namespaces.

The vader shared memory BTL will fall back on another single-copy
mechanism if one is available. This may result in lower performance.

  Local host: nid200022
--------------------------------------------------------------------------
Hello, World! I am process 0 of 2 on nid200022.
Hello, World! I am process 1 of 2 on nid200022.
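
(If the CMA warning matters, the single-copy mechanism can be pinned explicitly with the standard btl_vader_single_copy_mechanism MCA parameter and passed into the container; untested here, just a possible knob:)

# Pass the MCA parameter into the container to pick the mechanism explicitly
# (valid values include cma, xpmem, none); this silences the fallback warning.
srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi-pmix \
    -e OMPI_MCA_btl_vader_single_copy_mechanism=none \
    openmpi:pmix python3 -m mpi4py.bench helloworld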

pmix without --userns=keep-id fails:

stephey@nid200022:~> srun -n 2 --mpi=pmix podman-hpc run --rm --openmpi-pmix openmpi:pmix python3 -m mpi4py.bench helloworld
[nid200022:526535] OPAL ERROR: Unreachable in file ext3x_client.c at line 111
[nid200022:526561] OPAL ERROR: Unreachable in file ext3x_client.c at line 111
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:

  version 16.05 or later: you can use SLURM's PMIx support. This
  requires that you configure and build SLURM --with-pmix.

  Versions earlier than 16.05: you must use either SLURM's PMI-1 or
  PMI-2 support. SLURM builds PMI-1 by default, or you can manually
  install PMI-2. You must then build Open MPI using --with-pmi pointing
  to the SLURM PMI library location.

Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[nid200022:526561] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------