NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

PMIx error with enroot/pyxis multi-node nvidia hpc-benchmark #88

Closed Concluant closed 1 year ago

Concluant commented 1 year ago

Hi team. I have a problem running the multi-node NVIDIA HPC benchmark on a cluster using pyxis/enroot.

cat /etc/os-release 
NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

I installed on all nodes: pmix-4.2.0, slurm-22.05.3 (with the pmix plugin), openmpi-4.1.3 (with-slurm, with-pmix), nvslurm-plugin-pyxis-0.12.0, and enroot-3.4.0-2. I use the container hpc-benchmarks:20.10-hpl.
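
For context, a rough sketch of configure flags that produce this combination (the install prefixes are assumptions based on the /usr/local listing below; --with-munge, --with-pmix and --with-slurm are the standard configure options):

# Slurm with the PMIx plugin, built against an external PMIx (prefix assumed)
./configure --prefix=/usr/local/slurm --with-munge --with-pmix=/usr/local/pmix
make -j && make install

# Open MPI 4.1.3 with Slurm and PMIx support (prefixes assumed)
./configure --prefix=/usr/local/openmpi4 --with-slurm --with-pmix=/usr/local/pmix
make -j && make install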

enroot.conf

ENROOT_CACHE_PATH          /admin_work/HPL/enroot-cache/$(id -u)
ENROOT_DATA_PATH           /admin_work/HPL/enroot/$(id -u)
[root@pgp0201 container_HPLGPU]# ll /etc/enroot/hooks.d/
total 40
-rwxr-xr-x 1 root root  900 Nov 13  2021 10-aptfix.sh
-rwxr-xr-x 1 root root 1237 Nov 13  2021 10-cgroups.sh
-rwxr-xr-x 1 root root 2465 Nov 13  2021 10-devices.sh
-rwxr-xr-x 1 root root 1193 Nov 13  2021 10-home.sh
-rwxr-xr-x 1 root root 3597 Nov 13  2021 10-shadow.sh
-rwxr-xr-x 1 root root 2720 Aug 24 14:55 50-slurm-pmi.sh
-rwxr-xr-x 1 root root 3207 Nov 13  2021 98-nvidia.sh
-rwxr-xr-x 1 root root 7998 Nov 13  2021 99-mellanox.sh

OpenMPI and PMIx install locations:

[root@pgp0201 container_HPLGPU]# ll /usr/local
total 0
.....
drwxr-xr-x  7 root root  67 Aug 24 10:11 openmpi4
drwxr-xr-x  7 root root  67 Aug 24 09:56 pmix
......

Environment variables used in the batch script (/admin_work is an NFS share):

CONT=/admin_work/HPL/nvidia+hpc-benchmarks+20.10-hpl.sqsh
MOUNT=/admin_work/HPL/:/home_pwd

If I start the container on a single node, it works normally:

srun -N 1 --nodelist=pgp0201  --export MELLANOX_VISIBLE_DEVICES="none" --cpu-bind=none --mpi=pmix --container-image="${CONT}" --container-mounts="${MOUNT}" --pty bash
or
srun -N 1 --nodelist=pgp0202  --export MELLANOX_VISIBLE_DEVICES="none" --cpu-bind=none --mpi=pmix --container-image="${CONT}" --container-mounts="${MOUNT}" --pty bash

If I run an mpirun batch inside the container on one (any) node, it also works:

srun -N 1 --nodelist=pgp0201 --export MELLANOX_VISIBLE_DEVICES="none" --ntasks-per-node=2 --cpu-bind=none --mpi=pmix --container-image="${CONT}" --container-mounts="${MOUNT}" mpirun -np 2 /workspace/hpl-linux-x86_64/hpl.sh --dat /home_pwd/HPL-2-p100-1N.dat --cpu-affinity 0:1 --cpu-cores-per-rank 10 --gpu-affinity 0:1

But I have a problem with the multi-node srun command when I use --mpi=pmix:

srun -N 2 --nodelist=pgp020[1-2] --export MELLANOX_VISIBLE_DEVICES="none" --ntasks-per-node=2 --cpu-bind=none --mpi=pmix --container-image="${CONT}" --container-mounts="${MOUNT}" mpirun --mca btl smcuda,self -x UCX_TLS=sm,cuda,cuda_copy,cuda_ipc -np 4 /workspace/hpl-linux-x86_64/hpl.sh --dat /home_pwd/HPL-4-p100-2N.dat --cpu-affinity 0:1 --cpu-cores-per-rank 10 --gpu-affinity 0:1

error:

A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
PMIX stopped checking at the first component that it did not find.

Host:      pgp0202
Framework: psec
Component: munge

It looks like pmix_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during pmix_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
PMIX developer):

  pmix_psec_base_open failed
  --> Returned value -46 instead of PMIX_SUCCESS

[pgp0202:12255] PMIX ERROR: NOT-FOUND in file server/pmix_server.c at line 229

A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
PMIX stopped checking at the first component that it did not find.

Host:      pgp0202
Framework: psec
Component: munge

It looks like pmix_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during pmix_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
PMIX developer):

  pmix_psec_base_open failed
  --> Returned value -46 instead of PMIX_SUCCESS

[pgp0202:12256] PMIX ERROR: NOT-FOUND in file server/pmix_server.c at line 229

A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
PMIX stopped checking at the first component that it did not find.

Host:      pgp0201
Framework: psec
Component: munge

It looks like pmix_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during pmix_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
PMIX developer):

  pmix_psec_base_open failed
  --> Returned value -46 instead of PMIX_SUCCESS

[pgp0201:12813] PMIX ERROR: NOT-FOUND in file server/pmix_server.c at line 229

A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
PMIX stopped checking at the first component that it did not find.

Host:      pgp0201
Framework: psec
Component: munge

It looks like pmix_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during pmix_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
PMIX developer):

  pmix_psec_base_open failed
  --> Returned value -46 instead of PMIX_SUCCESS

[pgp0201:12812] PMIX ERROR: NOT-FOUND in file server/pmix_server.c at line 229
srun: error: pgp0202: tasks 2-3: Exited with exit code 1
srun: error: pgp0201: tasks 0-1: Exited with exit code 1

If I use pmi2 instead:

srun -N 2 --nodelist=pgp020[1-2] --export MELLANOX_VISIBLE_DEVICES="none" --ntasks-per-node=2 --cpu-bind=none --mpi=pmi2 --container-image="${CONT}" --container-mounts="${MOUNT}" mpirun --mca btl smcuda,self -x UCX_TLS=sm,cuda,cuda_copy,cuda_ipc -np 4 /workspace/hpl-linux-x86_64/hpl.sh --dat /home_pwd/HPL-4-p100-2N.dat --cpu-affinity 0:1 --cpu-cores-per-rank 10 --gpu-affinity 0:1

error:

The SLURM process starter for OpenMPI was unable to locate a
usable "srun" command in its path. Please check your path
and try again.

An internal error has occurred in ORTE:

[[19480,0],0] FORCE-TERMINATE AT (null):1 - error plm_slurm_module.c(471)

This is something that should be reported to the developers.

The SLURM process starter for OpenMPI was unable to locate a
usable "srun" command in its path. Please check your path
and try again.

An internal error has occurred in ORTE:

[[19479,0],0] FORCE-TERMINATE AT (null):1 - error plm_slurm_module.c(471)

This is something that should be reported to the developers.

The SLURM process starter for OpenMPI was unable to locate a
usable "srun" command in its path. Please check your path
and try again.

An internal error has occurred in ORTE:

[[18697,0],0] FORCE-TERMINATE AT (null):1 - error plm_slurm_module.c(471)

This is something that should be reported to the developers.

The SLURM process starter for OpenMPI was unable to locate a
usable "srun" command in its path. Please check your path
and try again.

An internal error has occurred in ORTE:

[[18696,0],0] FORCE-TERMINATE AT (null):1 - error plm_slurm_module.c(471)

This is something that should be reported to the developers.

srun: error: pgp0202: tasks 2-3: Exited with exit code 1
srun: error: pgp0201: tasks 0-1: Exited with exit code 1

Can you help with this problem?

P.S. The pyxis installation wiki is missing an important step:

echo 'include /etc/slurm/plugstack.conf.d/*' > /etc/slurm/plugstack.conf
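
That include only takes effect if a matching drop-in file exists; a minimal sketch (the path to spank_pyxis.so is an assumption and depends on where the pyxis plugin was installed):

mkdir -p /etc/slurm/plugstack.conf.d
echo 'required /usr/local/lib/slurm/spank_pyxis.so' > /etc/slurm/plugstack.conf.d/pyxis.conf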

Concluant commented 1 year ago
  1. Add PMIx environment settings for slurmd in /etc/default/slurmd (read by slurmd.service):
    PMIX_MCA_ptl=^usock
    PMIX_MCA_psec=none
    PMIX_SYSTEM_TMPDIR=/var/empty
    PMIX_MCA_gds=hash
  2. Add an enroot hook with the additional environment (see the hook sketch after this list): UCX_TLS (because Ethernet and CUDA are used), host lib/bin/include paths (MKL, PMIx, OpenMPI), the ORTE launch agent MCA variable, and SLURM_SUBMIT_DIR.
    echo "OMPI_MCA_orte_launch_agent=enroot start ${ENROOT_ROOTFS##*/} orted" >> "${ENROOT_ENVIRON}"
    echo "PATH=/home_pwd:/workspace:/workspace/hpl-ai-linux-x86_64:/workspace/hpl-linux-x86_64:/opt/intel/oneapi/mkl/2022.1.0/bin/intel64:/usr/local/openmpi/bin:/usr/local/openmpi4/bin:/usr/local/pmix/bin:/usr/local/ucx/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/root/bin" >> "${ENROOT_ENVIRON}"
    echo "SLURM_SUBMIT_DIR=/usr/local/openmpi4/bin:/usr/local/openmpi/bin" >> "${ENROOT_ENVIRON}"
    echo "LD_LIBRARY_PATH=/opt/intel/compilers_and_libraries_2020.0.166/linux/mkl/lib/intel64_lin:/usr/local/pmix/lib:/usr/local/ucx/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/openmpi4/bin:/usr/local/openmpi/bin/:/opt/intel/oneapi/mkl/2022.1.0/lib/intel64" >> "${ENROOT_ENVIRON}"
    echo "UCX_TLS=tcp,cuda,cuda_copy,cuda_ipc" >> "${ENROOT_ENVIRON}"

  3. Add a gres.conf for the GPUs.
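
For reference, a sketch of how step 2 can be packaged as an enroot hook (the file name 60-openmpi-env.sh is hypothetical; ENROOT_ENVIRON and ENROOT_ROOTFS are provided to hooks by enroot):

#!/bin/bash
# Hypothetical /etc/enroot/hooks.d/60-openmpi-env.sh
set -eu

# Launch remote orted daemons inside the container image instead of on the bare host.
echo "OMPI_MCA_orte_launch_agent=enroot start ${ENROOT_ROOTFS##*/} orted" >> "${ENROOT_ENVIRON}"

# The PATH, LD_LIBRARY_PATH and SLURM_SUBMIT_DIR exports from step 2 go here unchanged,
# so the host-side OpenMPI/PMIx/MKL/CUDA binaries and libraries are visible in the container.

# Restrict UCX to TCP and CUDA transports (Ethernet fabric, no InfiniBand).
echo "UCX_TLS=tcp,cuda,cuda_copy,cuda_ipc" >> "${ENROOT_ENVIRON}"

A gres.conf entry on each GPU node might look like this (two GPUs per node and the device paths are assumptions):

Name=gpu File=/dev/nvidia[0-1]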

And it worked!

I get identical performance from an enroot batch (starting enroot and running Linpack by hand) and a pyxis batch (through Slurm and pyxis).

Now I get poor performance with multi-node Linpack runs. For example: 1 node = 7.8 TFLOPS, 4 nodes = 16.8 TFLOPS.

[root@pgp0201 enroot]# srun -N 4 --nodelist=pgp0201,pgp0202,pgp0207,pgp0208 --ntasks-per-node=2 --cpu-bind=none --mpi=pmix --container-image="${CONT}" --container-mounts="${MOUNT}" /workspace/hpl-linux-x86_64/hpl.sh --config /home_pwd/P100.sh --dat /home_pwd/HPL-4-p100-2N.dat
WARNING: could not determine rank
WARNING: could not determine rank
WARNING: could not determine rank
WARNING: could not determine rank
INFO: host=pgp0207 rank= lrank=0 cores=10 gpu=0 cpu=0 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=pgp0208 rank= lrank=0 cores=10 gpu=0 cpu=0 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=pgp0207 rank= lrank=1 cores=10 gpu=1 cpu=1 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=pgp0208 rank= lrank=1 cores=10 gpu=1 cpu=1 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
WARNING: could not determine rank
WARNING: could not determine rank
WARNING: could not determine rank
INFO: host=pgp0202 rank= lrank=1 cores=10 gpu=1 cpu=1 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=pgp0202 rank= lrank=0 cores=10 gpu=0 cpu=0 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
WARNING: could not determine rank
INFO: host=pgp0201 rank= lrank=1 cores=10 gpu=1 cpu=1 ucx= bin=/workspace/hpl-linux-x86_64/xhpl
INFO: host=pgp0201 rank= lrank=0 cores=10 gpu=0 cpu=0 ucx= bin=/workspace/hpl-linux-x86_64/xhpl

================================================================================
HPL-NVIDIA 1.0.0  -- NVIDIA accelerated HPL benchmark -- NVIDIA
================================================================================
HPLinpack 2.1  --  High-Performance Linpack benchmark  --   October 26, 2012
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR02C2C8      210000   896     2     4             369.92              1.669e+04 
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=        0.0025541 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.

I think it is an MPI/PMIx problem, because the network is not fully utilized (only about 4 Gbps per node).
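
(One way to confirm which transports UCX actually picks would be to list the available devices and transports on a compute node; this is only a sketch of the check, not output from my cluster:)

# list the UCX transports and devices available on a compute node
ucx_info -d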

flx42 commented 1 year ago

Yes, you should not use mpirun combined with PMIx, it will not work. Regarding the low performance, it could be a misconfiguration but it's unlikely to be related to PMIx or pyxis, so I'm going to close this issue.
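
In other words, with --mpi=pmix the ranks are spawned by srun itself, so the application should be launched directly, roughly like the working 4-node command earlier in this thread (a sketch reusing the variables from above):

srun -N 2 --ntasks-per-node=2 --cpu-bind=none --mpi=pmix \
     --container-image="${CONT}" --container-mounts="${MOUNT}" \
     /workspace/hpl-linux-x86_64/hpl.sh --config /home_pwd/P100.sh --dat /home_pwd/HPL-4-p100-2N.dat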