NVIDIA / enroot

A simple yet powerful tool to turn traditional container/OS images into unprivileged sandboxes.
Apache License 2.0

is 50-slurm-pmi.sh still needed with pmix_v3 ? #136

Closed rvencu closed 2 years ago

rvencu commented 2 years ago

I have an installation where Slurm is compiled with OpenMPI and pmix_v3.

I noticed that if I add the optional 50-slurm-pmi.sh hook file, the containers break with an error saying they cannot find the scontrol command.

Mapping the Slurm folders into the container and adding bin and sbin to PATH does not help. Removing the file from hooks.d resolves things.
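
For context, the hook was enabled by copying it into enroot's sysconfig hook directory, roughly like this (a sketch; the source path assumes a packaged install that ships the optional hooks under /usr/share/enroot/hooks.d):

    # enable the optional Slurm PMI hook
    sudo cp /usr/share/enroot/hooks.d/50-slurm-pmi.sh /etc/enroot/hooks.d/
    # removing it again from /etc/enroot/hooks.d is what makes the containers start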

3XX0 commented 2 years ago

If you use the PMI hook, scontrol needs to be available on the compute nodes, in the PATH given to enroot (i.e. the one from slurmd/slurmstepd). You can check with something like srun sh -c 'command -v scontrol'
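
For example, a minimal sketch of that check (the node count is a placeholder for your allocation):

    # run one task per node and report where (or whether) scontrol is found
    srun -N2 sh -c 'echo "$(hostname): $(command -v scontrol || echo scontrol not found)"'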

rvencu commented 2 years ago

Trying to run that, it fails. scontrol is at /opt/slurm/bin/. Here is my batch script:

#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --job-name=nccl-tests
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-task=1
#SBATCH --exclusive
#SBATCH --comment=stability
#SBATCH --output=%x_%j.out
module load openmpi
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/nccl/build/lib:/opt/aws-ofi-nccl/lib:/opt/amazon/openmpi/lib
export PATH=$PATH:/opt/amazon/efa/bin:/opt/amazon/openmpi/bin:/opt/slurm/bin:/opt/slurm/sbin
export FI_EFA_FORK_SAFE=1
export FI_LOG_LEVEL=1
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4dn
export FI_EFA_ENABLE_SHM_TRANSFER=0
export FI_PROVIDER=efa
export FI_EFA_TX_MIN_CREDITS=64
export NCCL_DEBUG=warn
export NCCL_PROTO=simple
export NCCL_TREE_THRESHOLD=0
export OMPI_MCA_mtl_base_verbose=1
export OMPI_MCA_btl="^openib"
export OMPI_DIR=/opt/amazon/openmpi
export PMIX_MCA_gds=hash

srun --comment stability --container-image=public.ecr.aws\#w6p6i9i7/aws-efa-nccl-rdma:base-cudnn8-cuda11.3-ubuntu20.04 \
        --container-mounts=/opt/slurm:/opt/slurm/ --prolog /opt/slurm/sbin/prolog.sh /opt/nccl-tests/build/all_reduce_perf -b 128M -e 8G -f 2 -g 1 -c 1 -n 20

flx42 commented 2 years ago

Why do you have module load openmpi if you are running code from inside a container?

rvencu commented 2 years ago

Deleted that, I don't think it is relevant. It is still doing this:

pyxis: imported docker image: public.ecr.aws#w6p6i9i7/aws-efa-nccl-rdma:base-cudnn8-cuda11.3-ubuntu20.04
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     [ERROR] Command not found: scontrol
slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/50-slurm-pmi.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
3XX0 commented 2 years ago

You most likely need to add /opt/slurm/bin to the PATH of the slurmd systemd service (or other init system)
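
With systemd, one way to do that is a drop-in override for the slurmd unit; a minimal sketch, assuming the unit is named slurmd.service and the Slurm prefix is /opt/slurm:

    # open (or create) an override for the slurmd unit
    sudo systemctl edit slurmd
    # contents of the drop-in:
    #   [Service]
    #   Environment="PATH=/opt/slurm/bin:/opt/slurm/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"
    sudo systemctl restart slurmd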

rvencu commented 2 years ago

Solved it like this:

  1. check command:

    # pgrep slurmd | xargs -i grep -zanH PATH /proc/{}/environ
    /proc/13957/environ:2:PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
  2. fix command:

    pssh -h hostsfile -i "(echo 'PATH=/opt/slurm/sbin:/opt/slurm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin' | sudo tee -a /etc/sysconfig/slurmd) && sudo systemctl restart slurmd"
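
This works because the stock slurmd.service unit typically reads EnvironmentFile=-/etc/sysconfig/slurmd, so the appended PATH is picked up at the next restart (assuming the unit on these nodes is set up that way). A quick verification sketch after the restart:

    # slurmd's environment should now include /opt/slurm/bin
    pgrep slurmd | xargs -i grep -zanH PATH /proc/{}/environ
    # and the scontrol check from earlier should now succeed
    srun sh -c 'command -v scontrol'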