rvencu closed this issue 2 years ago.
If you use the PMI hook, you need scontrol to be available on the compute nodes. It needs to be available in the PATH given to enroot (i.e. the one from slurmd/slurmstepd). You can check with something like srun sh -c 'command -v scontrol'
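The reason command -v works as a probe: it resolves a name against the current PATH and exits non-zero when nothing is found, which mirrors what the enroot hook does when it looks for scontrol. A minimal sketch of that behavior (plain sh, no Slurm required; the paths are illustrative):

```shell
#!/bin/sh
# command -v prints the resolved path when the name is on PATH:
PATH=/usr/bin:/bin command -v sh
# ...and fails when it is not, just like the failing hook:
PATH=/nonexistent command -v scontrol || echo "scontrol not found on this PATH"
```

Running it through srun simply performs this probe in the same environment the enroot hooks inherit.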
I tried running that and it fails; scontrol is at /opt/slurm/bin/. Here is my job script:
#!/bin/bash
#SBATCH --partition=gpu
#SBATCH --job-name=nccl-tests
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-task=1
#SBATCH --exclusive
#SBATCH --comment=stability
#SBATCH --output=%x_%j.out
module load openmpi
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/nccl/build/lib:/opt/aws-ofi-nccl/lib:/opt/amazon/openmpi/lib
export PATH=$PATH:/opt/amazon/efa/bin:/opt/amazon/openmpi/bin:/opt/slurm/bin:/opt/slurm/sbin
export FI_EFA_FORK_SAFE=1
export FI_LOG_LEVEL=1
export FI_EFA_USE_DEVICE_RDMA=1 # use for p4dn
export FI_EFA_ENABLE_SHM_TRANSFER=0
export FI_PROVIDER=efa
export FI_EFA_TX_MIN_CREDITS=64
export NCCL_DEBUG=warn
export NCCL_PROTO=simple
export NCCL_TREE_THRESHOLD=0
export OMPI_MCA_mtl_base_verbose=1
export OMPI_MCA_btl="^openib"
export OMPI_DIR=/opt/amazon/openmpi
export PMIX_MCA_gds=hash
srun --comment stability --container-image=public.ecr.aws\#w6p6i9i7/aws-efa-nccl-rdma:base-cudnn8-cuda11.3-ubuntu20.04 \
--container-mounts=/opt/slurm:/opt/slurm/ --prolog /opt/slurm/sbin/prolog.sh /opt/nccl-tests/build/all_reduce_perf -b 128M -e 8G -f 2 -g 1 -c 1 -n 20
Why do you have module load openmpi if you are running code from inside a container?
I deleted that; I think it is not relevant. It still does this:
pyxis: imported docker image: public.ecr.aws#w6p6i9i7/aws-efa-nccl-rdma:base-cudnn8-cuda11.3-ubuntu20.04
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: [ERROR] Command not found: scontrol
slurmstepd: error: pyxis: [ERROR] /etc/enroot/hooks.d/50-slurm-pmi.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
You most likely need to add /opt/slurm/bin to the PATH of the slurmd systemd service (or other init system).
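One way to do that (a sketch, assuming slurmd runs under systemd; the drop-in file name is illustrative) is a drop-in override that prepends the Slurm directories to slurmd's PATH:

```shell
# Create a systemd drop-in that adds the Slurm dirs to slurmd's PATH.
sudo mkdir -p /etc/systemd/system/slurmd.service.d
sudo tee /etc/systemd/system/slurmd.service.d/50-path.conf <<'EOF'
[Service]
Environment="PATH=/opt/slurm/bin:/opt/slurm/sbin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin"
EOF
# Reload units and restart slurmd so the enroot hooks see the new PATH.
sudo systemctl daemon-reload
sudo systemctl restart slurmd
```

Any equivalent mechanism works (an EnvironmentFile the unit already reads, for example), as long as the PATH that slurmd exports to its job steps contains scontrol.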
Solved it like this.
Check command:
# pgrep slurmd | xargs -i grep -zanH PATH /proc/{}/environ
/proc/13957/environ:2:PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
Fix command:
pssh -h hostsfile -i "(echo 'PATH=/opt/slurm/sbin:/opt/slurm/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin' | sudo tee -a /etc/sysconfig/slurmd) && sudo systemctl restart slurmd"
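For reference, the check command works by reading each slurmd process's environment out of /proc, where entries are NUL-separated. A self-contained demo of the same trick (Linux-only; MARKER is a made-up variable used only for illustration):

```shell
#!/bin/sh
# Launch a child process with a marker variable in its environment.
MARKER=hello sleep 3 &
pid=$!
sleep 1   # give the shell time to exec the child
# Entries in /proc/<pid>/environ are NUL-separated; tr makes them line-based.
tr '\0' '\n' < /proc/$pid/environ | grep '^MARKER='
kill $pid 2>/dev/null
```

After restarting slurmd, re-running the check should show the new PATH line including /opt/slurm/bin.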
I have an installation where Slurm is compiled with OpenMPI and pmix_v3. I noticed that if I add the optional 50-slurm-pmi.sh hook file, the containers break with an error saying they cannot find the scontrol command. Mapping the Slurm folders and adding bin and sbin to PATH does not help; removing the file from hooks.d resolves things.