infokng closed this issue 1 year ago
Hi,
You need to follow the instructions here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
@flx42 My Slurm + pyxis + enroot setup was working fine until I reinstalled all of those components. My nvidia-docker image runs fine with sbatch when I add the flag --export=NVIDIA_VISIBLE_DEVICES=void.
The link you shared installs nvidia-container-cli alongside Docker. Does pyxis launching enroot containers depend on nvidia-container-cli being set up on the host machine?
@flx42
[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no" -N 1 -G 4 --ntasks-per-node=4 --gpu-bind=none --gpus-per-task=1 --exclusive --mpi=pmix_v3 -w node001 --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area bash ./run_and_time.sh
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: /usr/bin/enroot: line 44: HOME: unbound variable
slurmstepd: error: pyxis: [ERROR] Command not found: nvidia-container-cli, see https://github.com/NVIDIA/libnvidia-container
slurmstepd: error: pyxis: [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: node001: tasks 0-3: Exited with exit code 1
[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,NVIDIA_VISIBLE_DEVICES=1" -N 1 -G 4 --ntasks-per-node=4 --gpu-bind=none --gpus-per-task=1 --exclusive --mpi=pmix_v3 -w node001 --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area bash ./run_and_time.sh
pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis: /usr/bin/enroot: line 44: HOME: unbound variable
slurmstepd: error: pyxis: [ERROR] Command not found: nvidia-container-cli, see https://github.com/NVIDIA/libnvidia-container
slurmstepd: error: pyxis: [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: node001: tasks 0-3: Exited with exit code 1
Yes, you need libnvidia-container installed. You can install the package nvidia-container-toolkit from the repository.
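To confirm whether the install took effect, a quick check like the one below could be run on the compute node itself (node001 in the log above), since that is where the enroot hook /etc/enroot/hooks.d/98-nvidia.sh looks for nvidia-container-cli. This is only a sketch: the helper name have_cmd is made up for illustration, and it only tests that the binaries are on PATH, not that they work.

```shell
#!/usr/bin/env bash
# have_cmd is a hypothetical helper: succeeds if the named command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Tools taken from the error log: enroot itself and the CLI its NVIDIA hook calls.
for tool in enroot nvidia-container-cli; do
    if have_cmd "$tool"; then
        echo "$tool: found at $(command -v "$tool")"
    else
        echo "$tool: NOT FOUND"
    fi
done
```

If nvidia-container-cli shows up as NOT FOUND on the node, the 98-nvidia.sh hook would be expected to fail the same way as in the log.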
srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no"
This is different from docker -e: it unsets every environment variable that is not named in the list, which is likely why enroot reports "HOME: unbound variable". Try removing this flag as the first step.
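The effect can be reproduced locally without Slurm. The sketch below uses env -i as a stand-in for an --export list that names only specific variables (an assumption for illustration; it is not what Slurm does internally) and runs a child bash with set -u, which is how a script ends up printing "HOME: unbound variable". The /tmp fallback for HOME is a placeholder.

```shell
#!/usr/bin/env bash
# Child command: with "set -u", bash aborts when an unset variable is expanded.
child='set -u; echo "HOME is $HOME"'

# Case 1: only the listed variable is passed, so HOME is missing in the child.
stripped_output=$(env -i PATH="$PATH" NCCL_DEBUG=INFO bash -c "$child" 2>&1) \
    && stripped_status=0 || stripped_status=$?

# Case 2: HOME is passed along with the extra variable, so the child succeeds.
# /tmp is only a placeholder fallback in case HOME is unset in the caller.
kept_output=$(env -i PATH="$PATH" HOME="${HOME:-/tmp}" NCCL_DEBUG=INFO bash -c "$child" 2>&1) \
    && kept_status=0 || kept_status=$?

echo "stripped: status=$stripped_status output=$stripped_output"
echo "kept:     status=$kept_status output=$kept_output"
```

On the cluster, the srun man page documents an ALL keyword for this option, so something like --export=ALL,NCCL_DEBUG=INFO,... should keep the submitting environment and add the extra variables on top. Dropping --export entirely, as suggested above, is the simpler first test.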
@infokng are you good now?
Hi folks,
I am trying to run the command below.
Below is the error trace.