NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

slurmstepd: error: pyxis: [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1 #101

Closed infokng closed 1 year ago

infokng commented 1 year ago

Hi folks,

I am trying to run the command below:

cd /cm/shared/mxnet && sbatch --export=ALL,DATA_SRC_DIR="/mnt/Cosmo-Small",DATA_DST_DIR="/mnt/processed",NUM_PROC=32 -N4 -n32 init_datasets.sub
269          mlperf-hp+       defq       root         64    RUNNING      0:0 
269.batch         batch                  root         32    RUNNING      0:0 
269.0              bash                  root         64    RUNNING      0:0 

Slurm Job Running 
269          mlperf-hp+       defq       root         64    RUNNING      0:0 
269.batch         batch                  root         32    RUNNING      0:0 
269.0              bash                  root         64     FAILED      1:0 
269.1              bash                  root         64    RUNNING      0:0 

Below is the error trace:

[root@bright88 mxnet]# tail -f slurm-268.out

pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     /usr/bin/enroot: line 44: HOME: unbound variable
slurmstepd: error: pyxis:     [ERROR] Command not found: nvidia-container-cli, see https://github.com/NVIDIA/libnvidia-container
slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack

flx42 commented 1 year ago

Hi,

You need to follow the instructions here: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html

infokng commented 1 year ago

@flx42 My Slurm + pyxis + enroot setup was working fine until I reinstalled all of these components. My nvidia-docker image runs fine with sbatch when I add the flag --export=NVIDIA_VISIBLE_DEVICES=void

The link you shared covers installing nvidia-container-cli for Docker. Does pyxis, when it launches enroot containers, depend on nvidia-container-cli being set up on the host machine?

infokng commented 1 year ago

@flx42

[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no" -N 1 -G 4 --ntasks-per-node=4 --gpu-bind=none --gpus-per-task=1 --exclusive --mpi=pmix_v3 -w node001 --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area bash ./run_and_time.sh

pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     /usr/bin/enroot: line 44: HOME: unbound variable
slurmstepd: error: pyxis:     [ERROR] Command not found: nvidia-container-cli, see https://github.com/NVIDIA/libnvidia-container
slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: node001: tasks 0-3: Exited with exit code 1
[root@bright88 mxnet]# srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no,NVIDIA_VISIBLE_DEVICES=1" -N 1 -G 4 --ntasks-per-node=4 --gpu-bind=none --gpus-per-task=1 --exclusive --mpi=pmix_v3 -w node001 --container-image=192.168.61.4:5000#/cosmoflow-nvidia:0.4 --container-name=mlperf-hpc-cosmoflow --container-mounts=/mnt/lustre/processed:/data:ro,/mnt/lustre/results:/results,/tmp/:/staging_area bash ./run_and_time.sh

pyxis: imported docker image: 192.168.61.4:5000#/cosmoflow-nvidia:0.4
slurmstepd: error: pyxis: container start failed with error code: 1
slurmstepd: error: pyxis: printing enroot log file:
slurmstepd: error: pyxis:     /usr/bin/enroot: line 44: HOME: unbound variable
slurmstepd: error: pyxis:     [ERROR] Command not found: nvidia-container-cli, see https://github.com/NVIDIA/libnvidia-container
slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
slurmstepd: error: pyxis: couldn't start container
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: node001: tasks 0-3: Exited with exit code 1

flx42 commented 1 year ago

Yes, you need libnvidia-container installed. You can install the nvidia-container-toolkit package from the repository.
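
A quick way to confirm what is missing is to check PATH on the compute node itself. The following is only a minimal sketch: `missing_commands` is a hypothetical helper name, and it assumes the enroot hook locates `nvidia-container-cli` through PATH, as the "Command not found" message in the log suggests.

```shell
# Hypothetical helper: print each named command that cannot be found on PATH.
missing_commands() {
    for cmd in "$@"; do
        # command -v succeeds only when the command can be resolved
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# The two tools involved in this issue; no output means both were found.
missing_commands enroot nvidia-container-cli
```

Run it on the node that actually starts the container (node001 in the logs above) rather than on the login node, since the hook executes there.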

srun --export="NCCL_DEBUG=INFO,NCCL_IB_DISABLE=1,PMIX_MCA_gds=hash,UCX_IB_MLX5_DEVX=no"

This is different from docker -e: it unsets all environment variables not specified in the list. Try removing it as a first step.
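
To illustrate the point about --export, here is a minimal sketch that uses `env -i` to imitate an environment containing only the listed variables. It does not invoke Slurm at all; `run_with_only` and `PARENT_ONLY` are hypothetical names made up for the demonstration.

```shell
# Hypothetical helper: start a child shell whose environment contains ONLY the
# VAR=value pairs given as arguments, then report whether PARENT_ONLY survived.
run_with_only() {
    env -i "$@" /bin/sh -c \
        'if [ -n "${PARENT_ONLY+set}" ]; then echo present; else echo missing; fi'
}

# A variable that exists in the calling environment.
export PARENT_ONLY=1

# An explicit list that omits the variable: the child does not see it.
run_with_only NCCL_DEBUG=INFO

# Listing the variable explicitly passes it through.
run_with_only NCCL_DEBUG=INFO PARENT_ONLY="$PARENT_ONLY"
```

The first call prints `missing` and the second prints `present`. The same mechanism would likely explain the `HOME: unbound variable` line in the enroot log, since HOME is not in the srun list. The sbatch command at the top of this issue uses the `--export=ALL,NAME=value` form, which keeps the submitting environment and adds to it.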

flx42 commented 1 year ago

@infokng are you good now?