NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Container does not start on a small set of cluster nodes #122

Open lidavid88 opened 10 months ago

lidavid88 commented 10 months ago

I have run into a problem when running a container on a cluster.

I am using an NVIDIA PyTorch container created with enroot in the following submit script:

#!/usr/bin/env bash

#SBATCH --time=03:00:00
#SBATCH --gres=gpu:4
#SBATCH --nodes=8
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-node=4
#SBATCH --mem=501600mb

ml purge

run_num='0'

SRUN_PARAMS=(
  --mpi="pmi2"
  --gpus-per-task=1
  --gpu-bind="closest"
  --label
  --container-name=fcn
  --container-mounts=/etc/slurm/task_prolog.hk:/etc/slurm/task_prolog.hk,/scratch:/scratch,/hkfs/work/workspace/scratch/usr1234,/tmp,/usr/bin/srun:/usr/bin/srun
  --container-mount-home
  --container-writable
  --no-container-entrypoint
)

srun "${SRUN_PARAMS[@]}" bash -c "
  echo $run_num
"

On most nodes, srun is executed and 0 is printed to the log.

On the other nodes, however, I get one of two types of errors:

1.

22: slurmstepd: error: pyxis: container start failed with error code: 1
22: slurmstepd: error: pyxis: printing enroot log file:
22: slurmstepd: error: pyxis:     /etc/enroot/hooks.d/10-shadow.sh: line 70: 3474706 Broken pipe             yes 2> /dev/null
22: slurmstepd: error: pyxis:          3474707 Segmentation fault      (core dumped) | pwck -R "${ENROOT_ROOTFS}" "${pwddb#${ENROOT_ROOTFS}}" /etc/shadow > /dev/null 2>&1
22: slurmstepd: error: pyxis:     nvidia-container-cli: ldcache error: process /usr/sbin/ldconfig failed with error code: 1
22: slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
22: slurmstepd: error: pyxis: couldn't start container
22: slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
22: slurmstepd: error: Failed to invoke spank plugin stack

2.

21: slurmstepd: error: pyxis: couldn't start container
21: slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
21: slurmstepd: error: Failed to invoke spank plugin stack

This error does not appear if I use at most 4 nodes.

With 8 nodes the job sometimes works, but most of the time I get errors on some nodes.

My guess is that inter-node communication is having trouble with pyxis.

Can someone help me with that?

Regards

flx42 commented 10 months ago

If the issue happens 0% of the time on some nodes and 100% of the time on others, I suggest you start by investigating the differences between the good nodes and the bad nodes:

lidavid88 commented 10 months ago

They seem to have the same versions and drivers.
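
To check this beyond a visual comparison, a per-node fingerprint can be collected outside the container and grouped, so that any node that differs stands out. The following is only a minimal sketch: the helper names `node_fingerprint` and `group_by_fingerprint` are hypothetical, it assumes `nvidia-smi` and `enroot` are on the `PATH` of every compute node, and the driver and enroot versions are just two example properties, not a list taken from this thread.

```shell
#!/usr/bin/env bash
# Sketch: group nodes by a per-node "fingerprint" to spot outliers.
# Function names are hypothetical; the chosen fields are examples only.

# Print one line for the current node: "<hostname> <driver> <enroot version>".
# Meant to run once per node WITHOUT any --container-* flags, so the host
# itself is inspected rather than the container.
node_fingerprint() {
  local driver enroot_ver
  driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n1)
  enroot_ver=$(enroot version 2>/dev/null)
  printf '%s %s %s\n' "$(hostname)" "${driver:-unknown}" "${enroot_ver:-unknown}"
}

# Read "<hostname> <fingerprint...>" lines on stdin and print every distinct
# fingerprint followed by the hosts that share it.
group_by_fingerprint() {
  awk '{
    host = $1
    $1 = ""            # drop the hostname, keep the rest as the fingerprint
    sub(/^ /, "")
    if ($0 in hosts) hosts[$0] = hosts[$0] "," host
    else             hosts[$0] = host
  }
  END { for (fp in hosts) print fp ": " hosts[fp] }' | sort
}

# Possible use inside the job allocation (one task per node):
#   srun --nodes=8 --ntasks-per-node=1 \
#     bash -c "$(declare -f node_fingerprint); node_fingerprint" | group_by_fingerprint
```

If every node ends up in a single group, these particular properties are identical, and other per-node state (for example the container root filesystem or cache that enroot uses on that node) would be the next candidates to add to the fingerprint.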