Container does not start on a small set on cluster

lidavid88 commented 10 months ago

I have run into a problem when running a container on a cluster.

I am using an NVIDIA PyTorch container created with enroot, launched from the following submit script:

#!/usr/bin/env bash

#SBATCH --time=03:00:00
#SBATCH --gres=gpu:4
#SBATCH --nodes=8
#SBATCH --gpus-per-node=4
#SBATCH --ntasks-per-node=4
#SBATCH --mem=501600mb

ml purge

run_num='0'

SRUN_PARAMS=(
  --mpi="pmi2"
  --gpus-per-task=1
  --gpu-bind="closest"
  --label
  --container-name=fcn
  --container-mounts=/etc/slurm/task_prolog.hk:/etc/slurm/task_prolog.hk,/scratch:/scratch,/hkfs/work/workspace/scratch/usr1234,/tmp,/usr/bin/srun:/usr/bin/srun
  --container-mount-home
  --container-writable
  --no-container-entrypoint
)

srun "${SRUN_PARAMS[@]}" bash -c "
  echo $run_num
"

On most nodes, srun runs successfully and 0 is printed to the log.

On the other nodes, I get one of two types of errors:

1.

22: slurmstepd: error: pyxis: container start failed with error code: 1
22: slurmstepd: error: pyxis: printing enroot log file:
22: slurmstepd: error: pyxis:     /etc/enroot/hooks.d/10-shadow.sh: line 70: 3474706 Broken pipe             yes 2> /dev/null
22: slurmstepd: error: pyxis:          3474707 Segmentation fault      (core dumped) | pwck -R "${ENROOT_ROOTFS}" "${pwddb#${ENROOT_ROOTFS}}" /etc/shadow > /dev/null 2>&1
22: slurmstepd: error: pyxis:     nvidia-container-cli: ldcache error: process /usr/sbin/ldconfig failed with error code: 1
22: slurmstepd: error: pyxis:     [ERROR] /etc/enroot/hooks.d/98-nvidia.sh exited with return code 1
22: slurmstepd: error: pyxis: couldn't start container
22: slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
22: slurmstepd: error: Failed to invoke spank plugin stack

2.

21: slurmstepd: error: pyxis: couldn't start container
21: slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
21: slurmstepd: error: Failed to invoke spank plugin stack

These errors do not appear if I use 4 nodes or fewer.

With 8 nodes, the job works if I am lucky, but most of the time I get errors on some nodes.
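To see which tasks (and, roughly, which nodes) fail, the `--label` prefixes in the log can be collected. The following is only a minimal sketch: the function names `failed_tasks` and `task_to_node_index` are hypothetical helpers, and the rank-to-node mapping assumes block task distribution, so `SLURM_NODEID` or `hostname` inside the step remains the more reliable source.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read an `srun --label` log on stdin and print the
# distinct task ranks that reported a pyxis container start failure.
failed_tasks() {
  grep -E "^[0-9]+: slurmstepd: error: pyxis: couldn't start container" \
    | cut -d: -f1 \
    | sort -n -u
}

# Hypothetical helper: map a task rank to a node index.
# ASSUMPTION: block distribution, i.e. ranks 0..(N-1) on node 0, and so on.
# $1 = task rank, $2 = tasks per node
task_to_node_index() {
  echo $(( $1 / $2 ))
}
```

Under that assumption, and with 4 tasks per node, ranks 21 and 22 from the logs above would both map to node index 5, which would point at a single node rather than two.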

My guess is that pyxis is having trouble with the inter-node communication.

Can someone help me with that?

Regards

flx42 commented 10 months ago

If the issue happens 0% of the time on some nodes and 100% of the time on others, I suggest you start by investigating the differences between the good nodes and the bad nodes:

Is it the same distro, Linux version, NVIDIA driver version?
Is it the same enroot version? Perhaps try to reinstall enroot on the bad nodes.
Check dmesg and the slurmd log on the bad nodes for any clue.
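The version checks above could be gathered in one pass. This is a rough sketch, not a definitive procedure: `node_fingerprint` and `find_outliers` are hypothetical helper names, and it assumes `nvidia-smi` and `enroot` are on the `PATH` of every node.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print "hostname<TAB>distro | kernel | driver | enroot"
# for the node it runs on. Missing tools are reported as "unknown".
node_fingerprint() {
  local distro kernel driver enroot_ver
  distro=$( . /etc/os-release 2>/dev/null; printf '%s' "${PRETTY_NAME:-unknown}" )
  kernel=$(uname -r)
  driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null | head -n 1)
  enroot_ver=$(enroot version 2>/dev/null)
  printf '%s\t%s | %s | %s | %s\n' \
    "$(hostname)" "$distro" "$kernel" "${driver:-unknown}" "${enroot_ver:-unknown}"
}

# Hypothetical helper: read "host<TAB>fingerprint" lines on stdin and print
# the hosts whose fingerprint differs from the most common one.
find_outliers() {
  awk -F'\t' '
    { host[NR] = $1; fp[NR] = $2; count[$2]++ }
    END {
      best = ""; max = 0
      for (f in count) if (count[f] > max) { max = count[f]; best = f }
      for (i = 1; i <= NR; i++) if (fp[i] != best) print host[i] "\t" fp[i]
    }'
}
```

Inside an allocation, one task per node could run the fingerprint without a container, for example `srun --ntasks-per-node=1 bash -c "$(declare -f node_fingerprint); node_fingerprint" | find_outliers`. An empty result would mean all nodes report the same versions, in which case dmesg and the slurmd log are the next place to look.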

lidavid88 commented 10 months ago

They seem to have the same versions and drivers.

NVIDIA / pyxis

Container does not start on a small set on cluster #122