NVIDIA / enroot

A simple yet powerful tool to turn traditional container/OS images into unprivileged sandboxes.
Apache License 2.0

99-mellanox.sh script broken when there is no "/sys/class/infiniband_cm" #118

Open jasonguy opened 2 years ago

jasonguy commented 2 years ago

The script 99-mellanox.sh breaks on hosts with a newer linux-rdma package. It looks like this commit in the Linux kernel removed all references to the /sys/class/infiniband_cm/ directory, which is represented by the cm_class symbol.
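To illustrate the failure mode described above, here is a minimal sketch of the kind of guard that keeps a hook from failing when a sysfs path is absent. It is not the code from enroot or from the referenced fix: the helper name mount_entry_if_present and the mount options are hypothetical, and the sketch only prints a candidate fstab-style entry rather than performing any mount.

```shell
#!/bin/sh
# Hypothetical helper, not taken from enroot: print an fstab-style
# bind-mount entry for a path only if that path exists on the host.
mount_entry_if_present() {
    path="$1"
    if [ -e "$path" ]; then
        # The mount options below are illustrative placeholders.
        printf '%s %s none bind,ro\n' "$path" "$path"
    fi
    # Return success either way so a missing path does not abort the hook.
    return 0
}

# On hosts without /sys/class/infiniband_cm this prints nothing and succeeds.
mount_entry_if_present /sys/class/infiniband_cm/abi_version
```

The design choice is to treat a missing path as "nothing to mount" rather than as an error, so a kernel that no longer exposes the directory produces no entry at all.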

flx42 commented 2 years ago

Which version of enroot are you using? This should have been fixed by https://github.com/NVIDIA/enroot/commit/d80b3f6749bd796d05c136bb30788e7acd63afd6

jasonguy commented 2 years ago

Looks like the user (not me) was following “Running with Pyxis/Enroot” on the NVIDIA HPC-Benchmarks page. The command executed on the DGX was:

root@p-dgx-a100-009:/home/secure# srun -N 1 --ntasks-per-node=8 --cpu-bind=none --mpi=none --container-image="/home/secure/hpl-baai-21.4.sqsh" hpl.sh --xhpl-ai --config dgx-a100 --dat /workspace/hpl-ai-linux-x86_64/sample-dat/HPL-dgx-a100-1N.dat

flx42 commented 2 years ago

Right, but this doesn't tell me which enroot version it is. You need to run:

$ enroot version
3.3.0

jasonguy commented 2 years ago

Looks like the answer is in the error log...

slurmstepd: error: pyxis: enroot-mount: failed to mount: /sys/class/infiniband_cm/abi_version at /root/.local/share/enroot/pyxis_19.0/sys/class/infiniband_cm/abi_version: No such file or directory
slurmstepd: error: pyxis: [ERROR] /usr/local/etc/enroot/hooks.d/99-mellanox.sh exited with return code 1

flx42 commented 2 years ago

No, the 19.0 in pyxis_19.0 is just the Slurm job ID, not the enroot version.