srun container (system PATH) issue?

xihajun commented 1 year ago

We have a Docker image, mlperf-nvidia:language_model, packaged as a .tar file.

When we run the image directly, everything works fine, e.g. python, export, cat, nvidia-smi.

However, when we run it via srun, nothing works as expected. We tried bind-mounting nvidia-smi into the container manually, but we still get an error:

srun --ntasks=1 --container-image=./out.squashfs --container-name=language_model --container-writable --container-workdir=/workspace/bert --container-mounts=./nvidia-smi:/usr/bin/nvidia-smi nvidia-smi

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system. Please also try adding directory that contains libnvidia-ml.so to your system PATH.

Any suggestions? (One thought is that docker export removed some necessary dependencies.)

flx42 commented 1 year ago

docker export probably removed the environment variables NVIDIA_DRIVER_CAPABILITIES and NVIDIA_VISIBLE_DEVICES.

Try NVIDIA_VISIBLE_DEVICES=all NVIDIA_DRIVER_CAPABILITIES=compute,utility srun ..., and remove the bind-mount of nvidia-smi as it will be handled by the enroot hook https://github.com/NVIDIA/enroot/blob/v3.4.0/conf/hooks/98-nvidia.sh once those environment variables are set.
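For reference, the suggestion above could look roughly like the following. This is only a sketch: it assumes the same image path and container flags as in the original report, simply drops the --container-mounts option, and uses a hypothetical wrapper function named run_in_container for readability.

```shell
# Re-supply the variables that docker export likely dropped, so the
# enroot NVIDIA hook can mount the driver libraries and nvidia-smi.
export NVIDIA_VISIBLE_DEVICES=all
export NVIDIA_DRIVER_CAPABILITIES=compute,utility

# Hypothetical helper: same flags as the original report,
# but without the manual bind-mount of nvidia-smi.
run_in_container() {
    srun --ntasks=1 \
        --container-image=./out.squashfs \
        --container-name=language_model \
        --container-writable \
        --container-workdir=/workspace/bert \
        "$@"
}

# Only launch when srun is available (a Slurm node with pyxis installed).
if command -v srun >/dev/null 2>&1; then
    run_in_container nvidia-smi
else
    echo "srun not found; run this on a Slurm node with pyxis" >&2
fi
```

Exporting the variables (rather than prefixing a single srun call) keeps them set for any later run_in_container invocations in the same shell.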

xihajun commented 1 year ago

Thank you so much!

NVIDIA / pyxis

srun container (system PATH) issue? #96