xihajun closed this issue 1 year ago
`docker export` probably removed the environment variables `NVIDIA_DRIVER_CAPABILITIES` and `NVIDIA_VISIBLE_DEVICES`. Try `NVIDIA_VISIBLE_DEVICES=all NVIDIA_DRIVER_CAPABILITIES=compute,utility srun ...`, and remove the bind-mount of `nvidia-smi`, as it will be handled by the enroot hook https://github.com/NVIDIA/enroot/blob/v3.4.0/conf/hooks/98-nvidia.sh once those environment variables are set.
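To make the suggestion above concrete, here is a minimal sketch that reuses the `srun` flags from the original post, sets the two variables, and drops the `--container-mounts` entry for `nvidia-smi`. The paths and container name are copied from the post; the fallback `echo` branch is only there so the snippet can be read or dry-run on a machine without Slurm, and is an addition of this sketch.

```shell
#!/bin/sh
# Set the variables that the enroot NVIDIA hook looks at.
# Values are the ones suggested in the reply above.
export NVIDIA_VISIBLE_DEVICES=all
export NVIDIA_DRIVER_CAPABILITIES=compute,utility

# Flags copied from the original command, minus the nvidia-smi bind-mount.
SRUN_ARGS="--ntasks=1 --container-image=./out.squashfs --container-name=language_model --container-writable --container-workdir=/workspace/bert"

if command -v srun >/dev/null 2>&1; then
  # SRUN_ARGS is left unquoted on purpose so it splits into separate flags.
  srun $SRUN_ARGS nvidia-smi
else
  # Dry run on machines without Slurm: only show what would be executed.
  echo "srun not found; would run: srun $SRUN_ARGS nvidia-smi"
fi
```

Exporting the variables in the submitting shell is equivalent to prefixing them on the `srun` line, assuming the job inherits the caller's environment.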
Thank you so much!
We have a docker image, `mlperf-nvidia:language_model`, zipped in a `.tar` file. When we use the image itself, all the scripts work fine, e.g. `python`, `export`, `cat`, `nvidia-smi`.
However, when we run it via `srun`, nothing works as expected. So we tried to manually bind-mount the `nvidia-smi` binary into the container, but we still get errors:
srun --ntasks=1 --container-image=./out.squashfs --container-name=language_model --container-writable --container-workdir=/workspace/bert --container-mounts=./nvidia-smi:/usr/bin/nvidia-smi nvidia-smi
Any suggestions? (One thought is that `docker export` removed some necessary dependencies.)
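One way to test that guess is to compare the environment recorded in the Docker image with the environment seen inside the `srun` container. `docker export` writes out a container's filesystem only, so image configuration such as `ENV` entries does not travel with the tarball. The sketch below is a rough check, not a definitive diagnosis: the helper name `nvidia_vars` is hypothetical, and it assumes the image actually defines `NVIDIA_*` variables.

```shell
#!/bin/sh
# Hypothetical helper: keep only NVIDIA_* entries from an env-style listing on stdin.
nvidia_vars() { grep '^NVIDIA_' || true; }

IMAGE=mlperf-nvidia:language_model   # image name from the post

if command -v docker >/dev/null 2>&1; then
  echo "== NVIDIA variables recorded in the Docker image config =="
  docker inspect --format '{{range .Config.Env}}{{println .}}{{end}}' "$IMAGE" | nvidia_vars
fi

if command -v srun >/dev/null 2>&1; then
  echo "== NVIDIA variables visible inside the srun container =="
  srun --ntasks=1 --container-image=./out.squashfs env | nvidia_vars
fi
```

If the first listing shows variables that the second one lacks, the export step dropped them and they need to be set again at launch time.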