NVIDIA / enroot

A simple yet powerful tool to turn traditional container/OS images into unprivileged sandboxes.
Apache License 2.0

Enroot vs. NCTK/Docker for non-nvidia images #205

Closed by justin-yan 2 months ago

justin-yan commented 2 months ago

I have a few images built with a fairly small footprint: Ubuntu plus a Python toolchain. At runtime, I install torch, etc. inside the container and use the CUDA runtime that torch brings in as a dependency.

When I start this container using Docker on a host with nvidia-driver/NCTK installed and pass the --gpus flag, I can see the GPUs from inside: torch.cuda.is_available() returns True.

However, when I try to run this container with enroot/pyxis:

srun \
  --nodes=1 --gpus=2 \
  --container-workdir=$(pwd) \
  --container-mount-home \
  --container-image="myimage" \
  --pty bash -i

I get the following error:

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

I just wanted to ask whether I should be able to use base images this way and still use GPUs (does NCTK do something that enroot can't?), or whether I need to use NVIDIA-built base images. Or is there documentation I've missed about what a container needs in order to use GPUs with enroot?

Thanks!

3XX0 commented 2 months ago

You have to set NVIDIA_VISIBLE_DEVICES (in your srun env, enroot env file, container, etc). See https://github.com/NVIDIA/enroot/blob/master/doc/standard-hooks.md#98-nvidiash

justin-yan commented 2 months ago

Just wanted to confirm in case anyone else runs into this in the future, adding

export NVIDIA_VISIBLE_DEVICES='all'
export NVIDIA_DRIVER_CAPABILITIES='compute,utility'
srun ...
  --export='ENROOT_CONFIG_PATH,NVIDIA_VISIBLE_DEVICES,NVIDIA_DRIVER_CAPABILITIES'

Resolved my issue! I experimented a bit to figure out where the environment variables need to go for enroot/pyxis/Slurm, and it appears that --export is the critical part, not --container-env.
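
For anyone landing here later, a minimal sketch of a pre-flight check that could run in the submitting shell before srun, so a missing variable is caught before the job starts. It assumes a POSIX sh; `check_nvidia_env` is a hypothetical helper name for illustration and is not part of enroot, pyxis, or Slurm. It only checks that the two variables from the comment above are non-empty; it does not verify that the GPU hook actually ran.

```shell
#!/bin/sh
# Hypothetical helper (not part of enroot/pyxis): report which of the
# NVIDIA_* variables mentioned in this thread are unset or empty.
check_nvidia_env() {
    missing=""
    for var in NVIDIA_VISIBLE_DEVICES NVIDIA_DRIVER_CAPABILITIES; do
        # Indirect lookup via eval keeps this POSIX sh compatible.
        eval "val=\${$var:-}"
        if [ -z "$val" ]; then
            missing="$missing $var"
        fi
    done
    if [ -n "$missing" ]; then
        echo "unset:$missing"
        return 1
    fi
    echo "ok"
}

# Possible usage, following the workaround above:
#   export NVIDIA_VISIBLE_DEVICES='all'
#   export NVIDIA_DRIVER_CAPABILITIES='compute,utility'
#   check_nvidia_env && srun ... --export='ENROOT_CONFIG_PATH,NVIDIA_VISIBLE_DEVICES,NVIDIA_DRIVER_CAPABILITIES'
```

The check is kept separate from the srun call so it can be reused in batch scripts; the variable list is an assumption taken from this thread, and other setups may need more.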