You have to set NVIDIA_VISIBLE_DEVICES (in your srun env, enroot env file, container, etc.).
See https://github.com/NVIDIA/enroot/blob/master/doc/standard-hooks.md#98-nvidiash
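For a single job, a minimal sketch of setting both variables inline on the srun command line (the container image here is just a placeholder):

# Sketch, assuming pyxis' --container-image flag; with these variables set,
# the 98-nvidia.sh hook mounts the driver libraries and utilities into the container.
srun --export=ALL,NVIDIA_VISIBLE_DEVICES=all,NVIDIA_DRIVER_CAPABILITIES=compute,utility \
     --container-image=ubuntu:22.04 \
     nvidia-smi -L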
Just wanted to confirm in case anyone else runs into this in the future: adding
export NVIDIA_VISIBLE_DEVICES='all'
export NVIDIA_DRIVER_CAPABILITIES='compute,utility'
srun ... --export='ENROOT_CONFIG_PATH,NVIDIA_VISIBLE_DEVICES,NVIDIA_DRIVER_CAPABILITIES'
resolved my issue! I did a bit of experimenting to figure out where enroot/pyxis/Slurm needed the environment variables, and it appears that --export is what's critical, not --container-env.
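And a quick sanity check I'd run to confirm the hook fired (the image name is a placeholder for my own image):

# Sketch: list the GPUs and check that torch sees CUDA inside the container.
srun --export='ENROOT_CONFIG_PATH,NVIDIA_VISIBLE_DEVICES,NVIDIA_DRIVER_CAPABILITIES' \
     --container-image=<my-image> \
     bash -c 'nvidia-smi -L && python -c "import torch; print(torch.cuda.is_available())"'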
I have a few images built with a fairly small footprint: Ubuntu plus a Python toolchain. At runtime, I then install torch, etc. inside the container, and use the CUDA runtime that torch brings in as a dependency.
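For context, the runtime install inside the container is roughly this (versions are illustrative); the torch wheels pull in the CUDA runtime libraries as pip dependencies, but the host still has to provide the driver libraries:

# Sketch: install torch inside the already-running container.
pip install torch
# torch.version.cuda shows the bundled CUDA runtime version;
# torch.cuda.is_available() still depends on the host driver being visible.
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"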
When I start this container using Docker on a host with the NVIDIA driver and the NVIDIA Container Toolkit installed, and pass the --gpus flag, I'm able to see the GPUs from inside the container: torch.cuda.is_available() returns True.
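For comparison, the working Docker invocation looks roughly like this (the image name is a placeholder):

# Sketch: the NVIDIA Container Toolkit injects the driver libraries and
# device nodes when --gpus is passed.
docker run --rm --gpus all my-image \
    python -c "import torch; print(torch.cuda.is_available())"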
However, when I try to run this container with enroot/pyxis, I get the following error:
I just wanted to ask whether I should be able to use base images this way and still have access to the GPUs (does the NVIDIA Container Toolkit do something that enroot isn't able to?), or whether I need to use NVIDIA-built base images. Or is there documentation I've missed about what needs to be in a container in order to use GPUs with enroot?
Thanks!