Open sandwichdoge opened 2 months ago
Yes, you should add the env var 'NVIDIA_VISIBLE_DEVICES=none' to this container. Please refer to this issue: https://github.com/Project-HAMi/HAMi/issues/464
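For example, something like this in the container spec (a minimal sketch; the pod name and image are placeholders):

apiVersion: v1
kind: Pod
metadata:
  name: no-gpu-pod                   # placeholder name
spec:
  containers:
    - name: app
      image: ubuntu:22.04            # placeholder image
      command: ["sleep", "infinity"]
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: "none"              # hide all GPUs from this container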
@archlitchi Thanks for the reply. If the pod user overrides this env var, will they still be able to see all the GPUs? I'm working in a low-trust environment where pod users should only be able to use their own allocated VRAM.
I'm aware there's an option to prevent the pod user from overriding env vars:
sudo vi /etc/nvidia-container-runtime/config.toml
# These lines are needed:
# accept GPU requests passed as volume mounts instead of the env var
accept-nvidia-visible-devices-as-volume-mounts = true
# ignore NVIDIA_VISIBLE_DEVICES set by unprivileged containers
accept-nvidia-visible-devices-envvar-when-unprivileged = false
However, enabling these lines causes pods with allocated GPUs to crash with the following error:

NAME                    READY   STATUS             RESTARTS     AGE
gpus-5bcbc4d55b-zkcsz   0/1     CrashLoopBackOff   1 (2s ago)   5s

kubectl -n 09e5313f-659a-499a-9085-e600df6ea705 logs -f gpus-5bcbc4d55b-zkcsz
Defaulted container "gpus" out of: gpus, init-chown-data (init)
tini: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory
Indeed, if you enable these lines, the device plugin will not work properly, because it needs to set 'NVIDIA_VISIBLE_DEVICES' in order to assign GPUs to pods. Also, a user can bake this env var directly into the image, which these settings cannot detect. The best practice is to add a mutating webhook configuration that injects 'NVIDIA_VISIBLE_DEVICES=none' into each container of each pod.
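For illustration, the registration side of such a webhook could look roughly like this (a sketch only; the names, namespace, service, and path are placeholders, and the webhook server behind the Service still has to return the actual patch):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: nvidia-visible-devices-injector          # placeholder name
webhooks:
  - name: env-injector.example.com               # placeholder
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail                          # reject pods if the webhook is down (safer for a low-trust cluster)
    clientConfig:
      service:
        name: env-injector                       # placeholder: Service in front of your webhook server
        namespace: kube-system
        path: /mutate
      caBundle: <base64-encoded CA bundle>       # CA that signed the webhook server's TLS certificate
    rules:
      - operations: ["CREATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]

The server would then respond with a JSONPatch such as {"op": "add", "path": "/spec/containers/0/env/-", "value": {"name": "NVIDIA_VISIBLE_DEVICES", "value": "none"}} for each container (creating the env array first if the container has none).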
Hello, I'd like to create a k8s deployment without GPUs, but my nvidia.com/gpu config doesn't work:
I confirmed that requesting 1 or more GPUs with a certain amount of VRAM works. But requesting "0" simply exposes all of the GPUs on that worker node to the running pod(s):
Here's my full deployment.yaml file:
Any pointers? Thank you.
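For reference, the kind of zero-GPU request described above looks roughly like this (an illustrative sketch only, not the actual manifest, assuming HAMi's default nvidia.com/gpu resource name):

spec:
  containers:
    - name: app                    # placeholder container name
      resources:
        limits:
          nvidia.com/gpu: "0"      # intent: allocate no GPUs to this container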