NVIDIA / libnvidia-container

NVIDIA container runtime library
Apache License 2.0

[ERROR]nvidia-container-cli: detection error: path error: ///usr/local/nvidia/lib64/libvdpau_nvidia.so: permission denied" #178

Open nickkchenn opened 2 years ago

nickkchenn commented 2 years ago

[Background]

I installed nvidia-container-toolkit-1.8.1-1.x86_64 and nvidia-container-runtime-3.8.1-1.noarch on a GPU node. I installed my NVIDIA driver as user id 1000 on the node and set the default Docker runtime to the nvidia runtime.
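For reference, setting the default Docker runtime to nvidia was done in /etc/docker/daemon.json, roughly like this (a sketch; the runtime binary path may differ depending on how the packages were installed):

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```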

I noticed that the container process runs as the default root user id 0, so I set the user to 1000:1000 in the nvidia-container-runtime config:

After this, I tried to call nvidia-container-cli from the command line, and it worked well:

[screenshot]

I expected to deploy my pod with the nvidia runtime successfully, but I still got this error while creating the container:

[screenshot]

Error: failed to start container "gputest-model": Error response from daemon: OCI runtime create failed: container_linux.go:330: starting container process caused "process_linux.go:381: container init caused "process_linux.go:364: running prestart hook 0:/usr/bin/nvidia-container-runtime-hook,/usr/bin/nvidia-container-runtime-hook,prestart, ContainerID: gputest-model caused "error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: detection error: path error: ///usr/local/nvidia/lib64/libvdpau_nvidia.so: permission denied\n""": unknown

[screenshot]

I think this is an issue with file owners/groups; I'm confused about why it only fails while creating the container.

[More details] I used strace to get a more detailed trace:

    strace -o output.log -F -f -T -tt -e trace=all nvidia-container-cli --user=1000:1000 info

I found that it failed when reading symlinks, and I noticed there are many geteuid calls with different results during the process:

[screenshot]

Although I have set the user to 1000:1000 through the config, it still gets id 0 a lot; I wonder if that is the correct behavior.

Is it not OK to use nvidia-container-toolkit with a user other than root?

nickkchenn commented 2 years ago

I am using user id 1000 in my container, so if I set my NVIDIA driver owner to root, I will have no access to the driver files inside the container.

And I am also not allowed to set the permissions of driver files on the host to anything more permissive than 750, due to some security requirements.

So I am trying to use the GPU in my container with user id 1000, while my driver files have permission 750 and are owned by user 1000.

I thought that since nvidia-container-cli lets me set user=1000:1000, it should work.

klueska commented 2 years ago

Hi @nickkchenn

I'm still trying to understand a bit what your setup is and what your expectations are.

Initial questions:

  1. What do you mean when you say you "installed your nvidia driver using a user id 1000". Did you use a .run file to install the driver? What flags did you pass it?
  2. From one of your screenshots it looks like you are running under kubernetes. Have you tried running with standalone docker or containerd? Do you have the same issue?
  3. What is your exact expectation in setting user to 1000:1000 in /etc/nvidia-container-runtime/config.toml? The binary nvidia-container-cli will always be run as whatever user the parent process that invokes it is run as (i.e. containerd in your case most likely). The user field in /etc/nvidia-container-runtime/config.toml only indicates which user/group a child process of the nvidia-container-cli will have that makes calls out through NVML.
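For reference, that user field sits in the [nvidia-container-cli] section of /etc/nvidia-container-runtime/config.toml and is commented out by default; a minimal sketch of what setting it looks like (exact defaults may vary between toolkit versions):

```toml
[nvidia-container-cli]
# commented out by default, e.g. #user = "root:video"
user = "1000:1000"
```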

You also mention something about:

I am using user id 1000 in my container, so if I set my NVIDIA driver owner to root, I will have no access to the driver files inside the container.

This seems strange and unexpected. The expectation is that the driver is always installed as root on the host and that any user inside the container can make use of it (just like on the host itself).

I'm also confused about the path /usr/local/nvidia/ for some of your driver files. That is definitely not the default location for them, so I'm curious what customisations you made when installing the driver to put them there (and why)?

nickkchenn commented 2 years ago

Thank you for answering me. I'll try to explain my situation.

  1. What do you mean when you say you "installed your nvidia driver using a user id 1000". Did you use a .run file to install the driver? What flags did you pass it?

I installed my driver using a .run file and passed it the install path "/usr/local/nvidia". I changed the owner from root:root to paas:paas (1000:1000) manually after installation.
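The ownership change was something along these lines (reconstructed; the exact command may have differed):

```sh
# change the driver install tree from root:root to paas:paas (uid/gid 1000)
chown -R 1000:1000 /usr/local/nvidia
```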

2. From one of your screenshots it looks like you are running under kubernetes. Have you tried running with standalone docker or containerd? Do you have the same issue?

I am running under Kubernetes. I haven't tried running with standalone Docker yet; I'll give it a try tomorrow. So far I had only tried with the command line and thought it should be similar.

3. What is your exact expectation in setting user to 1000:1000 in /etc/nvidia-container-runtime/config.toml?

I knew the container toolkit calls nvidia-container-cli during the prestart phase. I didn't set the user at first, and it failed with "load library failed: libnvidia-ml.so.1 cannot open ......". I knew libnvidia-ml.so.1 was in the driver path and owned by 1000:1000, so I set the user to '1000:1000' in config.toml. I thought that would tell nvidia-container-cli to run as 1000:1000, and then it would have access to all these .so files.

3. The binary nvidia-container-cli will always be run as whatever user the parent process that invokes it is run as (i.e. containerd in your case most likely). The user field in /etc/nvidia-container-runtime/config.toml only indicates which user/group a child process of the nvidia-container-cli will have that makes calls out through NVML.

After your explanation, I now understand it won't work the way I expected.

This seems strange and unexpected. The expectation is that the driver is always installed as root on the host and that any user inside the container can make use of it (just like on the host itself).

The product I am working on has some security constraints.

We are not using root as the default user; our own program processes run as user 1000. So if I just install the driver as root, my process won't have access to it. And we can't set the permissions of files on the host to anything more permissive than 750 (like 755), so we chose to change the ownership.
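To make that concrete, a driver library on the host ends up looking roughly like this (illustrative output): mode 750 with owner and group 1000, so only uid 1000 or members of gid 1000 can read or map it.

```sh
$ stat -c '%a %u:%g %n' /usr/local/nvidia/lib64/libnvidia-ml.so.1
750 1000:1000 /usr/local/nvidia/lib64/libnvidia-ml.so.1
```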

I'm also confused about the path /usr/local/nvidia/ for some of your driver files. That is definitely not the default location for them, so I'm curious what customisations you made when installing the driver to put them there (and why)?.

There was no particular reason to change the install path; I think it can be changed if necessary.

nickkchenn commented 2 years ago

Our service has been using GPUs for a while without nvidia-container-runtime.

We install the NVIDIA driver and CUDA on the host, and mount those files into the container, which runs as user 1000.

We used the plugin image from https://github.com/GoogleCloudPlatform/container-engine-accelerators.

That plugin binary doesn't require the nvidia runtime, so we hadn't installed nvidia-container-toolkit or nvidia-container-runtime before.
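That older setup mounted the host driver directory into the pod with a hostPath volume, along these lines (an illustrative sketch, not our exact manifest; the image name is hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gputest-model
spec:
  containers:
    - name: gputest-model
      image: our-gpu-image          # hypothetical image name
      securityContext:
        runAsUser: 1000             # our processes run as uid 1000
      env:
        - name: LD_LIBRARY_PATH     # so the loader finds the mounted driver libs
          value: /usr/local/nvidia/lib64
      volumeMounts:
        - name: nvidia-libs
          mountPath: /usr/local/nvidia
          readOnly: true
  volumes:
    - name: nvidia-libs
      hostPath:
        path: /usr/local/nvidia     # driver and CUDA libraries installed on the host
```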

And now we have noticed that container-engine-accelerators is not an official Google product,

and we want to move to the mainstream approach for using GPUs,

so I'm trying to install nvidia-container-toolkit and nvidia-container-runtime to make it work in our case.

And due to our constraints, our NVIDIA driver is owned by 1000:1000.

After your explanation of how the "user" config works, it seems it is not possible for me to use the CLI with a non-root installation of the NVIDIA driver.

Is there a proper way for me to make this work?

nickkchenn commented 2 years ago

The installation command for my GPU driver is this:

    sh ${NVIDIA_INSTALLER_RUNFILE} \
      --utility-prefix=/usr/local/nvidia \
      --utility-libdir="lib64" \
      --opengl-prefix=/usr/local/nvidia \
      --opengl-libdir="lib64" \
      --compat32-prefix=/usr/local/nvidia \
      --no-drm \
      --no-install-compat32-libs \
      --silent \
      --kernel-source-path=/usr/src/kernels/"$(uname -r)" \
      --accept-license \
      --tmpdir $TEMP_DIR