Open nickkchenn opened 2 years ago
I am using user id 1000 in my container, so if I set my nvidia driver owner to root, I will have no access to the driver files inside the container.
I am also not allowed to set the permissions of the driver files on the host to anything more permissive than 750, due to some security requirements.
So I am trying to use the gpu in my container with user id 1000, while my driver files are owned by user 1000 with permission 750.
I thought nvidia-container-cli supported setting user=1000:1000, so I expected it to work.
Hi @nickkchenn
I'm still trying to understand a bit what your setup is and what your expectations are.
Initial questions:

1. What do you mean when you say you "installed your nvidia driver using a user id 1000"? Did you use a `.run` file to install the driver? What flags did you pass it?
2. From one of your screenshots it looks like you are running under kubernetes. Have you tried running with standalone docker or containerd? Do you have the same issue?
3. What is your exact expectation in setting `user` to `1000:1000` in `/etc/nvidia-container-runtime/config.toml`? The binary `nvidia-container-cli` will always be run as whatever user the parent process that invokes it is run as (i.e. `containerd` in your case, most likely). The `user` field in `/etc/nvidia-container-runtime/config.toml` only indicates which user/group a child process of `nvidia-container-cli` will have when it makes calls out through NVML.

You also mention something about:
> I am using user id 1000 in my container, so if I set my nvidia driver owner to root, I will have no access to the driver files inside the container.
This seems strange and unexpected. The expectation is that the driver is always installed as root on the host and that any user inside the container can make use of it (just like on the host itself).
I'm also confused about the path `/usr/local/nvidia/` for some of your driver files. That is definitely not the default location for them, so I'm curious what customisations you made when installing the driver to put them there (and why)?
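For reference, the `user` setting being discussed lives under the `[nvidia-container-cli]` section of that file. A minimal sketch of the relevant excerpt (the values shown here are illustrative, not a recommendation):

```toml
# /etc/nvidia-container-runtime/config.toml (excerpt, illustrative values)
[nvidia-container-cli]
# user/group for the NVML-calling child process of nvidia-container-cli;
# this does NOT change the user that nvidia-container-cli itself runs as
user = "1000:1000"
ldconfig = "@/sbin/ldconfig"
load-kmods = true
```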
Thank you for answering me. I'll try to explain my situation.
> 1. What do you mean when you say you "installed your nvidia driver using a user id 1000"? Did you use a `.run` file to install the driver? What flags did you pass it?

I installed my driver using a .run file, passing it the install path "/usr/local/nvidia". I changed the owner from root:root to paas:paas (1000:1000) manually after installation.
> 2. From one of your screenshots it looks like you are running under kubernetes. Have you tried running with standalone docker or containerd? Do you have the same issue?

I am running under kubernetes. I haven't tried running with standalone docker yet; I'll try it tomorrow. So far I have only tried the command line, which I assumed would behave similarly.
> 3. What is your exact expectation in setting `user` to `1000:1000` in `/etc/nvidia-container-runtime/config.toml`?

I knew the container-toolkit calls container-cli during the prestart phase. I didn't set the user at first, and it failed with "load library failed, libnvidia-ml.so.1 cannot open ......". I knew libnvidia-ml.so.1 was in the driver path and owned by 1000:1000, so I set the user to '1000:1000' in the config.toml. I thought that would tell container-cli to work under 1000:1000, so it would have access to all these .so files.
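As a quick way to check whether the dynamic loader can actually resolve that library on the host, something like the following sketch may help (on a machine without the NVIDIA driver installed, the fallback message prints instead):

```shell
# Check whether libnvidia-ml.so.1 is visible to the dynamic loader; a
# permissions problem (e.g. mode 750 owned by another user) or a missing
# ldconfig entry would both make the library unresolvable.
ldconfig -p | grep libnvidia-ml || echo "libnvidia-ml.so.1 not in ld cache"
```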
> 3. The binary `nvidia-container-cli` will always be run as whatever user the parent process that invokes it is run as (i.e. `containerd` in your case, most likely). The `user` field in `/etc/nvidia-container-runtime/config.toml` only indicates which user/group a child process of `nvidia-container-cli` will have when it makes calls out through NVML.

After your explanation, I now understand it won't work as I expected.
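The inheritance rule described above is ordinary Unix process semantics, easy to observe with id(1): a child process keeps its parent's uid unless something with the privilege to do so explicitly drops it.

```shell
# A child process inherits its parent's uid; likewise nvidia-container-cli
# runs as whatever user its parent (e.g. containerd) runs as.
echo "parent uid: $(id -u)"
sh -c 'echo "child uid: $(id -u)"'
```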
> This seems strange and unexpected. The expectation is that the driver is always installed as root on the host and that any user inside the container can make use of it (just like on the host itself).

The product I am working on has some security constraints. We are not using root as the default user; our own program processes run under user 1000. So if I just install the driver as root, my processes won't have access to it, and we can't set file permissions on the host to anything more permissive than 750 (like 755), so we chose to change the ownership instead.
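The 750 constraint mentioned above is exactly what locks out every uid other than the owner and the owning group. A small sketch with stat(1) (GNU coreutils) makes the mode bits concrete:

```shell
# mode 750 = rwx for the owner, r-x for the group, no access for "other";
# so a root-owned driver file at 750 is unreadable by uid 1000 unless
# uid 1000 is in the owning group.
f=$(mktemp)
chmod 750 "$f"
stat -c 'mode=%a' "$f"   # prints mode=750
rm -f "$f"
```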
> I'm also confused about the path `/usr/local/nvidia/` for some of your driver files. That is definitely not the default location for them, so I'm curious what customisations you made when installing the driver to put them there (and why)?

There was no particular reason to change the install path; I think it can be changed back if necessary.
Our service has been using the gpu for a while without nvidia-container-runtime.
We install the nvidia driver and cuda on the host, and mount those files into the container with user 1000.
We used the plugin image from https://github.com/GoogleCloudPlatform/container-engine-accelerators
This plugin binary doesn't require the nvidia runtime, so we hadn't installed nvidia-container-toolkit or the runtime before.
We have now noticed that container-engine-accelerators is not an official Google product, and we want to switch to the mainstream approach to using the gpu, so I'm trying to install the nvidia-container toolkit and runtime to make it work in our case.
Due to our limitations, our nvidia driver is owned by 1000:1000.
After your explanation of how the "user" config works, it seems it's not possible for me to use the cli with a non-root installation of the nvidia driver.
Is there a proper way for me to make it work?
The installation command for my gpu driver is:

```shell
sh ${NVIDIA_INSTALLER_RUNFILE} \
  --utility-prefix=/usr/local/nvidia \
  --utility-libdir="lib64" \
  --opengl-prefix=/usr/local/nvidia \
  --opengl-libdir="lib64" \
  --compat32-prefix=/usr/local/nvidia \
  --no-drm \
  --no-install-compat32-libs \
  --silent \
  --kernel-source-path=/usr/src/kernels/"$(uname -r)" \
  --accept-license \
  --tmpdir $TEMP_DIR
```
[Background]
I installed nvidia-container-toolkit-1.8.1-1.x86_64 and nvidia-container-runtime-3.8.1-1.noarch on a gpu node. I installed my nvidia driver using user id 1000 on the node and set the default docker runtime to the nvidia runtime.
I noticed that the container process runs with the default root user id 0, so I
After this, I tried to call nvidia-container-cli through the command line, and it worked well.
I expected to deploy my pod with the nvidia runtime successfully, but I still came across this error while creating the container.
I think this is an issue about owner groups; I'm confused about why it won't work when creating the container.
[more details] I used strace to get more stack details:

```shell
strace -o output.log -F -f -T -tt -e trace=all nvidia-container-cli --user=1000:1000 info
```

I found out that it failed when reading soft links, and I noticed there were many geteuid calls with different results during the process.
Although I have set the user to 1000:1000 through the config, it still gets id 0 a lot; I wonder if that's the correct behavior.
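One way to summarise those geteuid results is to tally them from the strace log. A sketch, using a fabricated two-line sample log for illustration (point grep at your real output.log in practice):

```shell
# Tally geteuid() return values in an strace log. The sample lines below
# are fabricated for illustration only.
printf '1234 geteuid() = 0\n1235 geteuid() = 1000\n' > sample.log
grep -o 'geteuid() = [0-9]*' sample.log | sort | uniq -c
rm -f sample.log
```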
Is it not ok to use nvidia-container-toolkit with a user other than root?