Open plopresti opened 4 years ago
Thanks for taking the time to thoroughly debug the issue. I will need to think about the proper fix for this, but your description makes it clear what the issue is at least.
@klueska Any thoughts on this? I can take a crack at a fix if you can describe what the proper fix is...
Sorry for the delayed response. And thanks for the detailed description / debugging of the issue.
I'll need to dig into this a bit to see what the right fix is. In the meantime, are you OK running on your hacked version, or do you need something more stable / official?
I wound up using Singularity instead, since it already solves all of the problems I was trying to solve with rootless Podman.
Thanks!
For future reference, it looks like a workaround for podman is described here: https://github.com/containers/podman/issues/3659
Yeah, I saw that. But this problem is specific to rootless podman with the "ignore_chown_errors" option enabled (which maps everything to UID 0 inside the container).
Got it. Thanks for clarifying.
Related: https://github.com/NVIDIA/nvidia-container-runtime/issues/85
My libnvidia-container version is 1.2.0.
I am using rootless podman on RHEL (CentOS) 7.8, trying out the new-ish ignore_chown_errors option. This mode maps all users inside the container to my own user id, avoiding the nuisances of UID maps (newuidmap, /etc/subuid, etc.)
I followed all of the recommendations at the "related" link above; specifically, I edited config.toml to set no-cgroups to true and the debug path to something in my home directory.
But I get the following error:
Running under strace -f is informative:
Note that PID 3742 is the main nvidia-container-cli process and PID 3747 is the "driver" sub-process. The driver sub-process calls setgroups(), which fails with EPERM; the sub-process then exits, and the main process exits with an error.
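One plausible explanation for the EPERM, offered as an assumption rather than something confirmed in this thread: per user_namespaces(7), when a user namespace's GID map is written without privilege, "deny" must first be written to /proc/[pid]/setgroups, and from then on setgroups() fails with EPERM for every process in that namespace. The C sketch below checks for that state from inside the namespace. The function names are hypothetical and are not part of libnvidia-container, and the file is only present on kernels that provide it.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: interpret the contents of a /proc/[pid]/setgroups
 * file, which holds either "allow" or "deny" followed by a newline.
 * Returns 1 for "deny", 0 for anything else. */
int setgroups_text_is_deny(const char *text)
{
    return strncmp(text, "deny", 4) == 0;
}

/* Hypothetical helper: returns 1 if setgroups(2) is disabled in the
 * caller's user namespace, 0 if it is allowed. A missing or unreadable
 * file (for example on a kernel without this interface) is treated as
 * "not denied". */
int setgroups_is_denied(void)
{
    char buf[16] = {0};
    FILE *f = fopen("/proc/self/setgroups", "r");

    if (f == NULL)
        return 0;
    if (fgets(buf, sizeof(buf), f) == NULL)
        buf[0] = '\0';
    fclose(f);
    return setgroups_text_is_deny(buf);
}
```

Calling setgroups_is_denied() from a small test program started inside the rootless container would show whether the namespace is in the "deny" state, which would be consistent with the strace result described above.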
The only call to setgroups() in the source code is here:
https://github.com/NVIDIA/libnvidia-container/blob/e6e1c4860d9694608217737c31fc844ef8b9dfd7/src/utils.c#L918
...which is in perm_drop_privileges().
So I commented out the body of perm_drop_privileges(), replaced it with "return 0", recompiled, and installed the hacked libnvidia-container.so.1. And now it works!
Obviously this is not the right fix, and I do not know enough to say what the right fix is. But when I am running rootless podman with all container UIDs mapped to myself, I actually want the container's processes to retain all of my privileges on the host. Perhaps config.toml could offer an option to skip perm_drop_privileges()?
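As an alternative to skipping the whole function, a narrower change might be to tolerate only the specific failure seen here. The following C sketch is illustrative only. It is not the code in perm_drop_privileges(), the function names are hypothetical, and the call setgroups(0, NULL) is an assumption about how supplementary groups would be dropped. The caller passes in whether setgroups is denied for the current user namespace, which could be determined from /proc/self/setgroups as described in user_namespaces(7).

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE /* exposes setgroups() in <grp.h> on glibc */
#endif
#include <errno.h>
#include <grp.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical policy: a setgroups() failure is considered harmless only
 * when the error is EPERM and the user namespace has setgroups disabled,
 * since in that case no process in the namespace can change its groups. */
int setgroups_failure_is_benign(int err, int setgroups_denied)
{
    return err == EPERM && setgroups_denied != 0;
}

/* Hypothetical replacement for the group-dropping step. Returns 0 on
 * success or on a tolerated failure, -1 on any other error (errno is
 * left as set by setgroups()). */
int drop_supplementary_groups(int setgroups_denied)
{
    if (setgroups(0, NULL) == 0)
        return 0;
    if (setgroups_failure_is_benign(errno, setgroups_denied))
        return 0;
    return -1;
}
```

With this shape, any other setgroups() failure still aborts the privilege drop, so behavior outside rootless user namespaces would stay as strict as before. Whether that is acceptable for the library is a question for the maintainers.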