Closed ich777 closed 1 year ago
Hi @ich777 thanks for the issue. After reading through the responses, it seems that the user has narrowed down the cause of the issue:
So I kept installing from there expanding my core group of containers to more and more based on how long I had been using them and vibes basically. I did five at a time and then I'd test for failure by starting/stopping over and over on the containers and docker service itself. I'd reboot the server just to check that too trying to force something to fail and give that error. Eventually, I worked my way down to about 10 containers. Krusader was added and immediately when I stopped Plex and tried to restart it blam-o! I got the error again. I uninstalled Krusader, did a clean reboot, and all was fine. Reinstalled Krusader and blam-o! again. I'm honestly not sure where to begin on that container and making it work. I rarely use it anyway, so may just keep it uninstalled for my sanity.
Looking at the repo for the Krusader image: https://github.com/binhex/arch-krusader/tree/master I don't see any explicit issues, but it could be that the container is modifying or locking /proc/sys/kernel/overflowuid, causing GPU containers which try to access this file through the NVIDIA prestart hook to fail.
Looking at the code that accesses this file:
    if (cfg->uid != (uid_t)-1)
        ctx->cfg.uid = cfg->uid;
    else {
        if (file_read_uint32(err, PROC_OVERFLOW_UID, &uid) < 0)
            return (-1);
        ctx->cfg.uid = (uid_t)uid;
    }
    if (cfg->gid != (gid_t)-1)
        ctx->cfg.gid = cfg->gid;
    else {
        if (file_read_uint32(err, PROC_OVERFLOW_GID, &gid) < 0)
            return (-1);
        ctx->cfg.gid = (gid_t)gid;
    }
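To make the fallback in the snippet above easier to reason about, here is a minimal standalone sketch of the same pattern. It is an assumption-laden illustration, not the library's code: read_uint32_file and resolve_id are hypothetical names, and the sketch assumes that file_read_uint32 does nothing more than parse one unsigned decimal number from a file.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for file_read_uint32: parse one unsigned decimal
 * number from the file at `path`. Returns 0 on success, -1 on failure. */
static int read_uint32_file(const char *path, uint32_t *out)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1; /* file missing or not readable */
    int matched = fscanf(fp, "%" SCNu32, out);
    fclose(fp);
    return (matched == 1) ? 0 : -1;
}

/* Use the configured id when one is set; otherwise fall back to the
 * overflow id stored in `path`. (uint32_t)-1 means "not configured",
 * mirroring the (uid_t)-1 check in the snippet above. */
static int resolve_id(uint32_t configured, const char *path, uint32_t *resolved)
{
    assert(path != NULL && resolved != NULL);
    if (configured != (uint32_t)-1) {
        *resolved = configured; /* explicit setting: the file is never opened */
        return 0;
    }
    return read_uint32_file(path, resolved);
}
```

With an explicit id, resolve_id returns without touching the file at all, which is why an unreadable /proc/sys/kernel/overflowuid would only matter when no user is configured.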
It seems as if setting the user and group explicitly should address this issue.
If it's possible, update the nvidia-container-cli.user setting in the /etc/nvidia-container-toolkit/config.toml file to:
    user = "65534:65534"
(the contents of /proc/sys/kernel/overflowuid and /proc/sys/kernel/overflowgid, respectively).
This should cause the nvidia-container-runtime-hook to add a --user=65534:65534 flag and prevent the files from being accessed.
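Before hard-coding the value, it may be worth confirming what the two files contain on the affected host. The following is a small diagnostic sketch, not part of the toolkit: read_overflow_id and format_user_setting are hypothetical helper names, and the sketch assumes each file holds a single unsigned decimal number.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: read one unsigned decimal number from `path`.
 * Returns 0 on success, -1 if the file cannot be opened or parsed. */
static int read_overflow_id(const char *path, uint32_t *out)
{
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    int matched = fscanf(fp, "%" SCNu32, out);
    fclose(fp);
    return (matched == 1) ? 0 : -1;
}

/* Hypothetical helper: build the config line `user = "<uid>:<gid>"` from
 * the two files. Returns 0 on success, -1 if either file is unreadable
 * or the buffer is too small. */
static int format_user_setting(const char *uid_path, const char *gid_path,
                               char *buf, size_t len)
{
    uint32_t uid = 0, gid = 0;
    assert(buf != NULL && len > 0);
    if (read_overflow_id(uid_path, &uid) < 0 ||
        read_overflow_id(gid_path, &gid) < 0)
        return -1;
    int n = snprintf(buf, len, "user = \"%" PRIu32 ":%" PRIu32 "\"", uid, gid);
    return (n > 0 && (size_t)n < len) ? 0 : -1;
}
```

Calling format_user_setting with "/proc/sys/kernel/overflowuid" and "/proc/sys/kernel/overflowgid" gives the line to place in the config file. A return value of -1 on the affected host would itself be a useful data point, since it would suggest the files cannot be read there.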
@elezar Thanks for the detailed explanation and the possible solution. I will do a bit more in-depth troubleshooting with the user (unrelated to the container-toolkit) to maybe find the cause of the issue.
I saw the update on his post just now. Really strange that Krusader seems to cause this.
I'll close this issue now since it seems to be solved. If we find out why it doesn't work after installing this specific container I'll post a follow up here.
Thank you very much!
Hi @elezar, as promised, here is a dedicated follow-up issue (it took some time, but I'm overwhelmed by work in real life).
A user on the unRAID forums reported that he gets this message when trying to run Docker containers with his Nvidia GPU (GTX1070 <- details attached below):
He added that while the Docker container Plex is working fine (at times) with his Nvidia GPU, the Docker containers Tdarr and Unmanic are not.
I see this issue pop up from time to time on the unRAID forums, but it is definitely not the case on all systems. In one past case, a user reported it and the issue went away after he upgraded his CPU and motherboard.
The user reported that this issue occurred after upgrading from unRAID 6.11.1 (Kernel v5.19.14) to unRAID 6.12.3 (Kernel v6.1.38). The only difference in the driver packages is that on 6.11.1 he used the Nvidia driver 520.56.06 and on 6.12.3 the Nvidia driver 535.98. Driver package 520.56.06 contains container-toolkit 1.11.0 and driver package 535.98 contains container-toolkit 1.13.5, so it is a bit hard to compare apples to apples.
He already tried booting with Legacy and UEFI and downgrading to the legacy Nvidia driver 470.199.02 (which contains container-toolkit 1.13.3), but that made no difference either. He also tried passing the GPU through to a VM to check whether it works, and it works just fine.
I already tried to reproduce this on my test system with Unraid 6.12.3 and the same Nvidia driver version 535.98, but the driver package works perfectly fine with my Nvidia T400 and Docker containers (Plex, Jellyfin, Unmanic).
Here are Diagnostics from his system including the syslog: nvidia-smi.txt motherboard.txt lspci.txt lscpu.txt syslog.txt
You can read the full report here and here.
If you need anything else please let me know.
Cheers, Christoph