sense-amid-madness opened this issue 1 week ago (status: Open)
I found the solution to the issue. For anybody stumbling over this thread with the same problem, I'll leave it here.
The issue is actually not with nvidia-container-runtime, but with a broken AppArmor profile that prevents runc from sending a kill signal to containers, as documented here:
https://github.com/moby/moby/pull/47749
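To check whether a host is hitting the same problem, one option is to look for AppArmor signal denials in the kernel log. The snippet below is only a sketch: `filter_signal_denials` is a hypothetical helper name, and it assumes the denials are logged in the usual AppArmor audit format (`apparmor="DENIED"` together with `operation="signal"`).

```shell
#!/bin/sh
# Hypothetical helper: keep only AppArmor audit lines that report a denied
# signal operation. Reads log lines on stdin, writes matches to stdout.
filter_signal_denials() {
    grep 'apparmor="DENIED"' | grep 'operation="signal"'
}

# Example usage (needs permission to read the kernel log):
#   sudo dmesg | filter_signal_denials
```

If this prints lines that mention runc while a container is stuck terminating, the AppArmor profile is a likely culprit. Empty output does not rule it out, since the kernel log may have rotated.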
A quick (and very dirty) workaround is to move the runc executable from /usr/sbin/runc to /usr/bin/runc, as it then runs without the broken AppArmor profile. After I did this, all containers stuck on Terminating were killed immediately, and everything worked fine again.
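For reference, the move can be sketched as a small script. The two paths come from the description above; `move_runc` and the `DRY_RUN` switch are hypothetical names added for illustration, and the sketch assumes the container runtime locates runc through a PATH that includes /usr/bin.

```shell
#!/bin/sh
# Sketch of the workaround: relocate runc so it no longer runs from
# /usr/sbin/runc. By default this only prints what it would do.
SRC="${SRC:-/usr/sbin/runc}"
DST="${DST:-/usr/bin/runc}"
DRY_RUN="${DRY_RUN:-1}"

move_runc() {
    if [ "$DRY_RUN" = "1" ]; then
        # Dry run: show the command without touching the filesystem.
        echo "would run: mv $SRC $DST"
        return 0
    fi
    if [ ! -x "$SRC" ]; then
        echo "no executable at $SRC, nothing to do" >&2
        return 1
    fi
    mv "$SRC" "$DST"
}
```

Calling `move_runc` prints the planned command; running `DRY_RUN=0 move_runc` as root performs the move. A later package update may put runc back in its original location, so this is a stopgap until a fixed AppArmor profile is installed.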
Hi, on one of my GPU servers, GPU containers using the nvidia container runtime fail to terminate due to permission issues. What could be the cause of this? They start up and run fine.
The error appears when trying to shut down a container:
Toolkit version is 1.17.1, containerd version 1.7.12.
Thanks much.