NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

nvidia-container-runtime unable to signal init: permission denied #796

Open sense-amid-madness opened 1 week ago

sense-amid-madness commented 1 week ago

Hi, on one of my GPU servers, containers using the NVIDIA container runtime fail to terminate because of a permission error. What could be the cause? They start up and run fine.

The error appears when trying to shut down a container:

sudo ctr -n k8s.io task kill fddedcb271ff4df58b5e539fb246ca86700db730ecde0ae7c38be0d1c77d39e1
ctr: unknown error after kill: /usr/bin/nvidia-container-runtime did not terminate successfully: exit status 1: unable to signal init: permission denied : unknown

Toolkit version is 1.17.1, containerd version 1.7.12.

Thanks much.

sense-amid-madness commented 1 week ago

I found the solution to the issue - for anybody stumbling over this thread with the same problem, I'll leave it here.

The issue is actually not with nvidia-container-runtime, but with a broken AppArmor profile that prevents runc from sending the kill signal to containers, as documented here:

https://github.com/moby/moby/pull/47749
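
To check whether a host is hitting the same problem, one option is to look for AppArmor signal denials in the kernel log. This is only a sketch: the helper name `apparmor_signal_denials` is made up for illustration, and it assumes denials are logged in the usual `apparmor="DENIED" operation="signal"` audit format.

```shell
# Hypothetical helper (not part of any toolkit): keep only AppArmor
# denials for signal operations. Reads kernel log text on stdin and
# prints the matching lines.
apparmor_signal_denials() {
    grep 'apparmor="DENIED"' | grep 'operation="signal"'
}

# Possible use on the affected host (reading the kernel log needs root):
#   sudo dmesg | apparmor_signal_denials
#   sudo aa-status | grep runc    # is a runc profile loaded at all?
```

If the filter prints lines that mention runc while a `task kill` is failing, that points at AppArmor rather than at the NVIDIA runtime.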

A quick (and very dirty) workaround is to move the runc executable from /usr/sbin/runc to /usr/bin/runc, as it then runs without the broken AppArmor profile. After the move, all containers stuck in Terminating were killed immediately, and everything worked fine again.
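
For reference, the move can be sketched as a small shell function. The function name `relocate_runc` and its safety checks are my own additions for illustration; the default paths are the ones mentioned above. This is not an official fix, and a later package upgrade may well put runc back at its original path.

```shell
# Hypothetical helper: move the runc binary so it no longer matches the
# path the AppArmor profile is attached to. Defaults follow the paths in
# this thread; both can be overridden via arguments.
relocate_runc() {
    src="${1:-/usr/sbin/runc}"
    dst="${2:-/usr/bin/runc}"

    # Refuse to continue if there is nothing executable to move.
    if [ ! -x "$src" ]; then
        echo "no executable found at $src" >&2
        return 1
    fi

    # Refuse to overwrite an existing file at the destination.
    if [ -e "$dst" ]; then
        echo "$dst already exists, not overwriting" >&2
        return 1
    fi

    mv -- "$src" "$dst"
}

# Possible use, from a root shell (sudo cannot call shell functions directly):
#   relocate_runc
```

The checks are there so that a second run, or a host where runc already lives in /usr/bin, fails loudly instead of clobbering a binary.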