Open DimanNe opened 2 years ago
The following workaround worked for me:
no-cgroups = true
in /etc/nvidia-container-runtime/config.toml
docker run --rm --gpus all --runtime nvidia --device /dev/nvidia0 --device /dev/nvidia1 --device /dev/nvidiactl --device /dev/nvidia-modeset --device /dev/nvidia-uvm nvidia/cuda:11.0-base nvidia-smi
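For reference, the no-cgroups switch lives in the [nvidia-container-cli] section of that file; a minimal excerpt (a sketch — surrounding keys and commented defaults vary between toolkit versions):

```toml
# /etc/nvidia-container-runtime/config.toml (excerpt)
[nvidia-container-cli]
# Skip cgroup device-access setup; devices must then be passed
# explicitly via --device, as in the docker run command above.
no-cgroups = true
```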
@DimanNe depending on how the packages are installed, the config file may be overwritten by the package manager version.
As a matter of interest, which version were you upgrading from?
@elezar
the config file may be overwritten by the package manager version
Yeah, it was overwritten, you are right. It is not a problem/question in itself. I was just trying to explain why this particular upgrade caused it.
As a matter of interest, which version were you upgrading from?
At the top of my message, according to apt logs:
nvidia-docker2:amd64 (2.10.0-1, 2.11.0-1)
libnvidia-container1:amd64 (1.9.0-1, 1.10.0-1)
libnvidia-container-tools:amd64 (1.9.0-1, 1.10.0-1)
nvidia-container-toolkit:amd64 (1.9.0-1, 1.10.0-1)
Ah, sorry, I missed that you had both versions.
I would expect the same behaviour with the 1.9.0 and no-cgroups = false. The 1.8.0, 1.8.1, and 1.9.0 releases include changes to handle cgroupv2 and would generate an error like you're seeing. There were no cgroup-related changes in the 1.10.0 release.
@klueska may have more insights into what could be causing the specific error.
I would expect the same behaviour with the 1.9.0 and no-cgroups = false
Agree. I encountered the same (or a similar) issue with one of the previous versions/updates, and disabled cgroups via no-cgroups = true back then... So it is not actually a problem with this particular update...
@klueska Any news? Am I doing something wrong?
I've been hitting an issue with the same symptoms working on enabling cgroup v2 for Bottlerocket. In this case, I found the kernel's eBPF JIT hardening to contribute to the problem. Quoting from our issue:
Bottlerocket enables eBPF JIT hardening for both privileged and unprivileged users by default. One of the hardening measures is a constant blinding pass over eBPF bytecode loaded into the kernel, which applies slight modifications to the bytecode that preserve semantics but decrease attacker control over possible instruction sequences ending up in executable kernel memory. As a side effect, programs that have been blinded cannot be dumped to user space again.
It is the inability to dump eBPF programs to user space that is causing problems for libnvidia-container-go. When allowing GPU access to a container, it prepends new filters to the existing program (I assume put in place for the cgroup by runc). If constant blinding has been applied to the program, libnvidia-container-go will prepend the new filter to a buffer of zeros. "All zeros" happens to be a valid eBPF instruction encoding (loading a constant 32-bit 0 into register 0), so when the modified program is loaded back into the kernel, the eBPF verifier will only notice there's a code path that does not explicitly terminate. This results in the error seen above, with the new device filter program being rejected on the grounds of "last insn is not an exit or jmp".
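The "all zeros" encoding mentioned above can be illustrated by decoding the kernel's fixed 8-byte bpf_insn layout (a sketch; decode_insn is a hypothetical helper for illustration, not part of libnvidia-container):

```python
import struct

# Each eBPF instruction is 8 bytes: opcode (u8), packed dst/src registers
# (u8, 4 bits each), offset (s16), immediate (s32) -- little-endian here.
def decode_insn(raw: bytes) -> dict:
    opcode, regs, off, imm = struct.unpack("<BBhi", raw)
    return {"opcode": opcode, "dst": regs & 0x0F, "src": regs >> 4,
            "off": off, "imm": imm}

# An all-zero instruction decodes cleanly (opcode 0x00, registers r0, imm 0),
# so a zero-filled buffer does not fail parsing outright.
zero_insn = decode_insn(b"\x00" * 8)
print(zero_insn)  # {'opcode': 0, 'dst': 0, 'src': 0, 'off': 0, 'imm': 0}

# BPF_EXIT is opcode 0x95 (BPF_JMP | BPF_EXIT). A program whose last
# instruction is neither exit nor jmp is rejected by the verifier -- hence
# the "last insn is not an exit or jmp" error when the tail is zeros.
BPF_EXIT_OPCODE = 0x95
assert zero_insn["opcode"] != BPF_EXIT_OPCODE
```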
@DimanNe, do you happen to have the eBPF JIT hardening measures enabled as well? You can check by running sudo cat /proc/sys/net/core/bpf_jit_harden.
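The same check can be done programmatically (a sketch; returns None on kernels or systems without that procfs knob — values are 0 = hardening off, 1 = unprivileged users only, 2 = all users):

```python
from pathlib import Path

# /proc/sys/net/core/bpf_jit_harden mirrors the net.core.bpf_jit_harden
# sysctl; reading it does not require root.
path = Path("/proc/sys/net/core/bpf_jit_harden")
level = int(path.read_text()) if path.exists() else None
print(level)
```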
@markusboehme thx for the hint.
Reducing the value of that sysctl from 2 to 1 fixed the issue for me.
I had no luck running with the cgroups off either, so this was very helpful.
@elezar looks like there's some "bug" with the eBPF generation that makes the hardened checker reject it.
Users of the default linux-hardened kernel are affected as well (coming from Arch Linux here). A quick sudo sysctl -w net.core.bpf_jit_harden=1 (which won't survive a reboot; use a sysctl config file for that) is a nice workaround for now. Thanks to @markusboehme & @Ongy
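To make the setting survive reboots, it can be dropped into a sysctl config file (the filename below is an arbitrary choice; any /etc/sysctl.d/*.conf works):

```
# /etc/sysctl.d/99-bpf-jit-harden.conf (assumed filename)
# Apply eBPF JIT constant blinding only to unprivileged users,
# so privileged programs can still be dumped back to user space.
net.core.bpf_jit_harden = 1
```

Load it without rebooting via sudo sysctl --system.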
After I updated the following packages:
the following command:
started to fail with the following error:
Before the update I had no-cgroups = true in /etc/nvidia-container-runtime/config.toml (which I added based on this discussion). After the update it says:
System info