NVIDIA / libnvidia-container

NVIDIA container runtime library
Apache License 2.0

nvidia-container-cli: mount error: failed to add device rules: unable to generate new device filter program from existing programs: unable to create new device filters program: load program: invalid argument: last insn is not an exit or jmp #176

Open DimanNe opened 2 years ago

DimanNe commented 2 years ago

After I updated the following packages:

nvidia-docker2:amd64 (2.10.0-1, 2.11.0-1)
libnvidia-container1:amd64 (1.9.0-1, 1.10.0-1)
libnvidia-container-tools:amd64 (1.9.0-1, 1.10.0-1)
nvidia-container-toolkit:amd64 (1.9.0-1, 1.10.0-1)

the following command:

docker run --rm --gpus all --runtime nvidia nvidia/cuda:11.0-base nvidia-smi

started to fail with the following error:

docker: Error response from daemon: failed to create shim: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: failed to add device rules: unable to generate new device filter program from existing programs: unable to create new device filters program: load program: invalid argument: last insn is not an exit or jmp
processed 0 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0: unknown.

Before the update I had no-cgroups = true in /etc/nvidia-container-runtime/config.toml (which I added based on this discussion). After the update the file says:

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"

System info

Command line: BOOT_IMAGE=/vmlinuz-5.11.0-16-generic root=/dev/mapper/vgkubuntu-root ro quiet splash vt.handoff=7 nvidia-drm.modeset=1

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04 LTS
Release:        22.04
Codename:       jammy

$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.73.05    Driver Version: 510.73.05    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:08:00.0  On |                  N/A |
|  0%   47C    P5    25W / 370W |   2110MiB / 10240MiB |     17%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:09:00.0  On |                  N/A |
|  0%   35C    P8    15W / 220W |    205MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

$ nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 3080 (UUID: GPU-22...)
GPU 1: NVIDIA GeForce RTX 3060 Ti (UUID: GPU-37...)

$ dpkg -l | grep systemd
ii  systemd                                       249.11-0ubuntu3.1                           amd64        system and service manager
ii  systemd-container                             249.11-0ubuntu3.1                           amd64        systemd container/nspawn tools
ii  systemd-sysv                                  249.11-0ubuntu3.1                           amd64        system and service manager - SysV links
ii  systemd-timesyncd                             249.11-0ubuntu3.1                           amd64        minimalistic service to synchronize local time with NTP servers
DimanNe commented 2 years ago

Workaround

The following workaround worked for me:

  1. restore no-cgroups = true in /etc/nvidia-container-runtime/config.toml
  2. start docker in this way: docker run --rm --gpus all --runtime nvidia --device /dev/nvidia0 --device /dev/nvidia1 --device /dev/nvidiactl --device /dev/nvidia-modeset --device /dev/nvidia-uvm nvidia/cuda:11.0-base nvidia-smi
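In case the set of device nodes differs on other machines, here is the same idea as a rough sketch (assuming the usual /dev/nvidia* character devices created by the driver; adjust as needed):

# Pass every NVIDIA character device explicitly instead of hard-coding the list.
devices=""
for d in /dev/nvidia*; do
  [ -c "$d" ] || continue          # keep only character devices
  devices="$devices --device $d"
done
docker run --rm --gpus all --runtime nvidia $devices nvidia/cuda:11.0-base nvidia-smi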
elezar commented 2 years ago

@DimanNe depending on how the packages are installed, the config file may be overwritten by the package manager version.

As a matter of interest, which version were you upgrading from?

DimanNe commented 2 years ago

@elezar

the config file may be overwritten by the package manager version

Yeah, you are right, it was overwritten. That is not a problem or a question in itself; I was just trying to explain why this particular upgrade triggered the error.

As a matter of interest, which version were you upgrading from?

They are at the top of my message, taken from the apt logs:

nvidia-docker2:amd64 (2.10.0-1, 2.11.0-1)
libnvidia-container1:amd64 (1.9.0-1, 1.10.0-1)
libnvidia-container-tools:amd64 (1.9.0-1, 1.10.0-1)
nvidia-container-toolkit:amd64 (1.9.0-1, 1.10.0-1)
elezar commented 2 years ago

Ah, sorry, I missed that you had both versions.

I would expect the same behaviour with 1.9.0 and no-cgroups = false. The 1.8.0, 1.8.1, and 1.9.0 releases included changes to handle cgroupv2 and would generate an error like the one you're seeing. There were no cgroup-related changes in the 1.10.0 release.

@klueska may have more insights into what could be causing the specific error.

DimanNe commented 2 years ago

I would expect the same behaviour with the 1.9.0 and no-cgroups = false

Agreed. I encountered the same (or a similar) issue with one of the previous versions/updates and disabled cgroups via no-cgroups = true back then... So it is not actually a problem with this particular update...

DimanNe commented 2 years ago

@klueska Any news? Am I doing something wrong?

markusboehme commented 1 year ago

I've been hitting an issue with the same symptoms while working on enabling cgroup v2 for Bottlerocket. In that case, I found the kernel's eBPF JIT hardening to be a contributing factor. Quoting from our issue:

Bottlerocket enables eBPF JIT hardening for both privileged and unprivileged users by default. One of the hardening measures is a constant blinding pass over eBPF bytecode loaded into the kernel, which applies slight modifications to the bytecode that preserve semantics but decrease attacker control over possible instruction sequences ending up in executable kernel memory. As a side effect, programs that have been blinded cannot be dumped to user space again.

It is the inability to dump eBPF programs to user space that is causing problems for libnvidia-container-go. When allowing GPU access to a container, it prepends new filters to the existing program (which I assume was put in place for the cgroup by runc). If constant blinding has been applied to that program, libnvidia-container-go will prepend the new filter to a buffer of zeros. "All zeros" happens to be a valid eBPF instruction encoding (loading a constant 32-bit 0 into register 0), so when the modified program is loaded back into the kernel, the eBPF verifier will only notice that there is a code path that does not explicitly terminate. This results in the error seen above, with the new device filter program being rejected on the grounds of "last insn is not an exit or jmp".

@DimanNe, do you happen to have the eBPF JIT hardening measures enabled as well? You can check by running sudo cat /proc/sys/net/core/bpf_jit_harden.
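If you want to poke at this directly, something along the following lines should work (assuming bpftool is installed; program ids and cgroup paths will differ per system):

# 0 = JIT hardening off, 1 = unprivileged programs only, 2 = all programs
sudo sysctl net.core.bpf_jit_harden

# List eBPF programs attached to cgroups; the cgroup_device entries are the device filters.
sudo bpftool cgroup tree

# Dump the translated bytecode of one filter by id (taken from the output above).
# If constant blinding prevents the dump, the instructions may come back zeroed,
# which is the buffer of zeros described above.
sudo bpftool prog dump xlated id <id>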

Ongy commented 1 year ago

@markusboehme thx for the hint.

Reducing that sysctl from 2 to 1 fixed the issue for me. I had no luck running with cgroups turned off either, so this was very helpful.

@elezar it looks like there is some "bug" in the eBPF generation that makes the hardened verifier reject the program.

bioluks commented 5 months ago

Users of the default linux-hardened kernel are affected as well; I'm coming from Arch Linux. A quick sudo sysctl -w net.core.bpf_jit_harden=1 (which won't survive a reboot; use a sysctl config file for that, see the sketch below) is a nice workaround for now. Thanks to @markusboehme & @Ongy.
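For persistence, a minimal sketch of such a sysctl config (any file under /etc/sysctl.d/ works; the name 99-bpf-jit-harden.conf is just an example):

# /etc/sysctl.d/99-bpf-jit-harden.conf
# Relax eBPF JIT hardening from "all programs" (2) to "unprivileged programs only" (1),
# so the device filter program attached to the container cgroup can be dumped and extended again.
net.core.bpf_jit_harden = 1

Apply it without rebooting via sudo sysctl --system.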