NVIDIA / libnvidia-container

NVIDIA container runtime library
Apache License 2.0

nvidia-container-cli: mount error: failed to add device rules, permission denied #209

Open sergeimonakhov opened 1 year ago

sergeimonakhov commented 1 year ago

Hi,

I tried to pass a GPU through to a container using nvidia-container-toolkit v1.13.0 on Ubuntu 22.04 (cgroupv2) with Linux kernel 6.1.24, and got this error:

nvidia-container-cli: mount error: failed to add device rules: unable to generate new device filter program with no existing programs: unable to create new device filters program: load program: permission denied: 0: R1=ctx(off=0,imm=0) R10=fp0
0: (69) r2 = *(u16 *)(r1 +0)          ; R1=ctx(off=0,imm=0) R2_w=scalar(umax=65535,var_off=(0x0; 0xffff))
1: (61) r3 = *(u32 *)(r1 +0)          ; R1=ctx(off=0,imm=0) R3_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
2: (74) w3 >>= 16                     ; R3_w=scalar(umax=65535,var_off=(0x0; 0xffff))
3: (61) r4 = *(u32 *)(r1 +4)          ; R1=ctx(off=0,imm=0) R4_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
4: (61) r5 = *(u32 *)(r1 +8)          ; R1=ctx(off=0,imm=0) R5_w=scalar(umax=4294967295,var_off=(0x0; 0xffffffff))
5: (55) if r2 != 0x2 goto pc+7        ; R2_w=2
6: (bc) w2 = w3                       ; R2_w=scalar(umax=65535,var_off=(0x0; 0xffff)) R3_w=scalar(umax=65535,var_off=(0x0; 0xffff))
7: (54) w2 &= 6                       ; R2_w=scalar(umax=6,var_off=(0x0; 0x6))
8: (15) if r2 == 0x0 goto pc+4        ; R2_w=scalar(umax=6,var_off=(0x0; 0x6))
9: (55) if r4 != 0xc3 goto pc+3       ; R4=195
10: (55) if r5 != 0xff goto pc+2 13: R1=ctx(off=0,imm=0) R2=scalar(umax=6,var_off=(0x0; 0x6)) R3=scalar(umax=65535,var_off=(0x0; 0xffff)) R4=195 R5=scalar(umax=4294967295,var_off=(0x0; 0xffffffff)) R10=fp0
13: (95) exit
R0 !read_ok
processed 14 insns (limit 1000000) max_states_per_insn 0 total_states 1 peak_states 1 mark_read 1
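
The error above embeds the kernel's eBPF verifier log. When a message like this is captured as a single string with literal backslash-n escapes, it can help to split it into lines and pick out the verifier's verdict. The following is a minimal Python sketch of that idea, not part of nvidia-container-cli. The helper names `split_verifier_log` and `find_rejection` are hypothetical, and `SAMPLE` is an abbreviated excerpt of the log in this issue. The `R0 !read_ok` line is the verifier reporting that register R0 (the program's return value) is read at `exit` without having been written on that path.

```python
import re
from typing import List, Optional

# Abbreviated excerpt of the log from this issue (escapes kept as reported).
SAMPLE = (
    r"load program: permission denied: 0: R1=ctx(off=0,imm=0) R10=fp0\\n"
    r"13: (95) exit\\n"
    r"R0 !read_ok\\n"
    r"processed 14 insns (limit 1000000)\\n"
)


def split_verifier_log(message: str) -> List[str]:
    """Split on literal escapes: one or more backslashes followed by 'n'."""
    parts = re.split(r"\\+n", message)
    # Drop pieces that are empty or contain only spaces and stray backslashes.
    return [p.rstrip() for p in parts if p.strip(" \\")]


def find_rejection(lines: List[str]) -> Optional[str]:
    """Return the first line reporting an unreadable register, if any."""
    for line in lines:
        if "!read_ok" in line:
            return line
    return None


if __name__ == "__main__":
    log_lines = split_verifier_log(SAMPLE)
    for line in log_lines:
        print(line)
    print("verdict:", find_rejection(log_lines))
```

Splitting on the escape sequence rather than on real newlines is a deliberate choice here, since the runtime reports the whole verifier log inside one error string.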

With an older kernel (before 6.x) the problem does not occur, and it also goes away if I switch to cgroupv1.
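
Since the failure depends on both the cgroup version and the kernel version, it is useful to record both when reproducing it. The sketch below is one way to do that in Python, under stated assumptions: it treats the presence of `cgroup.controllers` at the cgroup mount root (assumed to be `/sys/fs/cgroup`) as the marker of a unified cgroup v2 hierarchy, and reports anything else as v1 (which also covers hybrid setups). The function names `cgroup_version` and `kernel_major_minor` are hypothetical.

```python
import platform
import re
from pathlib import Path
from typing import Tuple


def cgroup_version(root: str = "/sys/fs/cgroup") -> int:
    """Return 2 if `root` looks like a unified cgroup v2 mount, else 1.

    Assumption: cgroup v2 exposes a `cgroup.controllers` file at the
    root of its hierarchy; legacy and hybrid layouts do not.
    """
    return 2 if (Path(root) / "cgroup.controllers").is_file() else 1


def kernel_major_minor(release: str) -> Tuple[int, int]:
    """Parse the leading 'major.minor' from a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


if __name__ == "__main__":
    # platform.release() returns the running kernel's release string.
    major, minor = kernel_major_minor(platform.release())
    print(f"kernel {major}.{minor}, cgroup v{cgroup_version()}")
```

Including this output in a report makes it easier to compare a failing setup (here: kernel 6.1.24 with cgroupv2) against a working one.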

elezar commented 1 year ago

Could this be related to https://github.com/NVIDIA/libnvidia-container/issues/176 and the hardening around eBPF programs? Does the workaround suggested there also work?