NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

nvidia-container-cli: mount error: failed to add device rules.... bpf_prog_query(BPF_CGROUP_DEVICE) #139

Open Sparticuz opened 1 year ago

Sparticuz commented 1 year ago

1. Issue or feature description

Failing to start HW Accelerated containers.

kyle@bently 03:50:18 /var/log $ sudo docker run --gpus all nvidia/cuda:11.3.0-runtime-ubuntu20.04 nvidia-smi
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: mount error: failed to add device rules: unable to find any existing device filters attached to the cgroup: bpf_prog_query(BPF_CGROUP_DEVICE) failed: invalid argument: unknown.
ERRO[0002] error waiting for container: context canceled

It was working before updating to 1.11; however, downgrading to 1.10 doesn't seem to fix anything.
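For context, the failing `bpf_prog_query(BPF_CGROUP_DEVICE)` call is the hook trying to read the device filters attached to the container's cgroup, which is how device access is enforced on the unified cgroup v2 hierarchy. A quick, generic way to check which cgroup setup the host is actually running (these commands are illustrative, not part of the original report):

# "cgroup2fs" means the unified v2 hierarchy is mounted at /sys/fs/cgroup; "tmpfs" means v1/hybrid
stat -fc %T /sys/fs/cgroup/

# Docker also reports the cgroup version and driver it detected
sudo docker info | grep -i cgroup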

2. Steps to reproduce the issue

3. Information to attach (optional if deemed irrelevant)

-- WARNING, the following logs are for debugging purposes only --

I1007 19:50:18.398032 1585041 nvc.c:376] initializing library context (version=1.11.0, build=0000000000000000000000000000000000000000)
I1007 19:50:18.398112 1585041 nvc.c:350] using root /
I1007 19:50:18.398118 1585041 nvc.c:351] using ldcache /etc/ld.so.cache
I1007 19:50:18.398126 1585041 nvc.c:352] using unprivileged user 65534:65534
I1007 19:50:18.398147 1585041 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I1007 19:50:18.398218 1585041 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I1007 19:50:18.399247 1585042 nvc.c:278] loading kernel module nvidia
I1007 19:50:18.399523 1585042 nvc.c:282] running mknod for /dev/nvidiactl
I1007 19:50:18.399583 1585042 nvc.c:286] running mknod for /dev/nvidia0
I1007 19:50:18.399633 1585042 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I1007 19:50:18.404845 1585042 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I1007 19:50:18.404948 1585042 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I1007 19:50:18.406809 1585042 nvc.c:296] loading kernel module nvidia_uvm
I1007 19:50:18.406870 1585042 nvc.c:300] running mknod for /dev/nvidia-uvm
I1007 19:50:18.406936 1585042 nvc.c:305] loading kernel module nvidia_modeset
I1007 19:50:18.406989 1585042 nvc.c:309] running mknod for /dev/nvidia-modeset
I1007 19:50:18.407279 1585043 rpc.c:71] starting driver rpc service
I1007 19:50:18.648751 1585049 rpc.c:71] starting nvcgo rpc service
I1007 19:50:18.706133 1585041 nvc_info.c:766] requesting driver information with ''
I1007 19:50:18.708076 1585041 nvc_info.c:173] selecting /usr/lib/libnvoptix.so.470.141.03
I1007 19:50:18.708181 1585041 nvc_info.c:173] selecting /usr/lib/libnvidia-tls.so.470.141.03
I1007 19:50:18.708256 1585041 nvc_info.c:173] selecting /usr/lib/libnvidia-rtcore.so.470.141.03
I1007 19:50:18.708332 1585041 nvc_info.c:173] selecting /usr/lib/libnvidia-ptxjitcompiler.so.470.141.03
I1007 19:50:18.708412 1585041 nvc_info.c:173] selecting /usr/lib/libnvidia-opticalflow.so.470.141.03
I1007 19:50:18.708489 1585041 nvc_info.c:173] selecting /usr/lib/libnvidia-opencl.so.470.141.03
I1007 19:50:18.708583 1585041 nvc_info.c:173] selecting /usr/lib/libnvidia-ngx.so.470.141.03
I1007 19:50:18.708658 1585041 nvc_info.c:175] skipping /usr/lib/libnvidia-ml.so.1.fix
I1007 19:50:18.708712 1585041 nvc_info.c:175] skipping /usr/lib/libnvidia-ml.so.1.fix
I1007 19:50:18.708766 1585041 nvc_info.c:175] skipping /usr/lib/libnvidia-ml.so.1.fix
I1007 19:50:18.708812 1585041 nvc_info.c:173] selecting /usr/lib/libnvidia-ifr.so.470.141.03
I1007 19:50:18.708888 1585041 nvc_info.c:173] selecting /usr/lib/libnvidia-glvkspirv.so.470.141.03
I1007 19:50:18.708962 1585041 nvc_info.c:173] selecting /usr/lib/libnvidia-glsi.so.470.141.03
I1007 19:50:18.709034 1585041 nvc_info.c:173] selecting /usr/lib/libnvidia-glcore.so.470.141.03
I1007 19:50:18.709107 1585041 nvc_info.c:173] selecting /usr/lib/libnvidia-fbc.so.470.141.03
I1007 19:50:18.709180 1585041 nvc_info.c:173] selecting /usr/lib/libnvidia-encode.so.470.141.03
I1007 19:50:18.709251 1585041 nvc_info.c:173] selecting /usr/lib/libnvidia-eglcore.so.470.141.03
I1007 19:50:18.709330 1585041 nvc_info.c:173] selecting /usr/lib/libnvidia-compiler.so.470.141.03
I1007 19:50:18.709402 1585041 nvc_info.c:173] selecting /usr/lib/libnvidia-cfg.so.470.141.03
I1007 19:50:18.709474 1585041 nvc_info.c:173] selecting /usr/lib/libnvidia-cbl.so.470.141.03
I1007 19:50:18.709546 1585041 nvc_info.c:173] selecting /usr/lib/libnvidia-allocator.so.470.141.03
I1007 19:50:18.709623 1585041 nvc_info.c:173] selecting /usr/lib/libnvcuvid.so.470.141.03
I1007 19:50:18.709990 1585041 nvc_info.c:173] selecting /usr/lib/libcuda.so.470.141.03
I1007 19:50:18.710305 1585041 nvc_info.c:173] selecting /usr/lib/libGLX_nvidia.so.470.141.03
I1007 19:50:18.710408 1585041 nvc_info.c:173] selecting /usr/lib/libGLESv2_nvidia.so.470.141.03
I1007 19:50:18.710491 1585041 nvc_info.c:173] selecting /usr/lib/libGLESv1_CM_nvidia.so.470.141.03
I1007 19:50:18.710576 1585041 nvc_info.c:173] selecting /usr/lib/libEGL_nvidia.so.470.141.03
I1007 19:50:18.710774 1585041 nvc_info.c:173] selecting /usr/lib32/libnvidia-tls.so.470.141.03
I1007 19:50:18.710847 1585041 nvc_info.c:173] selecting /usr/lib32/libnvidia-ptxjitcompiler.so.470.141.03
I1007 19:50:18.710926 1585041 nvc_info.c:173] selecting /usr/lib32/libnvidia-opticalflow.so.470.141.03
I1007 19:50:18.710997 1585041 nvc_info.c:173] selecting /usr/lib32/libnvidia-opencl.so.470.141.03
I1007 19:50:18.711074 1585041 nvc_info.c:173] selecting /usr/lib32/libnvidia-ml.so.470.141.03
I1007 19:50:18.711146 1585041 nvc_info.c:173] selecting /usr/lib32/libnvidia-ifr.so.470.141.03
I1007 19:50:18.711220 1585041 nvc_info.c:173] selecting /usr/lib32/libnvidia-glvkspirv.so.470.141.03
I1007 19:50:18.711294 1585041 nvc_info.c:173] selecting /usr/lib32/libnvidia-glsi.so.470.141.03
I1007 19:50:18.711367 1585041 nvc_info.c:173] selecting /usr/lib32/libnvidia-glcore.so.470.141.03
I1007 19:50:18.711443 1585041 nvc_info.c:173] selecting /usr/lib32/libnvidia-fbc.so.470.141.03
I1007 19:50:18.711520 1585041 nvc_info.c:173] selecting /usr/lib32/libnvidia-encode.so.470.141.03
I1007 19:50:18.711619 1585041 nvc_info.c:173] selecting /usr/lib32/libnvidia-eglcore.so.470.141.03
I1007 19:50:18.711702 1585041 nvc_info.c:173] selecting /usr/lib32/libnvidia-compiler.so.470.141.03
I1007 19:50:18.711773 1585041 nvc_info.c:173] selecting /usr/lib32/libnvidia-allocator.so.470.141.03
I1007 19:50:18.711846 1585041 nvc_info.c:173] selecting /usr/lib32/libnvcuvid.so.470.141.03
I1007 19:50:18.712032 1585041 nvc_info.c:173] selecting /usr/lib32/libcuda.so.470.141.03
I1007 19:50:18.712169 1585041 nvc_info.c:173] selecting /usr/lib32/libGLX_nvidia.so.470.141.03
I1007 19:50:18.712248 1585041 nvc_info.c:173] selecting /usr/lib32/libGLESv2_nvidia.so.470.141.03
I1007 19:50:18.712325 1585041 nvc_info.c:173] selecting /usr/lib32/libGLESv1_CM_nvidia.so.470.141.03
I1007 19:50:18.712427 1585041 nvc_info.c:173] selecting /usr/lib32/libEGL_nvidia.so.470.141.03
W1007 19:50:18.712497 1585041 nvc_info.c:399] missing library libnvidia-ml.so
W1007 19:50:18.712506 1585041 nvc_info.c:399] missing library libnvidia-nscq.so
W1007 19:50:18.712514 1585041 nvc_info.c:399] missing library libcudadebugger.so
W1007 19:50:18.712536 1585041 nvc_info.c:399] missing library libnvidia-fatbinaryloader.so
W1007 19:50:18.712544 1585041 nvc_info.c:399] missing library libnvidia-pkcs11.so
W1007 19:50:18.712553 1585041 nvc_info.c:399] missing library libvdpau_nvidia.so
W1007 19:50:18.712563 1585041 nvc_info.c:403] missing compat32 library libnvidia-cfg.so
W1007 19:50:18.712572 1585041 nvc_info.c:403] missing compat32 library libnvidia-nscq.so
W1007 19:50:18.712582 1585041 nvc_info.c:403] missing compat32 library libcudadebugger.so
W1007 19:50:18.712591 1585041 nvc_info.c:403] missing compat32 library libnvidia-fatbinaryloader.so
W1007 19:50:18.712601 1585041 nvc_info.c:403] missing compat32 library libnvidia-pkcs11.so
W1007 19:50:18.712611 1585041 nvc_info.c:403] missing compat32 library libnvidia-ngx.so
W1007 19:50:18.712620 1585041 nvc_info.c:403] missing compat32 library libvdpau_nvidia.so
W1007 19:50:18.712629 1585041 nvc_info.c:403] missing compat32 library libnvidia-rtcore.so
W1007 19:50:18.712638 1585041 nvc_info.c:403] missing compat32 library libnvoptix.so
W1007 19:50:18.712646 1585041 nvc_info.c:403] missing compat32 library libnvidia-cbl.so
I1007 19:50:18.713291 1585041 nvc_info.c:299] selecting /usr/bin/nvidia-smi
I1007 19:50:18.713319 1585041 nvc_info.c:299] selecting /usr/bin/nvidia-debugdump
I1007 19:50:18.713359 1585041 nvc_info.c:299] selecting /usr/bin/nvidia-persistenced
I1007 19:50:18.713406 1585041 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-control
I1007 19:50:18.713454 1585041 nvc_info.c:299] selecting /usr/bin/nvidia-cuda-mps-server
W1007 19:50:18.713827 1585041 nvc_info.c:425] missing binary nv-fabricmanager
I1007 19:50:18.713876 1585041 nvc_info.c:343] listing firmware path /usr/lib/firmware/nvidia/470.141.03/gsp.bin
I1007 19:50:18.713932 1585041 nvc_info.c:529] listing device /dev/nvidiactl
I1007 19:50:18.713941 1585041 nvc_info.c:529] listing device /dev/nvidia-uvm
I1007 19:50:18.713951 1585041 nvc_info.c:529] listing device /dev/nvidia-uvm-tools
I1007 19:50:18.713960 1585041 nvc_info.c:529] listing device /dev/nvidia-modeset
W1007 19:50:18.713999 1585041 nvc_info.c:349] missing ipc path /var/run/nvidia-persistenced/socket
W1007 19:50:18.714037 1585041 nvc_info.c:349] missing ipc path /var/run/nvidia-fabricmanager/socket
W1007 19:50:18.714066 1585041 nvc_info.c:349] missing ipc path /tmp/nvidia-mps
I1007 19:50:18.714076 1585041 nvc_info.c:822] requesting device information with ''
I1007 19:50:18.720035 1585041 nvc_info.c:713] listing device /dev/nvidia0 (GPU-6e529d4a-43f3-6d1a-c5c0-125845d08dfd at 00000000:01:00.0)
NVRM version: 470.141.03
CUDA version: 11.4

Device Index: 0
Device Minor: 0
Model: NVIDIA GeForce GTX 650
Brand: Quadro
GPU UUID: GPU-6e529d4a-43f3-6d1a-c5c0-125845d08dfd
Bus Location: 00000000:01:00.0
Architecture: 3.0
I1007 19:50:18.720144 1585041 nvc.c:434] shutting down library context
I1007 19:50:18.768871 1585049 rpc.c:95] terminating nvcgo rpc service
I1007 19:50:18.769886 1585041 rpc.c:132] nvcgo rpc service terminated successfully
I1007 19:50:18.806999 1585043 rpc.c:95] terminating driver rpc service
I1007 19:50:18.807556 1585041 rpc.c:132] driver rpc service terminated successfully

 - [x] Kernel version from `uname -a`

kyle@bently 03:56:13 /var/log $ uname -r
5.19.13-arch1-1

 - [x] Any relevant kernel output lines from `dmesg`

[259611.206876] docker0: port 1(veth14ba716) entered blocking state
[259611.206941] docker0: port 1(veth14ba716) entered disabled state
[259611.207399] device veth14ba716 entered promiscuous mode
[259611.207460] audit: type=1700 audit(1665172730.283:2222): dev=veth14ba716 prom=256 old_prom=0 auid=4294967295 uid=0 gid=0 ses=4294967295
[259611.207762] audit: type=1300 audit(1665172730.283:2222): arch=c000003e syscall=44 success=yes exit=40 a0=f a1=c0029313b0 a2=28 a3=0 items=0 ppid=1 pid=1722 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="dockerd" exe="/usr/bin/dockerd" key=(null)
[259611.207805] audit: type=1327 audit(1665172730.283:2222): proctitle=2F7573722F62696E2F646F636B657264002D480066643A2F2F
[259611.322435] audit: type=1334 audit(1665172730.400:2223): prog-id=451 op=LOAD
[259611.322844] audit: type=1334 audit(1665172730.400:2224): prog-id=452 op=LOAD
[259611.322850] audit: type=1300 audit(1665172730.400:2224): arch=c000003e syscall=321 success=yes exit=15 a0=5 a1=c0001ad7f8 a2=78 a3=0 items=0 ppid=1597175 pid=1597187 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="runc" exe="/usr/bin/runc" key=(null)
[259611.322854] audit: type=1327 audit(1665172730.400:2224): proctitle=72756E63002D2D726F6F74002F7661722F72756E2F646F636B65722F72756E74696D652D72756E632F6D6F6279002D2D6C6F67002F7661722F72756E2F646F636B65722F636F6E7461696E6572642F6461656D6F6E2F696F2E636F6E7461696E6572642E72756E74696D652E76322E7461736B2F6D6F62792F65633335383937
[259611.935274] docker0: port 1(veth14ba716) entered disabled state
[259611.936111] device veth14ba716 left promiscuous mode
[259611.936121] docker0: port 1(veth14ba716) entered disabled state

 - [x] Driver information from `nvidia-smi -a`

kyle@bently 03:59:08 /var/log $ sudo nvidia-smi -a

==============NVSMI LOG==============

Timestamp : Fri Oct 7 15:59:28 2022
Driver Version : 470.141.03
CUDA Version : 11.4

Attached GPUs : 1
GPU 00000000:01:00.0
Product Name : NVIDIA GeForce GTX 650
Product Brand : Quadro
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : N/A
GPU UUID : GPU-6e529d4a-43f3-6d1a-c5c0-125845d08dfd
Minor Number : 0
VBIOS Version : 80.07.35.00.54
MultiGPU Board : No
Board ID : 0x100
GPU Part Number : N/A
Module ID : 0
Inforom Version
Image Version : N/A
OEM Object : N/A
ECC Object : N/A
Power Management Object : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GSP Firmware Version : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x01
Device : 0x00
Domain : 0x0000
Device Id : 0x0FC610DE
Bus Id : 00000000:01:00.0
Sub System Id : 0x26513842
GPU Link Info
PCIe Generation
Max : 3
Current : 3
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : N/A
Rx Throughput : N/A
Fan Speed : 21 %
Performance State : P0
Clocks Throttle Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : N/A
HW Power Brake Slowdown : N/A
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
FB Memory Usage
Total : 1996 MiB
Used : 0 MiB
Free : 1996 MiB
BAR1 Memory Usage
Total : 256 MiB
Used : 2 MiB
Free : 254 MiB
Compute Mode : Default
Utilization
Gpu : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
Ecc Mode
Current : N/A
Pending : N/A
ECC Errors
Volatile
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Aggregate
Single Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Double Bit
Device Memory : N/A
Register File : N/A
L1 Cache : N/A
L2 Cache : N/A
Texture Memory : N/A
Texture Shared : N/A
CBU : N/A
Total : N/A
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows : N/A
Temperature
GPU Current Temp : 41 C
GPU Shutdown Temp : 103 C
GPU Slowdown Temp : 98 C
GPU Max Operating Temp : N/A
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
Power Readings
Power Management : N/A
Power Draw : N/A
Power Limit : N/A
Default Power Limit : N/A
Enforced Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Clocks
Graphics : 1058 MHz
SM : 1058 MHz
Memory : 2500 MHz
Video : 540 MHz
Applications Clocks
Graphics : N/A
Memory : N/A
Default Applications Clocks
Graphics : N/A
Memory : N/A
Max Clocks
Graphics : 1058 MHz
SM : 1058 MHz
Memory : 2500 MHz
Video : 540 MHz
Max Customer Boost Clocks
Graphics : N/A
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Processes : None

 - [x] Docker version from `docker version`

kyle@bently 03:59:28 /var/log $ sudo docker version
Client:
 Version: 20.10.18
 API version: 1.41
 Go version: go1.19.1
 Git commit: b40c2f6b5d
 Built: Sat Sep 10 11:31:10 2022
 OS/Arch: linux/amd64
 Context: default
 Experimental: true

Server:
 Engine:
  Version: 20.10.18
  API version: 1.41 (minimum version 1.12)
  Go version: go1.19.1
  Git commit: e42327a6d3
  Built: Sat Sep 10 11:30:17 2022
  OS/Arch: linux/amd64
  Experimental: false
 containerd:
  Version: v1.6.8
  GitCommit: 9cd3357b7fd7218e4aec3eae239db1f68a5a6ec6.m
 runc:
  Version: 1.1.4
  GitCommit:
 docker-init:
  Version: 0.19.0
  GitCommit: de40ad0

 - [x] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`

kyle@bently 04:00:59 /var/log $ sudo pacman -Q | grep nvidia
lib32-nvidia-470xx-utils 470.141.03-1
lib32-opencl-nvidia-470xx 470.141.03-1
libnvidia-container 1.11.0-1
libnvidia-container-tools 1.11.0-1
nvidia-470xx-dkms 470.141.03-1
nvidia-470xx-utils 470.141.03-1
nvidia-container-runtime-bin 3.5.0-2
nvidia-container-toolkit 1.11.0-1
opencl-nvidia-470xx 470.141.03-1

 - [x] NVIDIA container library version from `nvidia-container-cli -V`

kyle@bently 04:01:30 /var/log $ sudo nvidia-container-cli -V
cli-version: 1.11.0
lib-version: 1.11.0
build date: 2022-10-01T02:17+00:00
build revision: 0000000000000000000000000000000000000000
build compiler: gcc 12.2.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -I/usr/include/tirpc -march=x86-64 -mtune=generic -O2 -pipe -fno-plt -fexceptions -Wp,-D_FORTIFY_SOURCE=2 -Wformat -Werror=format-security -fstack-clash-protection -fcf-protection -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now


 - [x] NVIDIA container library logs (see [troubleshooting](https://github.com/NVIDIA/nvidia-docker/wiki/Troubleshooting))
   - https://termbin.com/rnmc
 - [x] Docker command, image and tag used
`sudo docker run --gpus all nvidia/cuda:11.3.0-runtime-ubuntu20.04 nvidia-smi`
wasdee commented 1 year ago

+1

This is working for me:

sudo docker run --gpus all nvidia/cuda:11.3.0-runtime-ubuntu20.04 nvidia-smi

However, it fails when I try to build TensorRT:

./docker/build.sh --file docker/ubuntu-20.04.Dockerfile --tag tensorrt-ubuntu20.04-cuda11.7

I have already changed my cgroup to v2.
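For reference, one common way to switch a systemd-based host to the unified cgroup v2 hierarchy is via a kernel parameter; the GRUB paths below are the usual Arch ones and are only an illustration, not taken from this thread:

# Add the systemd switch to the kernel command line, regenerate the GRUB config, and reboot
sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&systemd.unified_cgroup_hierarchy=1 /' /etc/default/grub
sudo grub-mkconfig -o /boot/grub/grub.cfg
sudo reboot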

wasdee commented 1 year ago

I solved my error with this.

chenyg0911 commented 6 months ago

FYI, my solution:

  1. In the lxd/incus container, edit /etc/nvidia-container-runtime/config.toml and set "no-cgroups = true".
  2. On the host, run: lxc config device add $CONTAINER volume-name disk path=/proc/driver/nvidia/gpus/0000:01:00.0 source=/proc/driver/nvidia/gpus/0000:01:00.0 (adjust the path to match your GPU's bus location).

That's all. The GPU now works in Docker containers inside the lxd/incus container; a command sketch follows below.
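A minimal sketch of those two steps, keeping the device name from the comment above and treating $CONTAINER and the 0000:01:00.0 bus address as placeholders for your own setup:

# Step 1, inside the lxd/incus container: stop the NVIDIA hook from managing device cgroups
sudo sed -i 's/^#\?no-cgroups = .*/no-cgroups = true/' /etc/nvidia-container-runtime/config.toml

# Step 2, on the host: pass the GPU's /proc entry through to the container
lxc config device add $CONTAINER volume-name disk \
    path=/proc/driver/nvidia/gpus/0000:01:00.0 \
    source=/proc/driver/nvidia/gpus/0000:01:00.0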

alanocallaghan commented 6 days ago

I just set no-cgroups = true in the host system's /etc/nvidia-container-runtime/config.toml and this seems to work, thanks @chenyg0911.
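Concretely, the change is one line in the [nvidia-container-cli] section of the host's /etc/nvidia-container-runtime/config.toml, sketched below; restarting Docker afterwards so it picks up the change is assumed:

# /etc/nvidia-container-runtime/config.toml (only the no-cgroups line is edited)
#   [nvidia-container-cli]
#   no-cgroups = true

sudo systemctl restart docker
sudo docker run --rm --gpus all nvidia/cuda:11.3.0-runtime-ubuntu20.04 nvidia-smi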