NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs

Failure to resolve symlinks for firmware in CDI spec generation #671

Open elezar opened 2 weeks ago

elezar commented 2 weeks ago

When resolving firmware paths, we don't seem to resolve symlinks. This may cause issues on systems where /lib -> /usr/lib.

See https://github.com/canonical/lxd/pull/13562/files#r1701610711
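To illustrate the symlink concern, here is a minimal Python sketch, not the toolkit's actual implementation. It builds a throwaway directory tree in which lib is a symlink to usr/lib and shows how a firmware path reached through the symlink differs from its resolved form. The helper names resolve_firmware_path and demo are hypothetical, and the file name is borrowed from the listing later in this thread.

```python
import os
import tempfile


def resolve_firmware_path(path: str) -> str:
    """Return the path with every symlink component resolved (hypothetical helper)."""
    return os.path.realpath(path)


def demo() -> tuple[str, str]:
    """Build a temporary tree where lib -> usr/lib and resolve a firmware path in it."""
    # Resolve the temp dir itself so the comparison is not skewed by a symlinked /tmp.
    root = os.path.realpath(tempfile.mkdtemp())
    firmware_dir = os.path.join(root, "usr", "lib", "firmware", "nvidia", "535.183.01")
    os.makedirs(firmware_dir)

    # Mimic a merged-usr layout: <root>/lib is a symlink to <root>/usr/lib.
    os.symlink(os.path.join(root, "usr", "lib"), os.path.join(root, "lib"))

    # Create an empty placeholder standing in for a firmware blob.
    open(os.path.join(firmware_dir, "gsp_tu10x.bin"), "wb").close()

    via_symlink = os.path.join(
        root, "lib", "firmware", "nvidia", "535.183.01", "gsp_tu10x.bin"
    )
    return via_symlink, resolve_firmware_path(via_symlink)


if __name__ == "__main__":
    unresolved, resolved = demo()
    print("as discovered:", unresolved)
    print("resolved:     ", resolved)
```

If a generated spec recorded only the unresolved path, a consumer that compares or mounts paths literally could disagree with one that uses the resolved path, which is the kind of mismatch described above.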

gfrankliu commented 2 weeks ago

In my issue https://github.com/NVIDIA/nvidia-container-toolkit/issues/672, where the cloud image has the NVIDIA driver installed in /var/lib/nvidia/, nvidia-container-cli -k -d /dev/tty info also complained:

W0829 16:57:45.375509 1151 nvc_info.c:470] missing firmware path /usr/lib/firmware/nvidia/535.183.01/gsp*.bin

The firmware is actually located under the NVIDIA_ROOT (/var/lib/nvidia):

gfrankliu-t4-ws ➜  ~ ls -l /var/lib/nvidia
total 427620
-rw-r--r-- 1 root root 341725273 Aug 29 18:47 NVIDIA-Linux-x86_64-535.183.01.run
drwxr-xr-x 2 root root      4096 Aug 29 18:47 bin
drwxr-xr-x 3 root root      4096 Aug 29 18:47 bin-workdir
drwxr-xr-x 2 root root      4096 Aug 10 14:54 drivers
drwxr-xr-x 3 root root      4096 Aug 29 18:47 drivers-workdir
drwxr-xr-x 3 root root      4096 Aug 10 14:54 firmware
-rw-r--r-- 1 root root      2970 Aug 29 18:47 gpu_driver_versions.bin
drwxr-xr-x 5 root root      4096 Aug 29 18:47 lib64
drwxr-xr-x 3 root root      4096 Aug 29 18:47 lib64-workdir
-rw-r--r-- 1 root root  96106018 Aug 29 18:47 nvidia-drivers-535.183.01.tgz
-rw-r--r-- 1 root root      2355 Aug 29 18:47 nvidia-installer.log
drwxr-xr-x 4 root root      4096 Aug 29 18:47 share
gfrankliu-t4-ws ➜  ~ ls -l /var/lib/nvidia/firmware/nvidia/535.183.01 
total 60540
-rw-r--r-- 1 1000 250 38159904 May 12 19:08 gsp_ga10x.bin
-rw-r--r-- 1 1000 250 23820576 May 12 19:08 gsp_tu10x.bin
gfrankliu-t4-ws ➜  ~ 

Does nvidia-container-toolkit only support the NVIDIA driver when it is installed in the default location?
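For illustration only, the lookup being asked about could be sketched in Python as a search over several candidate firmware directories rather than a single default one. This is an assumption about one possible approach, not a description of how nvidia-container-toolkit or nvidia-container-cli behaves. The function name find_gsp_firmware and the candidate list are hypothetical; the gsp*.bin pattern and the version directory come from the warning and listing above.

```python
import glob
import os


def find_gsp_firmware(firmware_dirs, driver_version):
    """Return sorted, symlink-resolved GSP firmware files from any candidate directory.

    firmware_dirs is a list of directories expected to contain nvidia/<version>/gsp*.bin.
    Directories that do not exist simply contribute no matches.
    """
    found = set()
    for base in firmware_dirs:
        pattern = os.path.join(base, "nvidia", driver_version, "gsp*.bin")
        for match in glob.glob(pattern):
            # Resolving symlinks also de-duplicates /lib vs /usr/lib on merged-usr systems.
            found.add(os.path.realpath(match))
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical candidate list: two common defaults plus the custom driver root.
    candidates = ["/lib/firmware", "/usr/lib/firmware", "/var/lib/nvidia/firmware"]
    for path in find_gsp_firmware(candidates, "535.183.01"):
        print(path)
```

On the machine shown above, such a search would be expected to pick up the two files under /var/lib/nvidia/firmware, provided that directory is included among the candidates.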