canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

NVIDIA-SMI couldn't find libnvidia-ml.so library in your system #9964

Closed: trippwill closed this issue 2 years ago

trippwill commented 2 years ago

Required information

Issue description

I'm attempting to create a new container with NVIDIA GPU passthrough.

I was following this tutorial (with a Debian image instead of Ubuntu): https://ubuntu.com/tutorials/gpu-data-processing-inside-lxd#1-overview

Actual: lxc exec fred -- nvidia-smi fails with an error message indicating a library could not be found.

Expected: normal nvidia-smi output from inside the container.

Steps to reproduce

  1. Nvidia drivers 470 and CUDA toolkit 11-4 are already installed and confirmed working on the host using bandwidthTest and nvidia-smi:

Wed Feb 23 19:29:59 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:25:00.0 N/A |                  N/A |
| N/A   39C    P8    N/A /  N/A |      1MiB /  1999MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P620         On   | 00000000:27:00.0 Off |                  N/A |
| 34%   34C    P8    N/A /  N/A |      1MiB /  2000MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

  2. $ lxc launch images:debian/bullseye fred -c nvidia.runtime=true
     Creating fred
     Starting fred
  3. $ lxc config device add fred gpu gpu
     Device gpu added to fred
  4. $ lxc exec fred -- nvidia-smi
     NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system. Please also try adding directory that contains libnvidia-ml.so to your system PATH.

The result is the same when running as root, or when using an Ubuntu image for the container.

I saw this issue: https://github.com/lxc/lxd/issues/7840, but unmounting /proc/driver/nvidia inside the container didn't make a difference.
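
For reference, this is roughly what I tried from that issue, run against the fred container from the steps above (exact commands are an approximation of what I did):

# unmount the tmpfs LXD places over /proc/driver/nvidia, then retry
lxc exec fred -- umount /proc/driver/nvidia
lxc exec fred -- nvidia-smi

nvidia-smi still failed with the same library-not-found error afterwards.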

Information to attach

fred-log.txt
fred-configuration.txt
lxd-log.txt
fred-debug-output.txt

stgraber commented 2 years ago

Can you try with lxc launch images:ubuntu/20.04 foo -c nvidia.runtime=true instead and see if that gives you the same result? If it does, can you show cat /proc/mounts from inside the container?

stgraber commented 2 years ago

Googling the error, it looks like it may have to do with the linker inside the container not behaving quite the same way as the linker on the host, but it could also be that nvidia-container somehow didn't pass in that particular library for some reason.
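
One thing that could help narrow it down (a quick check, assuming the failing container is still called fred): see whether the library file is actually mapped in but simply missing from the dynamic linker's cache.

# does the linker cache inside the container know about the library?
lxc exec fred -- /sbin/ldconfig -p | grep libnvidia-ml
# is the file itself present where nvidia-container mapped it?
lxc exec fred -- ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*

If the file is there but the cache doesn't list it, that would match the error nvidia-smi is printing.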

trippwill commented 2 years ago

I should have mentioned that I already tried it with Ubuntu, but I gave it another go just now. Same results:

$ lxc launch images:ubuntu/20.04 foo -c nvidia.runtime=true
Creating foo
Starting foo

$ lxc exec foo -- nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system. Please also try adding directory that contains libnvidia-ml.so to your system PATH.

$ lxc exec foo -- cat /proc/mounts
tank/lxd-default/containers/foo / zfs rw,relatime,xattr,posixacl 0 0
none /dev tmpfs rw,relatime,size=492k,mode=755,uid=1000000,gid=1000000 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
udev /dev/fuse devtmpfs rw,nosuid,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
udev /dev/net/tun devtmpfs rw,nosuid,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,nosuid,nodev,noexec,relatime 0 0
efivarfs /sys/firmware/efi/efivars efivarfs rw,nosuid,nodev,noexec,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,nosuid,nodev,noexec,relatime 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
configfs /sys/kernel/config configfs rw,nosuid,nodev,noexec,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,nosuid,nodev,noexec,relatime 0 0
tracefs /sys/kernel/debug/tracing tracefs rw,nosuid,nodev,noexec,relatime 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tracefs /sys/kernel/tracing tracefs rw,nosuid,nodev,noexec,relatime 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/lxd tmpfs rw,relatime,size=100k,mode=755 0 0
tmpfs /dev/.lxd-mounts tmpfs rw,relatime,size=100k,mode=711 0 0
none /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0
lxcfs /proc/cpuinfo fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/diskstats fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/loadavg fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/meminfo fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/stat fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/swaps fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/uptime fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /sys/devices/system/cpu/online fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
tmpfs /proc/driver/nvidia tmpfs rw,nosuid,nodev,noexec,relatime,mode=555,uid=1000000,gid=1000000 0 0
/dev/sda2 /usr/bin/nvidia-smi ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/bin/nvidia-debugdump ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/bin/nvidia-persistenced ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/bin/nvidia-cuda-mps-control ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/bin/nvidia-cuda-mps-server ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.103.01 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.103.01 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libcuda.so.470.103.01 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.103.01 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.103.01 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.103.01 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.103.01 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/lib/firmware/nvidia/470.103.01/gsp.bin ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
tmpfs /run/nvidia-persistenced/socket tmpfs rw,nosuid,nodev,noexec,relatime,size=3286584k,mode=755 0 0
udev /dev/nvidiactl devtmpfs ro,nosuid,noexec,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
udev /dev/nvidia-uvm devtmpfs ro,nosuid,noexec,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
udev /dev/nvidia-uvm-tools devtmpfs ro,nosuid,noexec,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
udev /dev/full devtmpfs rw,nosuid,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
udev /dev/null devtmpfs rw,nosuid,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
udev /dev/random devtmpfs rw,nosuid,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
udev /dev/tty devtmpfs rw,nosuid,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
udev /dev/urandom devtmpfs rw,nosuid,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
udev /dev/zero devtmpfs rw,nosuid,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=1000005,mode=620,ptmxmode=666,max=1024 0 0
devpts /dev/ptmx devpts rw,nosuid,noexec,relatime,gid=1000005,mode=620,ptmxmode=666,max=1024 0 0
devpts /dev/console devpts rw,nosuid,noexec,relatime,gid=1000005,mode=620,ptmxmode=666,max=1024 0 0
none /proc/sys/kernel/random/boot_id tmpfs ro,nosuid,nodev,noexec,relatime,size=492k,mode=755,uid=1000000,gid=1000000 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,uid=1000000,gid=1000000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,size=3286584k,mode=755,uid=1000000,gid=1000000 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,uid=1000000,gid=1000000 0 0

trippwill commented 2 years ago

I do wonder if it's related to the fact that I'm on the 470 drivers and CUDA 11.4. Unfortunately, the GT710 is only supported up to the 470 drivers.

stgraber commented 2 years ago

The library appears to be passed through properly. Can you show:

ldd /usr/bin/nvidia-smi
ldd /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470
trippwill commented 2 years ago

$ ldd /usr/bin/nvidia-smi
        linux-vdso.so.1 (0x00007ffe715b8000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fac92279000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fac92273000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fac92081000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fac92076000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fac922a1000)

$ ldd /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470
ldd: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470: No such file or directory

$ ldd /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.103.01
        linux-vdso.so.1 (0x00007ffea41d3000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffb739ce000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ffb739c8000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffb737d6000)
        /lib64/ld-linux-x86-64.so.2 (0x00007ffb7408e000)

Don't know if this is relevant:

root@foo:/# find -L / ( -type d -name proc -o -type d -name sys ) -prune -o -type f -name libnvidia-ml*
/sys
/proc
find: File system loop detected; ‘/dev/fd/3’ is part of the same file system loop as ‘/’.
find: ‘/dev/fd/4’: No such file or directory
find: ‘/dev/.lxd-mounts’: Permission denied
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.103.01
/lib/x86_64-linux-gnu/libnvidia-ml.so.470.103.01

stgraber commented 2 years ago

Taking a look at this now on our test system, which also uses a GT710 or a similarly old GPU.

stgraber commented 2 years ago

My setup is Ubuntu 20.04 server with the stock 5.4 kernel using:

apt-get update
apt-get dist-upgrade --yes
apt-get install linux-generic --yes
apt-get remove --purge --yes linux.*hwe.* --yes
apt-get install nvidia-utils-470 linux-modules-nvidia-470-generic libnvidia-compute-470 --yes

echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf
echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist-nouveau.conf
sudo update-initramfs -u

After doing that and rebooting:

root@vm12:~# nvidia-smi 
Thu Feb 24 18:42:13 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:06:00.0 N/A |                  N/A |
| 56%   64C    P0    N/A /  N/A |      0MiB /  2002MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:07:00.0 N/A |                  N/A |
| 54%   61C    P0    N/A /  N/A |      0MiB /  2002MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@vm12:~# snap refresh lxd --channel=latest
lxd 4.23 from Canonical✓ refreshed
root@vm12:~# lxd init --auto
root@vm12:~# 

And now testing a container:

root@vm12:~# lxc launch images:debian/bullseye fred -c nvidia.runtime=true
Creating fred
Starting fred                                 
root@vm12:~# lxc config device add fred gpu gpu
Device gpu added to fred
root@vm12:~# lxc exec fred -- nvidia-smi
Thu Feb 24 18:45:13 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:06:00.0 N/A |                  N/A |
| 36%   55C    P0    N/A /  N/A |      0MiB /  2002MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  Off  | 00000000:07:00.0 N/A |                  N/A |
| 35%   52C    P0    N/A /  N/A |      0MiB /  2002MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@vm12:~# 
stgraber commented 2 years ago

Mount table here looks like this:

root@vm12:~# lxc exec fred -- cat /proc/mounts
/dev/sda2 / ext4 rw,relatime 0 0
none /dev tmpfs rw,relatime,size=492k,mode=755,uid=1000000,gid=1000000 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
udev /dev/fuse devtmpfs rw,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
udev /dev/net/tun devtmpfs rw,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,nosuid,nodev,noexec,relatime 0 0
efivarfs /sys/firmware/efi/efivars efivarfs rw,nosuid,nodev,noexec,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,nosuid,nodev,noexec,relatime 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
configfs /sys/kernel/config configfs rw,nosuid,nodev,noexec,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,nosuid,nodev,noexec,relatime 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tracefs /sys/kernel/tracing tracefs rw,nosuid,nodev,noexec,relatime 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/lxd tmpfs rw,relatime,size=100k,mode=755 0 0
tmpfs /dev/.lxd-mounts tmpfs rw,relatime,size=100k,mode=711 0 0
lxcfs /proc/cpuinfo fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/diskstats fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/loadavg fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/meminfo fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/stat fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/swaps fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/uptime fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /sys/devices/system/cpu/online fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
tmpfs /proc/driver/nvidia tmpfs rw,nosuid,nodev,noexec,relatime,mode=555,uid=1000000,gid=1000000 0 0
/dev/sda2 /usr/bin/nvidia-smi ext4 ro,nosuid,nodev,relatime 0 0
/dev/sda2 /usr/bin/nvidia-debugdump ext4 ro,nosuid,nodev,relatime 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.103.01 ext4 ro,nosuid,nodev,relatime 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libcuda.so.470.103.01 ext4 ro,nosuid,nodev,relatime 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.103.01 ext4 ro,nosuid,nodev,relatime 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.103.01 ext4 ro,nosuid,nodev,relatime 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.103.01 ext4 ro,nosuid,nodev,relatime 0 0
/dev/sda2 /usr/lib/firmware/nvidia/470.103.01/gsp.bin ext4 ro,nosuid,nodev,relatime 0 0
udev /dev/nvidiactl devtmpfs ro,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
udev /dev/nvidia-uvm devtmpfs ro,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
udev /dev/nvidia-uvm-tools devtmpfs ro,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
udev /dev/full devtmpfs rw,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
udev /dev/null devtmpfs rw,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
udev /dev/random devtmpfs rw,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
udev /dev/tty devtmpfs rw,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
udev /dev/urandom devtmpfs rw,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
udev /dev/zero devtmpfs rw,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
/proc/self/fd/43 /dev/pts devpts rw,nosuid,noexec,relatime,gid=1000005,mode=620,ptmxmode=666,max=1024 0 0
/proc/self/fd/43 /dev/ptmx devpts rw,nosuid,noexec,relatime,gid=1000005,mode=620,ptmxmode=666,max=1024 0 0
/proc/self/fd/43 /dev/console devpts rw,nosuid,noexec,relatime,gid=1000005,mode=620,ptmxmode=666,max=1024 0 0
none /proc/sys/kernel/random/boot_id tmpfs ro,nosuid,nodev,noexec,relatime,size=492k,mode=755,uid=1000000,gid=1000000 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,uid=1000000,gid=1000000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,size=1628708k,nr_inodes=819200,mode=755,uid=1000000,gid=1000000 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,uid=1000000,gid=1000000 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755,uid=1000000,gid=1000000 0 0
cgroup2 /sys/fs/cgroup/unified cgroup2 rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,name=systemd 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset,clone_children 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
/dev/sda2 /dev/dri/card0 ext4 rw,relatime 0 0
/dev/sda2 /dev/dri/renderD128 ext4 rw,relatime 0 0
/dev/sda2 /dev/dri/card1 ext4 rw,relatime 0 0
/dev/sda2 /dev/dri/renderD129 ext4 rw,relatime 0 0
/dev/sda2 /dev/nvidia0 ext4 rw,relatime 0 0
/dev/sda2 /dev/dri/card2 ext4 rw,relatime 0 0
/dev/sda2 /dev/dri/renderD130 ext4 rw,relatime 0 0
/dev/sda2 /dev/nvidia1 ext4 rw,relatime 0 0
stgraber commented 2 years ago
root@vm12:~# dpkg -l | grep -i nvidia
ii  libnvidia-compute-470:amd64                470.103.01-0ubuntu0.20.04.1           amd64        NVIDIA libcompute package
ii  linux-modules-nvidia-470-5.4.0-100-generic 5.4.0-100.113                         amd64        Linux kernel nvidia modules for version 5.4.0-100
ii  linux-modules-nvidia-470-generic           5.4.0-100.113                         amd64        Extra drivers for nvidia-470 for the generic flavour
ii  linux-objects-nvidia-470-5.4.0-100-generic 5.4.0-100.113                         amd64        Linux kernel nvidia modules for version 5.4.0-100 (objects)
ii  linux-signatures-nvidia-5.4.0-100-generic  5.4.0-100.113                         amd64        Linux kernel signatures for nvidia modules for version 5.4.0-100-generic
ii  nvidia-kernel-common-470                   470.103.01-0ubuntu0.20.04.1           amd64        Shared files used with the kernel module
ii  nvidia-utils-470                           470.103.01-0ubuntu0.20.04.1           amd64        NVIDIA driver support binaries
root@vm12:~# 
trippwill commented 2 years ago

My host is Debian Bullseye, with the NVIDIA drivers and CUDA installed from the NVIDIA apt network repo using their instructions.

My storage driver is ZFS.

I've triple-checked that nouveau is not in use.

I now remember that I did a runfile installation on the host some time ago. I'll follow the NVIDIA instructions to do a complete uninstall of both the packaged driver and the prior runfile install.

trippwill commented 2 years ago

No joy. Same results whether I do a package installation or a runfile installation. The host detects the GPUs just fine, but the container just can't locate the required libraries.

Short of changing the host to Ubuntu, I don't know what else to try.

trippwill commented 2 years ago

This looks like an nvidia-container-toolkit on Debian 11 kind of issue. Same result with a podman container:

podman run --rm --security-opt=label=disable \
    --hooks-dir=/usr/share/containers/oci/hooks.d/ \
    nvidia/cuda:11.0-base nvidia-smi

✔ docker.io/nvidia/cuda:11.0-base
Trying to pull docker.io/nvidia/cuda:11.0-base...
Getting image source signatures
Copying blob b66c17bbf772 done
Copying blob 54ee1f796a1e done
Copying blob 46d371e02073 done
Copying blob f7bfea53ad12 done
Copying blob e5ce55b8b4b9 done
Copying blob 3642f1a6dfb3 done
Copying blob 155bc0332b0a done
Copying config 2ec708416b done
Writing manifest to image destination
Storing signatures
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system. Please also try adding directory that contains libnvidia-ml.so to your system PATH.

trippwill commented 2 years ago

FYI: I got this working for a podman container running Debian, but only by running it rootful and privileged.
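
Roughly the invocation that worked; "rootful and privileged" here means running it as root with --privileged, so treat the exact flags as an approximation from memory:

# run as root with the privileged flag, same hooks dir as before
sudo podman run --rm --privileged \
    --hooks-dir=/usr/share/containers/oci/hooks.d/ \
    nvidia/cuda:11.0-base nvidia-smi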

stgraber commented 2 years ago

That's pretty odd. Could you try running strace -f nvidia-smi in the container and see what may be going on there?

trippwill commented 2 years ago

It's really odd. I created a new container with all the flags needed.

I ran nvidia-smi, and I got the same error about the library.

I did an apt update (not upgrade) and installed strace

I ran 'strace nvidia-smi' and got output!

After running it under strace, nvidia-smi now works fine.

trippwill commented 2 years ago
  1. Start container with nvidia oci hooks
  2. Exec -it bash
  3. nvidia-smi: error
  4. apt update
  5. nvidia-smi: error
  6. apt install strace
  7. nvidia smi works!

There seems to be a dependency on strace, which isn't installed by default in Debian container images. Now I'm wondering: if I install strace on the host, will that fix it?

trippwill commented 2 years ago

Did a quick test: strace on the host is not enough. It has to be installed in the container.

stgraber commented 2 years ago

Ah, can you get a broken container and run ldconfig inside of it, see if that fixes it?

trippwill commented 2 years ago

$ lxc exec fred ldconfig
ldconfig: File /lib/x86_64-linux-gnu/libnvidia-cfg.so.470.57.02 is empty, not checked.
ldconfig: File /lib/x86_64-linux-gnu/libnvidia-allocator.so.470.57.02 is empty, not checked.
ldconfig: File /lib/x86_64-linux-gnu/libnvidia-opencl.so.470.57.02 is empty, not checked.
ldconfig: File /lib/x86_64-linux-gnu/libnvidia-compiler.so.470.57.02 is empty, not checked.
ldconfig: File /lib/x86_64-linux-gnu/libnvidia-ml.so.470.57.02 is empty, not checked.
ldconfig: File /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.57.02 is empty, not checked.
ldconfig: File /lib/x86_64-linux-gnu/libcuda.so.470.57.02 is empty, not checked.

$ lxc exec fred nvidia-smi
Fri Feb 25 15:40:55 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01   Driver Version: 470.103.01   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:25:00.0 N/A |                  N/A |
| N/A   42C    P0    N/A /  N/A |      0MiB /  1999MiB |     N/A      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P620         Off  | 00000000:27:00.0 Off |                  N/A |
| 32%   43C    P0    N/A /  N/A |      0MiB /  2000MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

:-)

stgraber commented 2 years ago

I wonder if it's just an issue of the linker cache being wrong somehow, and installing something in the container triggered an ldconfig run which then fixed it.

I seem to remember the nvidia-container stuff having quite a bit of smarts around ldconfig, but they need to use the host's ldconfig since the container's can't be trusted. This may explain why Ubuntu on the host works, as that's likely what they actively test on, while Debian's version may behave differently and not work properly.
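
If memory serves, which ldconfig gets run is controlled by the ldconfig entry in the toolkit's config on the host, where a leading "@" means "use the binary from the host"; the exact path and value below are from memory, so treat them as assumptions, but comparing that entry between the Debian and Ubuntu packaging might be telling:

# on the host
grep ldconfig /etc/nvidia-container-runtime/config.toml
# Ubuntu's packaging typically points at the real binary, e.g.:
# ldconfig = "@/sbin/ldconfig.real"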

stgraber commented 2 years ago

Alright, well, sounds like we found the culprit :)

It's not something we can really do anything about in LXD because, similarly to nvidia-container, we can't trust the containers and so wouldn't want to directly call any of their binaries to work around this issue. At least you have a quick and easy workaround for this one!
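
For anyone else landing here with a Debian host, a minimal sketch of automating the workaround (assuming the container is called fred and runs systemd; the unit name is just an example) is a one-shot unit that rebuilds the linker cache at boot, plus a manual ldconfig for the instance that's already running:

# create a one-shot unit inside the container that runs ldconfig at boot
lxc exec fred -- sh -c 'cat > /etc/systemd/system/nvidia-ldcache.service <<EOF
[Unit]
Description=Refresh linker cache for NVIDIA libraries mapped in by LXD

[Service]
Type=oneshot
ExecStart=/sbin/ldconfig

[Install]
WantedBy=multi-user.target
EOF
systemctl enable nvidia-ldcache.service'

# fix the currently running instance right away
lxc exec fred -- ldconfig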