Can you try with lxc launch images:ubuntu/20.04 foo -c nvidia.runtime=true
instead and see if that gives you the same result?
If it does, can you show cat /proc/mounts
from inside the container?
Googling the error, it looks like it may have to do with the linker inside the container not working quite the same way as the linker on the host. It could also be that nvidia-container somehow didn't pass in that particular library.
Should have mentioned I already tried it with Ubuntu, but I gave it another go just now. Same results:
$ lxc launch images:ubuntu/20.04 foo -c nvidia.runtime=true
Creating foo
Starting foo
$ lxc exec foo -- nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that
the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
$ lxc exec foo -- cat /proc/mounts
tank/lxd-default/containers/foo / zfs rw,relatime,xattr,posixacl 0 0
none /dev tmpfs rw,relatime,size=492k,mode=755,uid=1000000,gid=1000000 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
udev /dev/fuse devtmpfs rw,nosuid,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
udev /dev/net/tun devtmpfs rw,nosuid,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,nosuid,nodev,noexec,relatime 0 0
efivarfs /sys/firmware/efi/efivars efivarfs rw,nosuid,nodev,noexec,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,nosuid,nodev,noexec,relatime 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
configfs /sys/kernel/config configfs rw,nosuid,nodev,noexec,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,nosuid,nodev,noexec,relatime 0 0
tracefs /sys/kernel/debug/tracing tracefs rw,nosuid,nodev,noexec,relatime 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tracefs /sys/kernel/tracing tracefs rw,nosuid,nodev,noexec,relatime 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/lxd tmpfs rw,relatime,size=100k,mode=755 0 0
tmpfs /dev/.lxd-mounts tmpfs rw,relatime,size=100k,mode=711 0 0
none /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0
lxcfs /proc/cpuinfo fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/diskstats fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/loadavg fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/meminfo fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/stat fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/swaps fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/uptime fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /sys/devices/system/cpu/online fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
tmpfs /proc/driver/nvidia tmpfs rw,nosuid,nodev,noexec,relatime,mode=555,uid=1000000,gid=1000000 0 0
/dev/sda2 /usr/bin/nvidia-smi ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/bin/nvidia-debugdump ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/bin/nvidia-persistenced ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/bin/nvidia-cuda-mps-control ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/bin/nvidia-cuda-mps-server ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.103.01 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.470.103.01 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libcuda.so.470.103.01 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.103.01 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.103.01 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.470.103.01 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.103.01 ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
/dev/sda2 /usr/lib/firmware/nvidia/470.103.01/gsp.bin ext4 ro,nosuid,nodev,relatime,errors=remount-ro 0 0
tmpfs /run/nvidia-persistenced/socket tmpfs rw,nosuid,nodev,noexec,relatime,size=3286584k,mode=755 0 0
udev /dev/nvidiactl devtmpfs ro,nosuid,noexec,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
udev /dev/nvidia-uvm devtmpfs ro,nosuid,noexec,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
udev /dev/nvidia-uvm-tools devtmpfs ro,nosuid,noexec,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
udev /dev/full devtmpfs rw,nosuid,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
udev /dev/null devtmpfs rw,nosuid,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
udev /dev/random devtmpfs rw,nosuid,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
udev /dev/tty devtmpfs rw,nosuid,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
udev /dev/urandom devtmpfs rw,nosuid,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
udev /dev/zero devtmpfs rw,nosuid,relatime,size=16380460k,nr_inodes=4095115,mode=755 0 0
devpts /dev/pts devpts rw,nosuid,noexec,relatime,gid=1000005,mode=620,ptmxmode=666,max=1024 0 0
devpts /dev/ptmx devpts rw,nosuid,noexec,relatime,gid=1000005,mode=620,ptmxmode=666,max=1024 0 0
devpts /dev/console devpts rw,nosuid,noexec,relatime,gid=1000005,mode=620,ptmxmode=666,max=1024 0 0
none /proc/sys/kernel/random/boot_id tmpfs ro,nosuid,nodev,noexec,relatime,size=492k,mode=755,uid=1000000,gid=1000000 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,uid=1000000,gid=1000000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,size=3286584k,mode=755,uid=1000000,gid=1000000 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,uid=1000000,gid=1000000 0 0
I do wonder if it's related to the fact I'm on the 470 drivers and CUDA 11.4. Unfortunately the GT710 is only supported up to the 470 driver series.
The library appears to be passed through properly. Can you show:
ldd /usr/bin/nvidia-smi
ldd /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470
$ ldd /usr/bin/nvidia-smi
	linux-vdso.so.1 (0x00007ffe715b8000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fac92279000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fac92273000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fac92081000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fac92076000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fac922a1000)
$ ldd /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470
ldd: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470: No such file or directory
$ ldd /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.103.01
	linux-vdso.so.1 (0x00007ffea41d3000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007ffb739ce000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007ffb739c8000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007ffb737d6000)
	/lib64/ld-linux-x86-64.so.2 (0x00007ffb7408e000)
Don't know if this is relevant:
root@foo:/# find -L / \( -type d -name proc -o -type d -name sys \) -prune -o -type f -name 'libnvidia-ml*'
/sys
/proc
find: File system loop detected; ‘/dev/fd/3’ is part of the same file system loop as ‘/’.
find: ‘/dev/fd/4’: No such file or directory
find: ‘/dev/.lxd-mounts’: Permission denied
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.103.01
/lib/x86_64-linux-gnu/libnvidia-ml.so.470.103.01
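One quick way to see the gap between what's on disk and what the dynamic loader will actually resolve is to compare that find output with the linker cache. A minimal check, assuming standard glibc tooling inside the container:

ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so*   # the file is present on disk
ldconfig -p | grep libnvidia-ml                    # but is it in /etc/ld.so.cache?

If the second command prints nothing, the library exists but the loader's cache doesn't know about it, which would produce exactly the NVML error above.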
Taking a look at this now on our test system, which also uses a GT710 or similarly old GPU.
My setup is Ubuntu 20.04 server with the stock 5.4 kernel using:
apt-get update
apt-get dist-upgrade --yes
apt-get install linux-generic --yes
apt-get remove --purge --yes 'linux.*hwe.*'
apt-get install nvidia-utils-470 linux-modules-nvidia-470-generic libnvidia-compute-470 --yes
echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf
echo "options nouveau modeset=0" >> /etc/modprobe.d/blacklist-nouveau.conf
sudo update-initramfs -u
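As a sanity check (not in the original steps), one can confirm after the reboot that nouveau stayed out and the NVIDIA modules loaded:

lsmod | grep -i -E 'nouveau|nvidia'   # expect nvidia modules, no nouveau
cat /proc/driver/nvidia/version       # driver version as the kernel sees it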
After doing that and rebooting:
root@vm12:~# nvidia-smi
Thu Feb 24 18:42:13 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:06:00.0 N/A | N/A |
| 56% 64C P0 N/A / N/A | 0MiB / 2002MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:07:00.0 N/A | N/A |
| 54% 61C P0 N/A / N/A | 0MiB / 2002MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
root@vm12:~# snap refresh lxd --channel=latest
lxd 4.23 from Canonical✓ refreshed
root@vm12:~# lxd init --auto
root@vm12:~#
And now testing a container:
root@vm12:~# lxc launch images:debian/bullseye fred -c nvidia.runtime=true
Creating fred
Starting fred
root@vm12:~# lxc config device add fred gpu gpu
Device gpu added to fred
root@vm12:~# lxc exec fred -- nvidia-smi
Thu Feb 24 18:45:13 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:06:00.0 N/A | N/A |
| 36% 55C P0 N/A / N/A | 0MiB / 2002MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce ... Off | 00000000:07:00.0 N/A | N/A |
| 35% 52C P0 N/A / N/A | 0MiB / 2002MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
root@vm12:~#
Mount table here looks like this:
root@vm12:~# lxc exec fred -- cat /proc/mounts
/dev/sda2 / ext4 rw,relatime 0 0
none /dev tmpfs rw,relatime,size=492k,mode=755,uid=1000000,gid=1000000 0 0
proc /proc proc rw,nosuid,nodev,noexec,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
udev /dev/fuse devtmpfs rw,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
udev /dev/net/tun devtmpfs rw,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
binfmt_misc /proc/sys/fs/binfmt_misc binfmt_misc rw,nosuid,nodev,noexec,relatime 0 0
efivarfs /sys/firmware/efi/efivars efivarfs rw,nosuid,nodev,noexec,relatime 0 0
fusectl /sys/fs/fuse/connections fusectl rw,nosuid,nodev,noexec,relatime 0 0
pstore /sys/fs/pstore pstore rw,nosuid,nodev,noexec,relatime 0 0
configfs /sys/kernel/config configfs rw,nosuid,nodev,noexec,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,nosuid,nodev,noexec,relatime 0 0
securityfs /sys/kernel/security securityfs rw,nosuid,nodev,noexec,relatime 0 0
tracefs /sys/kernel/tracing tracefs rw,nosuid,nodev,noexec,relatime 0 0
mqueue /dev/mqueue mqueue rw,nosuid,nodev,noexec,relatime 0 0
tmpfs /dev/lxd tmpfs rw,relatime,size=100k,mode=755 0 0
tmpfs /dev/.lxd-mounts tmpfs rw,relatime,size=100k,mode=711 0 0
lxcfs /proc/cpuinfo fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/diskstats fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/loadavg fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/meminfo fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/stat fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/swaps fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /proc/uptime fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
lxcfs /sys/devices/system/cpu/online fuse.lxcfs rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other 0 0
tmpfs /proc/driver/nvidia tmpfs rw,nosuid,nodev,noexec,relatime,mode=555,uid=1000000,gid=1000000 0 0
/dev/sda2 /usr/bin/nvidia-smi ext4 ro,nosuid,nodev,relatime 0 0
/dev/sda2 /usr/bin/nvidia-debugdump ext4 ro,nosuid,nodev,relatime 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.470.103.01 ext4 ro,nosuid,nodev,relatime 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libcuda.so.470.103.01 ext4 ro,nosuid,nodev,relatime 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.470.103.01 ext4 ro,nosuid,nodev,relatime 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.103.01 ext4 ro,nosuid,nodev,relatime 0 0
/dev/sda2 /usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.470.103.01 ext4 ro,nosuid,nodev,relatime 0 0
/dev/sda2 /usr/lib/firmware/nvidia/470.103.01/gsp.bin ext4 ro,nosuid,nodev,relatime 0 0
udev /dev/nvidiactl devtmpfs ro,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
udev /dev/nvidia-uvm devtmpfs ro,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
udev /dev/nvidia-uvm-tools devtmpfs ro,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
udev /dev/full devtmpfs rw,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
udev /dev/null devtmpfs rw,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
udev /dev/random devtmpfs rw,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
udev /dev/tty devtmpfs rw,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
udev /dev/urandom devtmpfs rw,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
udev /dev/zero devtmpfs rw,nosuid,noexec,relatime,size=4027396k,nr_inodes=1006849,mode=755 0 0
/proc/self/fd/43 /dev/pts devpts rw,nosuid,noexec,relatime,gid=1000005,mode=620,ptmxmode=666,max=1024 0 0
/proc/self/fd/43 /dev/ptmx devpts rw,nosuid,noexec,relatime,gid=1000005,mode=620,ptmxmode=666,max=1024 0 0
/proc/self/fd/43 /dev/console devpts rw,nosuid,noexec,relatime,gid=1000005,mode=620,ptmxmode=666,max=1024 0 0
none /proc/sys/kernel/random/boot_id tmpfs ro,nosuid,nodev,noexec,relatime,size=492k,mode=755,uid=1000000,gid=1000000 0 0
tmpfs /dev/shm tmpfs rw,nosuid,nodev,uid=1000000,gid=1000000 0 0
tmpfs /run tmpfs rw,nosuid,nodev,size=1628708k,nr_inodes=819200,mode=755,uid=1000000,gid=1000000 0 0
tmpfs /run/lock tmpfs rw,nosuid,nodev,noexec,relatime,size=5120k,uid=1000000,gid=1000000 0 0
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,size=4096k,nr_inodes=1024,mode=755,uid=1000000,gid=1000000 0 0
cgroup2 /sys/fs/cgroup/unified cgroup2 rw,nosuid,nodev,noexec,relatime 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,name=systemd 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset,clone_children 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
/dev/sda2 /dev/dri/card0 ext4 rw,relatime 0 0
/dev/sda2 /dev/dri/renderD128 ext4 rw,relatime 0 0
/dev/sda2 /dev/dri/card1 ext4 rw,relatime 0 0
/dev/sda2 /dev/dri/renderD129 ext4 rw,relatime 0 0
/dev/sda2 /dev/nvidia0 ext4 rw,relatime 0 0
/dev/sda2 /dev/dri/card2 ext4 rw,relatime 0 0
/dev/sda2 /dev/dri/renderD130 ext4 rw,relatime 0 0
/dev/sda2 /dev/nvidia1 ext4 rw,relatime 0 0
root@vm12:~# dpkg -l | grep -i nvidia
ii libnvidia-compute-470:amd64 470.103.01-0ubuntu0.20.04.1 amd64 NVIDIA libcompute package
ii linux-modules-nvidia-470-5.4.0-100-generic 5.4.0-100.113 amd64 Linux kernel nvidia modules for version 5.4.0-100
ii linux-modules-nvidia-470-generic 5.4.0-100.113 amd64 Extra drivers for nvidia-470 for the generic flavour
ii linux-objects-nvidia-470-5.4.0-100-generic 5.4.0-100.113 amd64 Linux kernel nvidia modules for version 5.4.0-100 (objects)
ii linux-signatures-nvidia-5.4.0-100-generic 5.4.0-100.113 amd64 Linux kernel signatures for nvidia modules for version 5.4.0-100-generic
ii nvidia-kernel-common-470 470.103.01-0ubuntu0.20.04.1 amd64 Shared files used with the kernel module
ii nvidia-utils-470 470.103.01-0ubuntu0.20.04.1 amd64 NVIDIA driver support binaries
root@vm12:~#
My host is Debian bullseye, with the NVIDIA drivers and CUDA installed from the NVIDIA apt network repo using their instructions.
My storage driver is ZFS.
I've triple-confirmed no nouveau.
I'm remembering that I did a runfile installation on the host some time ago. I'll follow the NVIDIA instructions to do a complete uninstall of both the packaged driver and the prior runfile install.
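For reference, the NVIDIA runfile ships its own uninstaller, so the cleanup would presumably be along these lines (exact package name patterns assumed):

sudo nvidia-uninstall                                  # removes a prior .run installation
sudo apt-get remove --purge 'nvidia-*' 'libnvidia-*'   # removes the packaged driver bits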
No joy. Same results whether I do a package installation or a runfile installation. The host detects the GPUs just fine, but the container just can't locate the required libraries.
Short of changing the host to Ubuntu, I don't know what else to try.
This looks like an nvidia-container-toolkit on Debian 11 kind of issue. Same result with a podman container:
--hooks-dir=/usr/share/containers/oci/hooks.d/ \
nvidia/cuda:11.0-base nvidia-smi
✔ docker.io/nvidia/cuda:11.0-base
Trying to pull docker.io/nvidia/cuda:11.0-base...
Getting image source signatures
Copying blob b66c17bbf772 done
Copying blob 54ee1f796a1e done
Copying blob 46d371e02073 done
Copying blob f7bfea53ad12 done
Copying blob e5ce55b8b4b9 done
Copying blob 3642f1a6dfb3 done
Copying blob 155bc0332b0a done
Copying config 2ec708416b done
Writing manifest to image destination
Storing signatures
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that
the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
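For reference, the truncated command above appears to follow NVIDIA's documented podman invocation; the missing first line would presumably be something like:

podman run --rm --security-opt=label=disable \   # flags assumed; only --hooks-dir is from the original
  --hooks-dir=/usr/share/containers/oci/hooks.d/ \
  nvidia/cuda:11.0-base nvidia-smi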
FYI - I got this working for a podman container running Debian, but only by running it rootful and privileged.
That's pretty odd. Could you try running strace -f nvidia-smi
in the container, see what may be going on there?
It's really odd. I created a new container with all flags needed.
I ran nvidia-smi, and I got the same error about the library.
I did an apt update (not upgrade) and installed strace.
Ran 'strace nvidia-smi' and got output!
After running with strace, nvidia-smi is now fine.
There seems to be a dependency on strace, which isn't installed by default in Debian container images. Now I'm wondering if installing strace on the host would fix it.
Did a quick test: strace on the host is not enough. It has to be installed in the container.
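A plausible mechanism (speculation, but consistent with the ldconfig finding below): installing any package makes dpkg fire libc-bin's ldconfig trigger, which rebuilds the linker cache as a side effect. strace itself wouldn't matter:

lxc exec fred -- apt-get install -y strace   # any package install triggers ldconfig
lxc exec fred -- nvidia-smi                  # works once the cache is rebuilt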
Ah, can you get a broken container and run ldconfig
inside of it, see if that fixes it?
$ lxc exec fred ldconfig
ldconfig: File /lib/x86_64-linux-gnu/libnvidia-cfg.so.470.57.02 is empty, not checked.
ldconfig: File /lib/x86_64-linux-gnu/libnvidia-allocator.so.470.57.02 is empty, not checked.
ldconfig: File /lib/x86_64-linux-gnu/libnvidia-opencl.so.470.57.02 is empty, not checked.
ldconfig: File /lib/x86_64-linux-gnu/libnvidia-compiler.so.470.57.02 is empty, not checked.
ldconfig: File /lib/x86_64-linux-gnu/libnvidia-ml.so.470.57.02 is empty, not checked.
ldconfig: File /lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.470.57.02 is empty, not checked.
ldconfig: File /lib/x86_64-linux-gnu/libcuda.so.470.57.02 is empty, not checked.
$ lxc exec fred nvidia-smi
Fri Feb 25 15:40:55 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:25:00.0 N/A | N/A |
| N/A 42C P0 N/A / N/A | 0MiB / 1999MiB | N/A Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 Quadro P620 Off | 00000000:27:00.0 Off | N/A |
| 32% 43C P0 N/A / N/A | 0MiB / 2000MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
:-)
I wonder if it's just an issue of the linker cache being wrong somehow, and installing something in the container triggered an ldconfig run which then fixed it.
I seem to remember the nvidia-container stack having quite a bit of smarts around ldconfig, but it needs to use the host's ldconfig, as the container's can't be trusted. This may explain why Ubuntu on the host works, since that's likely what they actively test on, while Debian's version may behave differently and not work properly.
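If that's the difference, one thing worth checking on the Debian host would be the ldconfig entry in the host-installed nvidia-container-toolkit's config (a leading @ in this setting means "use the host's binary"):

grep ldconfig /etc/nvidia-container-runtime/config.toml
# Ubuntu ships the binary as /sbin/ldconfig.real, Debian as plain /sbin/ldconfig;
# a config pointing at the wrong path would leave the container's cache stale.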
Alright, well, sounds like we found the culprit :)
It's not something we can really do anything about in LXD because, similarly to nvidia-container, we can't trust the containers and so wouldn't want to directly call any of their binaries to work around this issue. At least you have a quick and easy workaround for this one!
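For anyone landing here, the workaround distilled to its minimum (container name assumed):

lxc exec fred -- ldconfig     # rebuild the container's /etc/ld.so.cache
lxc exec fred -- nvidia-smi   # libnvidia-ml.so now resolves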
Required information
Issue description
Attempting to create a new container with NVIDIA GPU passthrough.
I was following this tutorial (with a Debian image instead of Ubuntu): https://ubuntu.com/tutorials/gpu-data-processing-inside-lxd#1-overview
Actual: lxc exec fred -- nvidia-smi fails with an error message indicating a library could not be found.
Expected: output from nvidia-smi inside the container.
Steps to reproduce
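The steps, judging from the commands used elsewhere in this thread, were presumably:

lxc launch images:debian/bullseye fred -c nvidia.runtime=true
lxc config device add fred gpu gpu
lxc exec fred -- nvidia-smi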
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
The result is the same doing it as root, or using an Ubuntu image for the container.
Saw this issue: https://github.com/lxc/lxd/issues/7840 but unmounting /proc/driver/nvidia inside the container didn't make a difference.
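For clarity, that unmount test would have been run inside the container, presumably as:

lxc exec fred -- umount /proc/driver/nvidia   # suggested in #7840; no effect here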
Information to attach
dmesg
lxc info NAME --show-log
lxc config show NAME --expanded
lxc monitor (while reproducing the issue)

Attachments: fred-log.txt, fred-configuration.txt, lxd-log.txt, fred-debug-output.txt