Closed: Ixian closed this issue 1 year ago.
@Talung I don't understand the issue.

> but obviously no nvidia docker runtime.

What does that mean?
Sounds like everything is working in Ubuntu for you (including the nvidia drivers and docker). If that's the case and the issue is that you can't get the Debian image to work, then please open a new issue and post the errors you're running into. Perhaps you could also remove the hidden lxc directory (check with `ls -la` in the script directory) to ensure the Debian image will be freshly downloaded.
> @Talung I don't understand the issue.
> but obviously no nvidia docker runtime.
Sorry, I didn't make myself clear. When running the `docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi` command, it specifies `--runtime=nvidia`, which doesn't exist until you add the nvidia-container-toolkit and update daemon.json.
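For reference, the runtime that `--runtime=nvidia` refers to is registered in `/etc/docker/daemon.json`; the stock entry from the NVIDIA Container Toolkit documentation looks like this (shown as a sketch, since your installation may write it with a different path):

```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

The docker daemon needs a restart after this file changes.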
I will start another Debian image, but I remember not getting any access to `nvidia-smi`, which was available in the Ubuntu version. It could also be an old image, so I will attempt it again. Will let you know the results.
UPDATE: It was the cache. As soon as I cleared it and created a new image, `nvidia-smi` was there.
UPDATE 2: It fully works in Debian 11 now.
root@debianjail:~# docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Unable to find image 'nvidia/cuda:11.6.2-base-ubuntu20.04' locally
11.6.2-base-ubuntu20.04: Pulling from nvidia/cuda
846c0b181fff: Pull complete
b787be75b30b: Pull complete
40a5337e592b: Pull complete
8055c4cd4ab2: Pull complete
a0c882e23131: Pull complete
Digest: sha256:9928940c6e88ed3cdee08e0ea451c082a0ebf058f258f6fbc7f6c116aeb02143
Status: Downloaded newer image for nvidia/cuda:11.6.2-base-ubuntu20.04
Fri Mar 3 13:41:21 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01 Driver Version: 515.65.01 CUDA Version: 11.7 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:03:00.0 Off | N/A |
| 29% 38C P5 20W / 180W | 0MiB / 8192MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
What does `nvidia-container-cli list` return, on the host and in the jail? Try it right after you start the jail and access the shell, too.
On the TrueNAS box:
root@truenas[~]# nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-modeset
/dev/nvidia0
/usr/lib/nvidia/current/nvidia-smi
/usr/bin/nvidia-persistenced
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.65.01
And in the Debian machine straight after it starts:
root@truenas[/mnt/pond/jailmaker]# machinectl shell debianjail
Connected to machine debianjail. Press ^] three times within 1s to exit session.
root@debianjail:~# nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-modeset
/dev/nvidia0
/usr/bin/nvidia-smi
/usr/bin/nvidia-persistenced
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.65.01
My box has an old GTX 1070 card. Not everything may be available through it, but Jellyfin etc. is definitely seeing it now.
Would you mind trying a little experiment to see if you can reproduce the problem I'm seeing? Stop your jail with `machinectl stop <yourjailname>`, start it again, and check the full output that returns. When I start a jail I get this:
sudo ./jlmkr.py start jaildocker2
Config loaded!
nvidia-container-cli: initialization error: nvml error: driver not loaded
Failed to run nvidia-container-cli.
Unable to detect which nvidia driver files to mount.
Falling back to hard-coded list of nvidia files...
ldconfig: File /lib/x86_64-linux-gnu/libnvidia-ml.so.1 is empty, not checked.
Inside the jail shell I can successfully run `nvidia-smi`, however `nvidia-container-cli list` fails:

nvidia-container-cli list
nvidia-container-cli: initialization error: nvml error: driver not loaded
Only when I bring up my compose stack, which includes Plex, Emby, and Tdarr (all of which use the GPU), does the error go away.
Though it seems like a non-critical error (because eventually GPU passthrough does work), something is clearly still not working, and that is what we are trying to run down. It would be really helpful to confirm whether or not this happens to anyone but me (since @Jip-Hop doesn't have an Nvidia GPU to test with).
Ok, so I did what you asked, and I seem to have no issues whatsoever. Here are the outputs:
root@truenas[/mnt/pond/jailmaker]# ./jlmkr.py start debianjail
Config loaded!
Starting jail with the following command:
systemd-run --property=KillMode=mixed --property=Type=notify --property=RestartForceExitStatus=133 --property=SuccessExitStatus=133 --property=Delegate=yes --property=TasksMax=infinity --collect --setenv=SYSTEMD_NSPAWN_LOCK=0 --unit=jlmkr-debianjail --working-directory=./jails/debianjail '--description=My nspawn jail debianjail [created with jailmaker]' --setenv=SYSTEMD_SECCOMP=0 --property=DevicePolicy=auto -- systemd-nspawn --keep-unit --quiet --boot --machine=debianjail --directory=rootfs --capability=all '--system-call-filter=add_key keyctl bpf' '--property=DeviceAllow=char-drm rw' --bind=/dev/dri --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.65.01 --bind=/dev/nvidia-caps --bind=/dev/nvidiactl --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.65.01 --bind-ro=/usr/bin/nvidia-persistenced --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.515.65.01 --bind=/dev/nvidia0 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.65.01 --bind-ro=/usr/bin/nvidia-smi --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.515.65.01 --bind-ro=/usr/lib/nvidia/current/nvidia-smi --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.65.01
Starting jail with name: debianjail
Running as unit: jlmkr-debianjail.service
Check logging:
journalctl -u jlmkr-debianjail
Check status:
systemctl status jlmkr-debianjail
Stop the jail:
machinectl stop debianjail
Get a shell:
machinectl shell debianjail
root@truenas[/mnt/pond/jailmaker]# machinectl shell debianjail
Connected to machine debianjail. Press ^] three times within 1s to exit session.
root@debianjail:~# nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-modeset
/dev/nvidia0
/usr/bin/nvidia-smi
/usr/bin/nvidia-persistenced
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.65.01
root@debianjail:~# exit
logout
Connection to machine debianjail terminated.
root@truenas[/mnt/pond/jailmaker]# machinectl stop debianjail
root@truenas[/mnt/pond/jailmaker]# ./jlmkr.py start debianjail
Config loaded!
Starting jail with the following command:
systemd-run --property=KillMode=mixed --property=Type=notify --property=RestartForceExitStatus=133 --property=SuccessExitStatus=133 --property=Delegate=yes --property=TasksMax=infinity --collect --setenv=SYSTEMD_NSPAWN_LOCK=0 --unit=jlmkr-debianjail --working-directory=./jails/debianjail '--description=My nspawn jail debianjail [created with jailmaker]' --setenv=SYSTEMD_SECCOMP=0 --property=DevicePolicy=auto -- systemd-nspawn --keep-unit --quiet --boot --machine=debianjail --directory=rootfs --capability=all '--system-call-filter=add_key keyctl bpf' '--property=DeviceAllow=char-drm rw' --bind=/dev/dri --bind-ro=/usr/bin/nvidia-smi --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.65.01 --bind-ro=/usr/bin/nvidia-persistenced --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.65.01 --bind=/dev/nvidia-caps --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.515.65.01 --bind-ro=/usr/lib/nvidia/current/nvidia-smi --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.65.01 --bind=/dev/nvidiactl --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.65.01 --bind=/dev/nvidia0 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.515.65.01
Starting jail with name: debianjail
Running as unit: jlmkr-debianjail.service
Check logging:
journalctl -u jlmkr-debianjail
Check status:
systemctl status jlmkr-debianjail
Stop the jail:
machinectl stop debianjail
Get a shell:
machinectl shell debianjail
root@truenas[/mnt/pond/jailmaker]# machinectl shell debianjail
Connected to machine debianjail. Press ^] three times within 1s to exit session.
root@debianjail:~# nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-modeset
/dev/nvidia0
/usr/bin/nvidia-smi
/usr/bin/nvidia-persistenced
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.65.01
root@debianjail:~# exit
This is on the Debian 11 jail I just created, using the exact same sequence as I did for the Ubuntu machine. Maybe I stumbled on a good installation sequence? I am only really familiar with LXC and containers through the use of Proxmox, so this is all fairly new to me. Docker as well, for only a few months.
Is this the info you were seeking? Let me know if you want any other tests done.
Thanks @Talung. Looks good!
@Ixian please double check you have the latest script and try a fresh Debian jail.
> ldconfig: File /lib/x86_64-linux-gnu/libnvidia-ml.so.1 is empty, not checked.

This looks like a file left over from running a previous version of the script.
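If it is such a leftover, a small guard along these lines could clean it up before the jail starts (a sketch; the helper name `remove_if_empty` is mine, and the path comes from the ldconfig warning above):

```shell
# remove_if_empty: delete a file only if it exists and has zero size,
# e.g. an empty libnvidia-ml.so.1 stub left behind by an earlier run.
remove_if_empty() {
    if [ -f "$1" ] && [ ! -s "$1" ]; then
        rm -- "$1"
    fi
}

# usage (path taken from the ldconfig warning):
# remove_if_empty /lib/x86_64-linux-gnu/libnvidia-ml.so.1
```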
And please reboot as well, just to rule that out.
Give me a few... going to disable the start of the Ubuntu jail that runs my main dockers. Then I will reboot, get the latest image (also remove the .lxc cache), and run through my installation scripts.
Will post results after installation, then after stopping and starting the virtual machine. Do you want another reboot between the starts?
@Talung your stuff looks good. Only @Ixian should try those steps :)
> @Talung your stuff looks good. Only @Ixian should try those steps :)
Oops, just started doing the stuff again... won't hurt :D and it will confirm the fresh approach. :)
Starting to look like I broke something with my Scale installation. `nvidia-container-cli list` doesn't work inside or outside of the jail.
Hmm... very interesting. My experiment now shows problems:
root@debianjail:~# docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Unable to find image 'nvidia/cuda:11.6.2-base-ubuntu20.04' locally
11.6.2-base-ubuntu20.04: Pulling from nvidia/cuda
846c0b181fff: Pull complete
b787be75b30b: Pull complete
40a5337e592b: Pull complete
8055c4cd4ab2: Pull complete
a0c882e23131: Pull complete
Digest: sha256:9928940c6e88ed3cdee08e0ea451c082a0ebf058f258f6fbc7f6c116aeb02143
Status: Downloaded newer image for nvidia/cuda:11.6.2-base-ubuntu20.04
docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: Auto-detected mode as 'legacy'
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
and
root@debianjail:~# nvidia-container-cli list
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory
This is after the reboot, with everything clean. The only real difference was that I got the Ubuntu jammy jail working before I tried the Debian one again. And now the Ubuntu jail is exhibiting the same issues.
Very weird.
EDIT: `nvidia-smi` is now available. It seems you need to wait some time after TrueNAS boots for that stuff to become active.
Yeah, something still isn't right with how the drivers are being pulled from the host into the jail, but it's tricky trying to run things down. I'm seeing inconsistent results too.
Mine is working again. Going through the process, this is what I noticed after the reboot:
Starting jail with the following command:
systemd-run --property=KillMode=mixed --property=Type=notify --property=RestartForceExitStatus=133 --property=SuccessExitStatus=133 --property=Delegate=yes --property=TasksMax=infinity --collect --setenv=SYSTEMD_NSPAWN_LOCK=0 --unit=jlmkr-debianjail --working-directory=./jails/debianjail '--description=My nspawn jail debianjail [created with jailmaker]' --setenv=SYSTEMD_SECCOMP=0 --property=DevicePolicy=auto -- systemd-nspawn --keep-unit --quiet --boot --machine=debianjail --directory=rootfs --capability=all '--system-call-filter=add_key keyctl bpf' '--property=DeviceAllow=char-drm rw' --bind=/dev/dri
Starting jail with name: debianjail
This one failed. Then, while looking around the machine for a reason, I tried turning on my original ubuntuDocker jail, which also failed. I deleted the new Debian one and created the jail again, but this time the command came back with:
Starting jail with the following command:
systemd-run --property=KillMode=mixed --property=Type=notify --property=RestartForceExitStatus=133 --property=SuccessExitStatus=133 --property=Delegate=yes --property=TasksMax=infinity --collect --setenv=SYSTEMD_NSPAWN_LOCK=0 --unit=jlmkr-debianjail --working-directory=./jails/debianjail '--description=My nspawn jail debianjail [created with jailmaker]' --setenv=SYSTEMD_SECCOMP=0 --property=DevicePolicy=auto -- systemd-nspawn --keep-unit --quiet --boot --machine=debianjail --directory=rootfs --capability=all '--system-call-filter=add_key keyctl bpf' '--property=DeviceAllow=char-drm rw' --bind=/dev/dri --bind=/dev/nvidia-caps --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.65.01 --bind-ro=/usr/lib/nvidia/current/nvidia-smi --bind=/dev/nvidiactl --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.65.01 --bind=/dev/nvidia0 --bind-ro=/usr/bin/nvidia-persistenced --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.65.01 --bind-ro=/usr/bin/nvidia-smi
Starting jail with name: debianjail
What it looks like to me is some sort of timing issue. It's almost as if TrueNAS needs to "settle" itself before the cards become available.
EDIT: stopping and restarting the Ubuntu jail is now working as expected.
What I'm finding now is that the card needs to be "initialized" somehow, either on the host or inside the jail.
For example, `nvidia-container-cli list` will fail at first, but `nvidia-smi` works, and once that has been run, `nvidia-container-cli list` works too. This is true whether you do it on the host or in the jail.
The problem is that this isn't how it used to work. I'm wondering if we're accidentally modifying system files from the jail.
When I did my reboot and rebuild, I did run `nvidia-smi` on the root system before starting the builds. I did not run `nvidia-container-cli` beforehand. I could try this tomorrow, as I am about to go to sleep.
Maybe we need to enable the nvidia-persistenced service, as suggested by @TrueJournals, before running `nvidia-container-cli`?

> The nvidia-persistenced utility is used to enable persistent software state in the NVIDIA driver. When persistence mode is enabled, the daemon prevents the driver from releasing device state when the device is not in use. This can improve the startup time of new clients in this scenario. Source.
The latest script retries `nvidia-container-cli` 3 times in case it fails. Maybe that helps?
Also, the `gpu_passthrough` config value is deprecated in favor of `gpu_passthrough_nvidia` and `gpu_passthrough_intel`. During jail creation you'll be asked about both in case the GPUs are detected. The new script won't write the `gpu_passthrough` config value for new jails. If it reads `gpu_passthrough` from the config, it will try to pass through both intel and nvidia like it currently does.
I'd take the retry out - it won't do anything.
What I'm doing at the moment is just using a simple shell script to start the jail:

```shell
#!/usr/bin/env bash
sleep 5
nvidia-smi -f /dev/null
sleep 5
/mnt/ssd-storage/jailmaker/jlmkr.py start debianjail
```

Crude, but it works.
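A slightly more robust variant of the same idea is to poll until the driver actually answers instead of sleeping for fixed amounts. This is only a sketch: the `wait_for` helper is a name I made up, and the jail name and script path are copied from the script above.

```shell
#!/usr/bin/env bash
# wait_for: run a command until it succeeds, trying at most <tries>
# times and sleeping <delay> seconds between failed attempts.
wait_for() {
    local tries=$1 delay=$2
    shift 2
    local i
    for ((i = 0; i < tries; i++)); do
        "$@" >/dev/null 2>&1 && return 0
        sleep "$delay"
    done
    return 1
}

# Wait up to ~60 seconds for the nvidia driver, then start the jail
# (uncomment on a real host; paths/names as in the script above):
# wait_for 12 5 nvidia-smi && /mnt/ssd-storage/jailmaker/jlmkr.py start debianjail
```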
Check this out: setup_nvidia_gpu. Seems like TrueNAS doesn't fully load/init the GPU by default?
```python
# We install the nvidia-kernel-dkms package which causes a modprobe file to be written
# (i.e /etc/modprobe.d/nvidia.conf). This file tries to modprobe all the associated
# nvidia drivers at boot whether or not your system has an nvidia card installed.
# For all truenas certified and truenas enterprise hardware, we do not include nvidia GPUS.
# So to prevent a bunch of systemd "Failed" messages to be barfed to the console during boot,
# we remove this file because the linux kernel dynamically loads the modules based on whether
# or not you have the actual hardware installed in the system.
with contextlib.suppress(FileNotFoundError):
    os.unlink(os.path.join(CHROOT_BASEDIR, 'etc/modprobe.d/nvidia.conf'))
```
An excerpt from the TrueNAS scale-build repo.
Perhaps instead of calling `nvidia-smi` we should run:

```shell
modprobe nvidia-current-uvm
nvidia-modprobe -c0 -u
```
Sounds like this is what we're running into: https://www.reddit.com/r/qnap/comments/s7bbv6/fix_for_missing_nvidiauvm_device_devnvidiauvm/
Once we get the startup streamlined, we should test for endurance:

> Sometimes, after the host has been up for a long time, the /dev/nvidia-uvm or other device nodes may disappear. In this case, simply run the nvidia-uvm-init script, perhaps schedule it to run as a cron job. Source.
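For context, an nvidia-uvm-init style script essentially loads the uvm module and recreates its device node from the major number the kernel assigned in /proc/devices, following the pattern shown in NVIDIA's CUDA installation guide. The sketch below factors the parsing into a function and guards the device-touching part so it only runs on a host where the nvidia driver is actually present:

```shell
#!/usr/bin/env bash
# get_uvm_major: read a /proc/devices-style listing on stdin and print
# the major number assigned to nvidia-uvm (prints nothing if absent).
get_uvm_major() {
    awk '$2 == "nvidia-uvm" { print $1 }'
}

# Recreate /dev/nvidia-uvm only on a host where the driver is loaded.
if [ -e /dev/nvidiactl ] && [ ! -e /dev/nvidia-uvm ]; then
    modprobe nvidia-uvm || true
    major=$(get_uvm_major < /proc/devices)
    if [ -n "$major" ]; then
        mknod -m 666 /dev/nvidia-uvm c "$major" 0
    fi
fi
```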
> I'd take the retry out - it won't do anything.
The retry is out. `modprobe nvidia-current-uvm` and `nvidia-modprobe -c0 -u` are in. Please try the latest script 🙂
I'm sure it won't hurt, but I already knew about that problem and had the modprobe command running as a pre-init task. I still had the problem, but perhaps it would be better to have the jail script run it instead; I'll give it a try.
I just tried a reboot with the dockerjail starting as a post-init script, with the updated script downloaded. Unfortunately, anything with `runtime: nvidia` didn't start, meaning passthrough did not happen. Manually stopping and starting it does work now. I did run `nvidia-smi` beforehand to make sure TrueNAS picked the card up.
Just FYI.
Any chance you could post the logs of the jailmaker script when it is starting dockerjail after a reboot? You may need to redirect the output somewhere with `>`, or mail it by piping the output like so: `./jlmkr.py start dockerjail | mail -s "Jailmaker" "youremail@example.com"`. Or you could temporarily disable the startup script and run jlmkr manually after the reboot.
I'm tempted to just call `nvidia-smi` once before `nvidia-container-cli list` just to be done with it.
Sure, no problem. Here is the load log:
root@truenas[/mnt/pond/jailmaker]# cat loadlog.log
Config loaded!
Starting jail with the following command:
systemd-run --property=KillMode=mixed --property=Type=notify --property=RestartForceExitStatus=133 --property=SuccessExitStatus=133 --property=Delegate=yes --property=TasksMax=infinity --collect --setenv=SYSTEMD_NSPAWN_LOCK=0 --unit=jlmkr-dockerjail --working-directory=./jails/dockerjail '--description=My nspawn jail dockerjail [created with jailmaker]' --setenv=SYSTEMD_SECCOMP=0 --property=DevicePolicy=auto -- systemd-nspawn --keep-unit --quiet --boot --machine=dockerjail --directory=rootfs --capability=all '--system-call-filter=add_key keyctl bpf' '--property=DeviceAllow=char-drm rw' --bind=/dev/dri --bind=/mnt/pond/dockerset --bind=/mnt/pond/appdata/ --bind=/mnt/lake/media/ --bind=/mnt/lake/cloud/
Starting jail with name: dockerjail
Check logging:
journalctl -u jlmkr-dockerjail
Check status:
systemctl status jlmkr-dockerjail
Stop the jail:
machinectl stop dockerjail
Get a shell:
machinectl shell dockerjail
There is no nvidia stuff in there. And here is the log after stopping and starting; in between, I ran `nvidia-smi` and `nvidia-container-cli list`:
root@truenas[/mnt/pond/jailmaker]# machinectl stop dockerjail
root@truenas[/mnt/pond/jailmaker]# ./jlmkr.py start dockerjail
Config loaded!
Starting jail with the following command:
systemd-run --property=KillMode=mixed --property=Type=notify --property=RestartForceExitStatus=133 --property=SuccessExitStatus=133 --property=Delegate=yes --property=TasksMax=infinity --collect --setenv=SYSTEMD_NSPAWN_LOCK=0 --unit=jlmkr-dockerjail --working-directory=./jails/dockerjail '--description=My nspawn jail dockerjail [created with jailmaker]' --setenv=SYSTEMD_SECCOMP=0 --property=DevicePolicy=auto -- systemd-nspawn --keep-unit --quiet --boot --machine=dockerjail --directory=rootfs --capability=all '--system-call-filter=add_key keyctl bpf' '--property=DeviceAllow=char-drm rw' --bind=/dev/dri --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.515.65.01 --bind=/dev/nvidia-uvm-tools --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.515.65.01 --bind=/dev/nvidiactl --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.515.65.01 --bind=/dev/nvidia-uvm --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.65.01 --bind-ro=/usr/bin/nvidia-persistenced --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.65.01 --bind=/dev/nvidia-caps --bind-ro=/usr/lib/nvidia/current/nvidia-smi --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.65.01 --bind=/dev/nvidia0 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.515.65.01 --bind-ro=/usr/bin/nvidia-smi --bind=/mnt/pond/dockerset --bind=/mnt/pond/appdata/ --bind=/mnt/lake/media/ --bind=/mnt/lake/cloud/
Starting jail with name: dockerjail
Running as unit: jlmkr-dockerjail.service
Check logging:
journalctl -u jlmkr-dockerjail
Check status:
systemctl status jlmkr-dockerjail
Stop the jail:
machinectl stop dockerjail
Get a shell:
machinectl shell dockerjail
My post-init command is as follows: `/mnt/pond/jailmaker/jlmkr.py start dockerjail > /mnt/pond/jailmaker/loadlog.log`
Unfortunately I won't be able to do a lot more testing for the next week, as I'm packing up the PCs soon and moving. Hopefully by next Thursday I will have most of the stuff up and running again so I can do more testing.
Thanks @Talung. Was that with the latest script? I was expecting to see "No nvidia GPU seems to be present... Skip passthrough of nvidia GPU." in the first case.
But I think it's clear that the `/dev/nvidia*` nodes don't exist yet that soon after boot, so I can't rely on them to detect whether an nvidia GPU is installed.
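One alternative (just a sketch, not what the script currently does) would be to detect the GPU from the PCI bus, which is enumerated at boot, instead of from the lazily created `/dev/nvidia*` nodes. The filter is factored into a function so it can be fed any lspci-style listing:

```shell
# pci_list_has_nvidia: succeed if an lspci-style listing on stdin
# mentions an NVIDIA device.
pci_list_has_nvidia() {
    grep -qi 'nvidia'
}

# Intended usage (assumption: lspci is available on the host):
# if lspci | pci_list_has_nvidia; then echo "nvidia GPU present"; fi
```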
Yes. This morning I read all the other posts, then ran the update and did a reboot. So unless the script has changed in the last 7 hours, that should be the latest script. Maybe add a little version number to the output so we can confirm that sort of thing. Also, a "run log" stored alongside the config would be good for debugging.
Just suggestions. :)
Thanks @Talung. Versioning has started. We're at v0.0.1.
If anyone could test the following sequence: was the jail started with nvidia GPU passthrough working (without manually running `nvidia-smi` or modprobe)?
Did you change anything else in the script besides the versioning? I was going through those tests you suggested: got the latest script (with version numbers), disabled the post-init run (except I actually didn't, because I didn't hit the save button), and rebooted.
Did the whole setup:
root@truenas[~]# uptime
18:29:49 up 1 min, 1 user, load average: 7.09, 1.97, 0.68
root@truenas[~]# cd /mnt/pond/jailmaker
root@truenas[/mnt/pond/jailmaker]# ./jlmkr.py create testjail
USE THIS SCRIPT AT YOUR OWN RISK!
IT COMES WITHOUT WARRANTY AND IS NOT SUPPORTED BY IXSYSTEMS.
Install the recommended distro (Debian 11)? [Y/n]
Enter jail name: testjail
Docker won't be installed by jlmkr.py.
But it can setup the jail with the capabilities required to run docker.
You can turn DOCKER_COMPATIBLE mode on/off post-install.
Make jail docker compatible right now? [y/N] y
Detected the presence of an intel GPU.
Passthrough the intel GPU? [y/N] y
Detected the presence of an nvidia GPU.
Passthrough the nvidia GPU? [y/N] y
WARNING: CHECK SYNTAX
You may pass additional flags to systemd-nspawn.
With incorrect flags the jail may not start.
It is possible to correct/add/remove flags post-install.
Show the man page for systemd-nspawn? [y/N]
You may read the systemd-nspawn manual online:
https://manpages.debian.org/bullseye/systemd-container/systemd-nspawn.1.en.html
For example to mount directories inside the jail you may add:
--bind='/mnt/data/a writable directory/' --bind-ro='/mnt/data/a readonly directory/'
Additional flags:
Using image from local cache
Unpacking the rootfs
---
You just created a Debian bullseye amd64 (20230303_05:25) container.
To enable SSH, run: apt install openssh-server
No default root or user password are set by LXC.
Do you want to start the jail? [Y/n] y
Config loaded!
Starting jail with the following command:
systemd-run --property=KillMode=mixed --property=Type=notify --property=RestartForceExitStatus=133 --property=SuccessExitStatus=133 --property=Delegate=yes --property=TasksMax=infinity --collect --setenv=SYSTEMD_NSPAWN_LOCK=0 --unit=jlmkr-testjail --working-directory=./jails/testjail '--description=My nspawn jail testjail [created with jailmaker]' --setenv=SYSTEMD_SECCOMP=0 --property=DevicePolicy=auto -- systemd-nspawn --keep-unit --quiet --boot --machine=testjail --directory=rootfs --capability=all '--system-call-filter=add_key keyctl bpf' '--property=DeviceAllow=char-drm rw' --bind=/dev/dri --bind-ro=/usr/lib/nvidia/current/nvidia-smi --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.65.01 --bind-ro=/usr/bin/nvidia-persistenced --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.65.01 --bind-ro=/usr/bin/nvidia-smi --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.515.65.01 --bind=/dev/nvidia-caps --bind=/dev/nvidia0 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.65.01 --bind=/dev/nvidiactl
Starting jail with name: testjail
Running as unit: jlmkr-testjail.service
Check logging:
journalctl -u jlmkr-testjail
Check status:
systemctl status jlmkr-testjail
Stop the jail:
machinectl stop testjail
Get a shell:
machinectl shell testjail
And then I noticed I had an email from watchtower, which made me realise I hadn't saved the "disabled" change. However, this time the GPU initialised on boot. Here is the log:
root@truenas[/mnt/pond/jailmaker]# cat loadlog.log
Config loaded!
Starting jail with the following command:
systemd-run --property=KillMode=mixed --property=Type=notify --property=RestartForceExitStatus=133 --property=SuccessExitStatus=133 --property=Delegate=yes --property=TasksMax=infinity --collect --setenv=SYSTEMD_NSPAWN_LOCK=0 --unit=jlmkr-dockerjail --working-directory=./jails/dockerjail '--description=My nspawn jail dockerjail [created with jailmaker]' --setenv=SYSTEMD_SECCOMP=0 --property=DevicePolicy=auto -- systemd-nspawn --keep-unit --quiet --boot --machine=dockerjail --directory=rootfs --capability=all '--system-call-filter=add_key keyctl bpf' '--property=DeviceAllow=char-drm rw' --bind=/dev/dri --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.515.65.01 --bind-ro=/usr/lib/nvidia/current/nvidia-smi --bind-ro=/usr/bin/nvidia-smi --bind=/dev/nvidia0 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.515.65.01 --bind=/dev/nvidiactl --bind-ro=/usr/bin/nvidia-persistenced --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.515.65.01 --bind=/dev/nvidia-caps --bind-ro=/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.515.65.01 --bind-ro=/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.65.01 --bind=/mnt/pond/dockerset --bind=/mnt/pond/appdata/ --bind=/mnt/lake/media/ --bind=/mnt/lake/cloud/
Starting jail with name: dockerjail
Looking at the commit history, I see some other changes were made and whatever it was, it seems to have worked.
EDIT: for funsies I did another reboot and guess what... GPU was in jail again!
Sounds good! Thanks @Talung
Yes, I did more than increment the version number hehe ^^
Detected the presence of an nvidia GPU.
Passthrough the nvidia GPU? [y/N] y
This looks good, as it detected the nvidia GPU straight after reboot thanks to nvidia-smi. It no longer depends on the /dev/nvidia* devices existing.
And then you did another reboot and it ran nvidia-container-toolkit list successfully, because the script now runs nvidia-smi beforehand (good idea @Ixian).
So seems to be working now!?
Well if working means that I did 2 reboots and GPU came up both times without issue in a jail with GPU passthrough, then I would say: "Yes, it is working!"
Well done!
Grabbed the latest script and just tried a reboot myself, and I'm definitely running into the linked issue.
Everything 'seemed' to be working (nvidia-smi ran successfully in host, jail, and container), but Plex refused to do HW transcoding. I also tried a tensorflow docker container and my GPU wasn't listed.
After poking around a while, I discovered that I didn't have /dev/nvidia-uvm. The module was loaded, and I even tried unloading and reloading it. I also tried starting nvidia-persistenced, but nothing seemed to work.
I stopped the jail and ran mknod for /dev/nvidia-uvm and /dev/nvidia-uvm-tools:
D=`grep nvidia-uvm /proc/devices | awk '{print $1}'`
mknod -m 666 /dev/nvidia-uvm c $D 0
mknod -m 666 /dev/nvidia-uvm-tools c $D 0
Then re-started the jail, and transcoding in Plex worked! Tried the tensorflow container again and it listed my GPU.
So it seems like 'something' is still missing to get the nvidia-uvm device created.
Probably worth noting that I'm on TrueNAS SCALE 22.12.1.
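For what it's worth, the major-number lookup in the mknod snippet above can be illustrated against a fabricated /proc/devices excerpt. This is just a sketch: the entries and the number 510 below are made up for the example, and on a real host the major number varies per boot.

```shell
# Sketch of how the mknod snippet above finds the nvidia-uvm major number.
# The /proc/devices excerpt is illustrative only.
sample='Character devices:
  1 mem
195 nvidia
510 nvidia-uvm'
# grep picks out the nvidia-uvm line; awk prints its first field (the major number)
D=$(printf '%s\n' "$sample" | grep nvidia-uvm | awk '{print $1}')
echo "$D"
# On a real host you would then create the device node with:
#   mknod -m 666 /dev/nvidia-uvm c "$D" 0
```

The real commands read the live /proc/devices instead of a hard-coded sample, but the parsing is the same.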
It seems that nvidia-modprobe doesn't work because the modules are named nvidia-current-*.ko instead of just nvidia-*.ko:
root@freenas:~# find /lib/modules -name nvidia\*
/lib/modules/5.15.79+truenas/kernel/drivers/net/ethernet/nvidia
/lib/modules/5.15.79+truenas/updates/dkms/nvidia-current-drm.ko
/lib/modules/5.15.79+truenas/updates/dkms/nvidia-current-peermem.ko
/lib/modules/5.15.79+truenas/updates/dkms/nvidia-current-modeset.ko
/lib/modules/5.15.79+truenas/updates/dkms/nvidia-current-uvm.ko
/lib/modules/5.15.79+truenas/updates/dkms/nvidia-current.ko
But nvidia-modprobe is hard-coded to use nvidia-uvm as the module name.
I did get nvidia-modprobe to do the right thing by creating a symbolic link and running depmod:
ln -s /lib/modules/5.15.79+truenas/updates/dkms/nvidia-current-uvm.ko /lib/modules/5.15.79+truenas/updates/dkms/nvidia-uvm.ko
depmod
nvidia-modprobe -c0 -u
After that, /dev/nvidia-uvm exists.
Since the mknod commands are documented by nvidia, that solution feels a bit less 'hacky'.
@TrueJournals I've had to have the following running as a pre-init command since at least 2 Scale releases past:
[ ! -f /dev/nvidia-uvm ] && modprobe nvidia-current-uvm && /usr/bin/nvidia-modprobe -c0 -u
This keeps the situation you're seeing from happening. That was true even when I was running docker off the SCALE host itself; I've had it in there ever since and haven't had this problem.
I think @Jip-Hop added it to the script as well, but I believe it is something that needs to happen pre-init if you want your Nvidia GPU to reliably show up in SCALE. It has something to do with how iX Systems won't load the module unless it's called upon, to eliminate boot logging errors. The K3s-backed app system handles this behind the scenes when it's used; we need to do it manually.
Thanks for that tip @Ixian ! Looks like that will do it. Quick log from boot (without any special init):
root@freenas:~# ls /dev/nvid*
ls: cannot access '/dev/nvid*': No such file or directory
root@freenas:~# lsmod | grep nvid
nvidia_drm 73728 0
nvidia_modeset 1150976 1 nvidia_drm
nvidia 40853504 1 nvidia_modeset
drm_kms_helper 315392 1 nvidia_drm
drm 643072 4 drm_kms_helper,nvidia,nvidia_drm
root@freenas:~# modprobe nvidia-current-uvm
root@freenas:~# ls /dev/nvid*
ls: cannot access '/dev/nvid*': No such file or directory
root@freenas:~# lsmod | grep nvid
nvidia_uvm 1302528 0
nvidia_drm 73728 0
nvidia_modeset 1150976 1 nvidia_drm
nvidia 40853504 2 nvidia_uvm,nvidia_modeset
drm_kms_helper 315392 1 nvidia_drm
drm 643072 4 drm_kms_helper,nvidia,nvidia_drm
root@freenas:~# nvidia-modprobe -c0 -u
root@freenas:~# ls /dev/nvid*
/dev/nvidia-uvm /dev/nvidia-uvm-tools
root@freenas:~# lsmod | grep nvid
nvidia_uvm 1302528 0
nvidia_drm 73728 0
nvidia_modeset 1150976 1 nvidia_drm
nvidia 40853504 2 nvidia_uvm,nvidia_modeset
drm_kms_helper 315392 1 nvidia_drm
drm 643072 4 drm_kms_helper,nvidia,nvidia_drm
Running nvidia-smi will then create the /dev/nvidia0 and /dev/nvidiactl devices.
Looks like the most recent commit removed the modprobe in favor of just running nvidia-smi.
So, I guess this is the answer for the TODO @Jip-Hop: nvidia-smi is necessary, but not sufficient :) The modprobe and nvidia-modprobe must be run as well.
@Jip-Hop I just went through the latest script (0.0.1 and thanks for adding versioning) and I think it's really coming together, like the changes, learned a few new things about Python too so thanks :)
I'm using 0.0.1 now and so far so good, gone through multiple reboot tests and everything launches clean & my GPU works, I'm able to use hw transcoding in Plex & Tdarr (tested both after each). Haven't seen any other problems (performance, etc.) yet but will keep an eye on things. I think I'm ready to switch over to this full time vs. running docker directly on the host. Famous last words but: Fingers crossed :)
Yep, I just saw he removed it as well BUT I think that's fine, I am pretty certain the correct order to load the modules during boot is pre-init so probably just an instruction to add it as a pre-init command is enough. That's what we did when we first started running DIY docker with Scale.
Here's a screenshot @Jip-Hop if you want to add it to the readme:
With the pre-init script, things are working, but it looks like nvidia-container-cli doesn't. It seems that 'something' still isn't initialized without running nvidia-smi, but the latest script checks for /dev/nvidia-uvm to decide whether to run nvidia-smi. Ended up with this error on jlmkr.py start:
nvidia-container-cli: initialization error: nvml error: driver not loaded
Unable to detect which nvidia driver files to mount.
Falling back to hard-coded list of nvidia files...
I decided to just add nvidia-smi to my pre-init command. I also thought it might be a good idea to run nvidia-modprobe regardless of whether the modprobe nvidia-current-uvm works (in case the module name changes back to just nvidia-uvm in the future...).
I also changed to detect the path to modprobe instead of relying on PATH or on a hard-coded path. Probably not necessary, but I found it interesting.
So, my final pre-init command is:
[ ! -f /dev/nvidia-uvm ] && ( $(cat /proc/sys/kernel/modprobe) nvidia-current-uvm; /usr/bin/nvidia-modprobe -c0 -u; nvidia-smi -f /dev/null )
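Expanded for readability, that one-liner is roughly equivalent to the sketch below. Note this is an illustration rather than a drop-in replacement: the paths are the TrueNAS SCALE ones from this thread, and it uses `-e` where the original uses `-f` (a /dev/nvidia-uvm device node is a character device, not a regular file, so `-f` is always false and the original guard always fires).

```shell
#!/bin/sh
# Sketch of the pre-init one-liner above, expanded for readability.
# Assumes TrueNAS SCALE paths as discussed in this thread.
if [ ! -e /dev/nvidia-uvm ]; then
    # /proc/sys/kernel/modprobe holds the path to the modprobe binary
    MODPROBE=$(cat /proc/sys/kernel/modprobe)
    # SCALE ships the module as nvidia-current-uvm rather than nvidia-uvm
    "$MODPROBE" nvidia-current-uvm
    # Creates /dev/nvidia-uvm and /dev/nvidia-uvm-tools
    /usr/bin/nvidia-modprobe -c0 -u
    # Running nvidia-smi creates /dev/nvidia0 and /dev/nvidiactl
    nvidia-smi -f /dev/null
fi
```

Since this is a boot-time configuration fragment that depends on nvidia hardware and the SCALE module layout, it can only be meaningfully exercised on such a host.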
I had no idea it would take 5 days and about 100 comments to get nvidia passthrough working >.<
Updated the script to v0.0.2. I removed some code I think we no longer need, as long as the pre-init command is scheduled (this one or the one above this comment).
Would be great if you could run through the testing sequence again (and run whatever additional tests you think are relevant).
If this works I'll add documentation regarding the pre-init command.
P.S. @TrueJournals if you have an idea how to run ldconfig inside the jail without having to resort to hardcoding /usr/lib/x86_64-linux-gnu/nvidia/current and writing a new .conf file, that would be great. I tried a few different things without success, and I'm not too thrilled about the current solution.
Alright, you got me curious ;) I dug into this, because I was curious how nvidia handled this. So I dug through libnvidia-container and container-toolkit. Here's what I can tell...
TLDR: They find all unique folders from nvidia-container-cli list, and create a file in /etc/ld.so.conf.d based on that.
nvidia has a hard-coded list of libraries in libnvidia-container. Actually, there are multiple lists, depending on what capabilities you want in the container. To find the full path to these libraries, they parse the ldcache file directly to turn the short library names into full paths. You can see that in find_library_paths.
Over in container-toolkit (which contains the 'hooks' for when containers are created), there's code to get a list of libraries from "mounts" (a little unclear what these mounts are; presumably mounts on the container?) by matching paths against lib?*.so* (the syntax for Match). In the same file, there's a function that generates the list of unique folders for this list of files.
Finally, they create a file in /etc/ld.so.conf.d with a random name that lists all these folders, and run ldconfig. It looks like this happens outside the container itself, using the -r option of ldconfig.
Now, what I'm still a little confused by is that I don't actually see this happening in my docker container. What's also weird is that libraries show up like this:
root@f7ca5192b700:/# ls -al /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1
lrwxrwxrwx 1 root root 29 Mar 2 17:14 /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1 -> libnvidia-encode.so.515.65.01
Even though that library is located at /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.515.65.01 on the "host" (the 'jail' in this case). So it seems like there's another level of remapping and some additional optimization, but I'm not quite sure how that works.
Anyway, the logic of 'discover the paths based on the list of libraries' seems reasonable enough. You could even run nvidia-container-cli list --libraries to get the list of libraries (without binaries and other files) if you didn't want to filter down based on filename patterns.
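The dirname/sort step of that idea can be sketched without nvidia hardware by running it over a hard-coded sample of library paths. The paths below are taken from the bind mounts earlier in this thread; the .conf file name in the comment is hypothetical.

```shell
# Sketch: derive the unique set of directories for an ld.so.conf.d file
# from a list of library paths, as `nvidia-container-cli list --libraries`
# would produce. Sample paths are hard-coded so this runs anywhere.
libs='/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.65.01'
# One dirname call per path, then de-duplicate
dirs=$(printf '%s\n' "$libs" | xargs -n1 dirname | sort -u)
echo "$dirs"
# On a real jail host you would instead do something like:
#   nvidia-container-cli list --libraries | xargs -n1 dirname | sort -u \
#       > /etc/ld.so.conf.d/nvidia-jail.conf && ldconfig
```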
Thanks for digging into this :)
TLDR: They find all unique folders from nvidia-container-cli list, and create a file in /etc/ld.so.conf.d based on that.
Well, then I will no longer feel bad for writing that file :')
Using the output of nvidia-container-cli list --libraries to determine the content of our .conf file sounds like a nice improvement.
By the way, how is v0.0.2 for you? :)
Just tried v0.0.2 and it seems to work fine (I can only reboot my server so many times in a day :laughing: )
Also sent you a PR to implement the above suggestion of discovering library paths based on the output of nvidia-container-cli list --libraries. Tested with a new and an existing jail locally and it seems to behave fine.
We're now on v0.0.3 thanks to @TrueJournals :)
I've added the pre-init command instructions to the readme.
Looking forward to hearing from @Ixian and @Talung one last time if all is working properly. Hopefully we can soon close this issue.
Updated to 0.0.3, rebooted, all working, Plex hw transcoding working.
Question: Do we need to re-generate a new jail with each version i.e. has the cli launch command in the config file changed? I'm still testing with the jail I created with 0.0.1.
Nice!
The debugging we did with the script may have left some residual files (symlinks, empty folders), so recreating may not be a bad idea.
But in general my intention is that there should not be a need to regenerate a jail when using a newer version of the script.
I'm happy to close this now if you want, I think we've gotten it.
Getting this error:
Looks like everything might not be getting passed through.