Jip-Hop / jailmaker

Persistent Linux 'jails' on TrueNAS SCALE to install software (k3s, docker, portainer, podman, etc.) with full access to all files via bind mounts thanks to systemd-nspawn!
GNU Lesser General Public License v3.0

Nvidia passthrough broken #4

Closed Ixian closed 1 year ago

Ixian commented 1 year ago

Getting this error:


-- WARNING, the following logs are for debugging purposes only --

I0227 16:30:43.055366 3314 nvc.c:376] initializing library context (version=1.12.0, build=7678e1af094d865441d0bc1b97c3e72d15fcab50)
I0227 16:30:43.055432 3314 nvc.c:350] using root /
I0227 16:30:43.055437 3314 nvc.c:351] using ldcache /etc/ld.so.cache
I0227 16:30:43.055442 3314 nvc.c:352] using unprivileged user 65534:65534
I0227 16:30:43.055460 3314 nvc.c:393] attempting to load dxcore to see if we are running under Windows Subsystem for Linux (WSL)
I0227 16:30:43.055577 3314 nvc.c:395] dxcore initialization failed, continuing assuming a non-WSL environment
I0227 16:30:43.057645 3315 nvc.c:278] loading kernel module nvidia
I0227 16:30:43.057787 3315 nvc.c:282] running mknod for /dev/nvidiactl
I0227 16:30:43.057820 3315 nvc.c:286] running mknod for /dev/nvidia0
I0227 16:30:43.057840 3315 nvc.c:290] running mknod for all nvcaps in /dev/nvidia-caps
I0227 16:30:43.063197 3315 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap1 from /proc/driver/nvidia/capabilities/mig/config
I0227 16:30:43.063256 3315 nvc.c:218] running mknod for /dev/nvidia-caps/nvidia-cap2 from /proc/driver/nvidia/capabilities/mig/monitor
I0227 16:30:43.064371 3315 nvc.c:296] loading kernel module nvidia_uvm
I0227 16:30:43.064395 3315 nvc.c:300] running mknod for /dev/nvidia-uvm
I0227 16:30:43.064434 3315 nvc.c:305] loading kernel module nvidia_modeset
I0227 16:30:43.064464 3315 nvc.c:309] running mknod for /dev/nvidia-modeset
I0227 16:30:43.064644 3316 rpc.c:71] starting driver rpc service
I0227 16:30:43.064985 3314 rpc.c:135] driver rpc service terminated with signal 15
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory
I0227 16:30:43.065009 3314 nvc.c:434] shutting down library context

Looks like everything might not be getting passed through.

Jip-Hop commented 1 year ago

What does the log say after "Starting jail with the following command:" when you start the jail?

Also what is the output of nvidia-container-cli list on the host?

Thanks for testing and reporting!

Ixian commented 1 year ago
 sudo ./jlmkr.py start dockerjail
Config loaded!

Starting jail with the following command:

systemd-run --property=KillMode=mixed --property=Type=notify --property=RestartForceExitStatus=133 --property=SuccessExitStatus=133 --property=Delegate=yes --property=TasksMax=infinity --collect --setenv=SYSTEMD_NSPAWN_LOCK=0 --unit=jlmkr-dockerjail --working-directory=./jails/dockerjail '--description=My nspawn jail dockerjail [created with jailmaker]' --setenv=SYSTEMD_SECCOMP=0 --property=DevicePolicy=auto -- systemd-nspawn --keep-unit --quiet --boot --machine=dockerjail --directory=rootfs --capability=all '--system-call-filter=add_key keyctl bpf' '--property=DeviceAllow=char-drm rw' --bind=/dev/dri --bind=/mnt/ssd-storage/appdata/ --bind=/mnt/Slimz/

Starting jail with name: dockerjail

Running as unit: jlmkr-dockerjail.service

Check logging:
journalctl -u jlmkr-dockerjail

Check status:
systemctl status jlmkr-dockerjail

Stop the jail:
machinectl stop dockerjail

Get a shell:
machinectl shell dockerjail

and

$ nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-uvm
/dev/nvidia-uvm-tools
/dev/nvidia-modeset
/dev/nvidia0
/usr/lib/nvidia/current/nvidia-smi
/usr/bin/nvidia-persistenced
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.65.01
Jip-Hop commented 1 year ago

Thanks! I have just updated the python script. Could you try again please?

Ixian commented 1 year ago

Deleted the old jail and started over with a fresh one; getting this error when trying to start:

Do you want to start the jail? [Y/n] Y
Config loaded!
Traceback (most recent call last):
  File "/mnt/ssd-storage/jailmaker/./jlmkr.py", line 666, in <module>
    main()
  File "/mnt/ssd-storage/jailmaker/./jlmkr.py", line 651, in main
    create_jail(args.name)
  File "/mnt/ssd-storage/jailmaker/./jlmkr.py", line 613, in create_jail
    start_jail(jail_name)
  File "/mnt/ssd-storage/jailmaker/./jlmkr.py", line 108, in start_jail
    if subprocess.run(['modprobe', 'br_netfilter']).returncode == 0:
  File "/usr/lib/python3.9/subprocess.py", line 505, in run
    with Popen(*popenargs, **kwargs) as process:
  File "/usr/lib/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.9/subprocess.py", line 1823, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'modprobe'
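The FileNotFoundError above is raised by subprocess when the modprobe executable itself cannot be found (it normally lives in /usr/sbin or /sbin, which may not be on the PATH of the invoking shell). A minimal defensive sketch, not the actual jlmkr.py fix, that resolves the binary before calling it:

import shutil
import subprocess

def load_module(module):
    """Hypothetical helper: run modprobe without assuming it is on the PATH."""
    # Look up modprobe explicitly; fall back to the usual sbin locations.
    modprobe = shutil.which("modprobe") or shutil.which(
        "modprobe", path="/usr/sbin:/sbin:/usr/local/sbin")
    if modprobe is None:
        print(f"modprobe not found, skipping load of {module}")
        return False
    return subprocess.run([modprobe, module]).returncode == 0

# Equivalent of the failing call in the traceback above:
load_module("br_netfilter")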
Ixian commented 1 year ago

More error detail - appears to be an extraneous `--bind-ro==` in the generated command line now?

Config loaded!

Starting jail with the following command:

systemd-run --property=KillMode=mixed --property=Type=notify --property=RestartForceExitStatus=133 --property=SuccessExitStatus=133 --property=Delegate=yes --property=TasksMax=infinity --collect --setenv=SYSTEMD_NSPAWN_LOCK=0 --unit=jlmkr-gtjail --working-directory=./jails/gtjail '--description=My nspawn jail gtjail [created with jailmaker]' --setenv=SYSTEMD_SECCOMP=0 --property=DevicePolicy=auto -- systemd-nspawn --keep-unit --quiet --boot --machine=gtjail --directory=rootfs --capability=all '--system-call-filter=add_key keyctl bpf' '--property=DeviceAllow=char-drm rw' --bind=/dev/dri --bind=/dev/nvidiactl --bind=/dev/nvidia-uvm --bind=/dev/nvidia-uvm-tools --bind=/dev/nvidia-modeset --bind=/dev/nvidia0 --bind-ro==/usr/lib/nvidia/current/nvidia-smi --bind-ro==/usr/bin/nvidia-persistenced --bind-ro==/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.515.65.01 --bind-ro==/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.515.65.01 --bind-ro==/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.515.65.01 --bind-ro==/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.515.65.01 --bind-ro==/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.65.01 --bind-ro==/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.515.65.01 --bind-ro==/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.515.65.01 --bind-ro==/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.65.01 --bind-ro==/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.65.01 --bind-ro==/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.65.01 --bind-ro==/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.65.01 --bind-ro==/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.65.01 --bind-ro==/usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.515.65.01 --bind-ro==/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.65.01 --bind-ro== --bind=/mnt/ssd-storage/ --bind=/mnt/Slimz/

Starting jail with name: gtjail

Job for jlmkr-gtjail.service failed.
See "systemctl status jlmkr-gtjail.service" and "journalctl -xe" for details.

Failed to start the jail...
In case of a config error, you may fix it with:
nano jails/gtjail/config
Jip-Hop commented 1 year ago

Ah you're right that doesn't look good. If you replace the double == with single ones and run the command itself directly to start the jail, does the Nvidia driver work inside the jail?

If so we know this approach will work and I should fix the double == in the code.

Thanks for helping. Since I don't have an Nvidia GPU I couldn't test this part :)

Jip-Hop commented 1 year ago

Should be fixed now.

Ixian commented 1 year ago

Still same problem - I notice you changed this:

systemd_nspawn_additional_args.append(
                        f"--bind-ro={file_path}")

However, it still isn't appending {file_path}; it just outputs a blank `--bind-ro=`, and that is what stops the jail from starting.

If I remove the blank line I can start the jail however Nvidia drivers still don't appear to work inside it.

Something in the routine that mounts the directories (the part that detects whether a path is under /dev or not) seems to be broken, but I can't see what.
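A plausible cause, sketched below as an illustration (this is not the actual jlmkr.py code): if the output of nvidia-container-cli list is split on newlines, the trailing newline yields one empty string, which then turns into a bare --bind-ro= argument.

import subprocess

# Sketch: build the systemd-nspawn bind-mount flags from nvidia-container-cli output.
result = subprocess.run(["nvidia-container-cli", "list"],
                        capture_output=True, text=True)

systemd_nspawn_additional_args = []
for file_path in result.stdout.split("\n"):
    # Splitting on "\n" leaves an empty string after the final newline;
    # skipping it avoids emitting a bare --bind-ro= that breaks systemd-nspawn.
    if not file_path:
        continue
    if file_path.startswith("/dev/"):
        systemd_nspawn_additional_args.append(f"--bind={file_path}")
    else:
        systemd_nspawn_additional_args.append(f"--bind-ro={file_path}")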

Ixian commented 1 year ago

More info:

The problem I outline above about the extraneous `--bind-ro==` appended to the launch string will prevent the machine from starting; however, you can edit around that, since it does appear to bind all the other directories, it's just adding that blank one at the end. I am not familiar enough with Python and how it handles loops (other than that foreach is implicit) but that is likely simple to fix.

The bigger issue is it's still not passing through everything needed from the host as the following error still happens even when I mod the startup to get the jail running:

root@dockjail:~# nvidia-container-cli list
nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory
Jip-Hop commented 1 year ago

Empty bind-ro line should now be fixed. Thanks! What happens when you run nvidia-smi -a directly inside the jail?

Jip-Hop commented 1 year ago

Also please try these steps inside a fresh jail: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/nvidia-docker.html

Jip-Hop commented 1 year ago

I think running ldconfig once inside the jail may cause the mounted drivers to be detected. https://github.com/NVIDIA/nvidia-docker/issues/854

Ixian commented 1 year ago

Thanks Jip-Hop - the empty bind-ro line is indeed fixed (and I learned something about Python today reading your commit), however the Nvidia problems remain. Even running ldconfig in the jail, or in an Nvidia container, i.e.:

docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 /bin/bash -c "ldconfig && nvidia-smi"

Still fails with the same error:

nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
Ixian commented 1 year ago

Also, nvidia-container-runtime list produces blank output:

nvidia-container-runtime list
ID          PID         STATUS      BUNDLE      CREATED     OWNER

and nvidia-smi isn't available inside the jail at all.

As a sanity check, it does all work outside the jail, I double-checked to make sure I hadn't opened a shell on the wrong machine :)

Jip-Hop commented 1 year ago

Could you try /usr/lib/nvidia/current/nvidia-smi -a inside the jail? Perhaps after running ldconfig once inside the jail. The nvidia-smi binary should be available inside the jail as far as I can tell from the bind mount flags you've posted. It's probably not on the PATH, so you need to use the absolute path.

Ixian commented 1 year ago
# /usr/lib/nvidia/current/nvidia-smi -a
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

Edit: Also ran ldconfig

Jip-Hop commented 1 year ago

I suppose there may still be a (config) file missing in the list of files to bind mount.

This shows the approach should work: https://wiki.archlinux.org/title/systemd-nspawn#Nvidia_GPUs

Maybe something is missing from our list?

Jip-Hop commented 1 year ago

Aha!

Please also try adding directory that contains libnvidia-ml.so to your system PATH.

Ixian commented 1 year ago

libnvidia-ml.so isn't being passed to the jail; find / -name libnvidia-ml.so returns nothing. On the Scale host itself it returns

]# find / -name libnvidia-ml.so
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so

Which doesn't appear to be bound in your script looking at how the directories are enumerated, unless I am missing a piece?

Jip-Hop commented 1 year ago

Can you search again with a wildcard at the end? It is being bind mounted, but it has a different suffix...

Jip-Hop commented 1 year ago

Maybe I need to do something similar to this:

https://github.com/NVIDIA/nvidia-docker/issues/1163#issuecomment-1075053593

Too bad this needs additional investigation...

Ixian commented 1 year ago

find / -name libnvidia-ml.so

Yes, it finds /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.515.65.01

Jip-Hop commented 1 year ago

OK, so I have now hard-coded mounting /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 as well, since it seems this is not listed by nvidia-container-cli but is required for it to work.
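As an illustration only (the actual commit may look different), such a hard-coded addition could be as simple as:

import os

# Sketch: nvidia-container-cli does not list the .so.1 name that nvidia-smi
# dlopen()s at runtime, so bind mount it explicitly when it exists on the host.
extra_files = ["/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1"]

systemd_nspawn_additional_args = []  # normally the list built from nvidia-container-cli
for file_path in extra_files:
    if os.path.exists(file_path):
        systemd_nspawn_additional_args.append(f"--bind-ro={file_path}")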

I no longer get the error related to libnvidia-ml.so.1 inside the jail. Now I get this (which I also get on the host, so that's probably related to me not having an Nvidia GPU).

/usr/lib/nvidia/current/nvidia-smi -a
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

Has this fixed it for you?

Ixian commented 1 year ago

Ah, progress :) Yes, now nvidia-smi picks it up in the jail itself, however it fails inside containers running in the jail. Looks like /usr/lib/nvidia/current needs to be in the system path, imagine that would be better to do with the script?

Ixian commented 1 year ago

Actually, the problem is a little weirder.

I run this (standard test, from the Nvidia site, done it dozens of times):

docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

And get this error

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown.

But even adding the correct directory to my PATH in .bashrc etc. doesn't fix it. Something else strange is going on here.

Jip-Hop commented 1 year ago

Probably needs to be on the PATH system-wide, not just for the current user?

But nice, progress!

Ixian commented 1 year ago

Something is off with it and it has to be due to how the drivers are pulled in from the host. We still might be missing something.

Jip-Hop commented 1 year ago

When you get to the point that it works inside the jail, but not in a docker container, can you try (after having installed nvidia docker):

docker run --rm --gpus all nvidia/cuda:11.0-base bash -c "ldconfig && nvidia-smi"

Talung commented 1 year ago

Just tried the latest update to test the nvidia part, and am also getting errors starting it. Config file looks fine.

Mar 01 20:51:05 truenas systemd-nspawn[1986823]: Failed to stat /dev/nvidia-modeset: No such file or directory
Mar 01 20:51:05 truenas systemd[1]: jlmkr-dockerjail.service: Main process exited, code=exited, status=1/FAILURE

I can run nvidia-smi

root@truenas[~]# nvidia-smi
Wed Mar  1 20:53:07 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:03:00.0 Off |                  N/A |
| 29%   38C    P5    20W / 180W |      0MiB /  8192MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

and

root@truenas[/mnt/pond/jailmaker]# nvidia-container-cli list
/dev/nvidiactl
/dev/nvidia-modeset
/dev/nvidia0
/usr/lib/nvidia/current/nvidia-smi
/usr/bin/nvidia-persistenced
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.65.01
/usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.515.65.01
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.65.01

so it is there, just not being picked up? I can't check what is passed through to the jail itself, as I can't get it running.

Jip-Hop commented 1 year ago

Inside the jail please follow the official steps to get nvidia working with Docker: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/nvidia-docker.html

That would also set up the daemon.json file with nvidia settings.

Then please run ldconfig inside the jail once and then try:

docker run --rm --gpus all nvidia/cuda:11.0-base bash -c "ldconfig && nvidia-smi"

Looking forward to hearing how that goes.

Ixian commented 1 year ago

When you get to the point that it works inside the jail, but not in a docker container, can you try (after having installed nvidia docker):

docker run --rm --gpus all nvidia/cuda:11.0-base bash -c "ldconfig && nvidia-smi"

Has no effect.

The problem now boils down to paths & links that don't match the host.

For example, I couldn't get nvidia-smi to work inside a container because it wasn't being correctly referenced in the jail. I tried creating a symbolic link to it inside /usr/bin (like the Scale host) but then this error came back

# docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.

Likely because now nvidia-smi can't reference the correct files. All of this boils down to the fact that we're trying to replicate, inside the jail, how the Scale host has the drivers installed, and we're still missing things.

I'm starting to wonder if it wouldn't just be easier to go back to how we used to do it i.e. install the correct matching drivers inside the jail vs. trying to leverage all of that from the host. It's not like Scale has dozens of nvidia-driver updates a year; historically they only update them as part of major updates that come once or twice a year. And no guarantee they won't change something that breaks this method even if we do get it working.

Here is how Scale handles some links/etc. on my system:

 ls -lah /usr/bin/nvidia*
-rwxr-xr-x 1 root root  51K Sep  5 06:52 /usr/bin/nvidia-container-cli
-rwxr-xr-x 1 root root 3.9M Sep  6 02:23 /usr/bin/nvidia-container-runtime
-rwxr-xr-x 1 root root 2.1M Sep  6 02:23 /usr/bin/nvidia-container-runtime-hook
lrwxrwxrwx 1 root root   38 Dec 13 05:45 /usr/bin/nvidia-container-toolkit -> /usr/bin/nvidia-container-runtime-hook
-rwxr-xr-x 1 root root 3.3M Sep  6 02:23 /usr/bin/nvidia-ctk
-rwsr-xr-x 1 root root 174K Jul 21  2022 /usr/bin/nvidia-modprobe
-rwxr-xr-x 1 root root 241K Jul 21  2022 /usr/bin/nvidia-persistenced
lrwxrwxrwx 1 root root   36 Dec 13 05:45 /usr/bin/nvidia-smi -> /etc/alternatives/nvidia--nvidia-smi
ls -lah /etc/alternatives/nvidia*
lrwxrwxrwx 1 root root 23 Dec 13 05:50 /etc/alternatives/nvidia -> /usr/lib/nvidia/current
lrwxrwxrwx 1 root root 59 Dec 13 05:50 /etc/alternatives/nvidia--libGLX_nvidia.so.0-x86_64-linux-gnu -> /usr/lib/x86_64-linux-gnu/nvidia/current/libGLX_nvidia.so.0
lrwxrwxrwx 1 root root 51 Dec 13 05:50 /etc/alternatives/nvidia--libcuda.so-x86_64-linux-gnu -> /usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so
lrwxrwxrwx 1 root root 53 Dec 13 05:50 /etc/alternatives/nvidia--libcuda.so.1-x86_64-linux-gnu -> /usr/lib/x86_64-linux-gnu/nvidia/current/libcuda.so.1
lrwxrwxrwx 1 root root 54 Dec 13 05:50 /etc/alternatives/nvidia--libnvcuvid.so-x86_64-linux-gnu -> /usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so
lrwxrwxrwx 1 root root 56 Dec 13 05:50 /etc/alternatives/nvidia--libnvcuvid.so.1-x86_64-linux-gnu -> /usr/lib/x86_64-linux-gnu/nvidia/current/libnvcuvid.so.1
lrwxrwxrwx 1 root root 59 Dec 13 05:50 /etc/alternatives/nvidia--libnvidia-cfg.so.1-x86_64-linux-gnu -> /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-cfg.so.1
lrwxrwxrwx 1 root root 62 Dec 13 05:50 /etc/alternatives/nvidia--libnvidia-encode.so.1-x86_64-linux-gnu -> /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-encode.so.1
lrwxrwxrwx 1 root root 58 Dec 13 05:50 /etc/alternatives/nvidia--libnvidia-ml.so.1-x86_64-linux-gnu -> /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ml.so.1
lrwxrwxrwx 1 root root 58 Dec 13 05:50 /etc/alternatives/nvidia--libnvidia-nvvm.so-x86_64-linux-gnu -> /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-nvvm.so
lrwxrwxrwx 1 root root 60 Dec 13 05:50 /etc/alternatives/nvidia--libnvidia-nvvm.so.4-x86_64-linux-gnu -> /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-nvvm.so.4
lrwxrwxrwx 1 root root 70 Dec 13 05:50 /etc/alternatives/nvidia--libnvidia-ptxjitcompiler.so.1-x86_64-linux-gnu -> /usr/lib/x86_64-linux-gnu/nvidia/current/libnvidia-ptxjitcompiler.so.1
lrwxrwxrwx 1 root root 50 Dec 13 05:50 /etc/alternatives/nvidia--nvidia-blacklists-nouveau.conf -> /etc/nvidia/current/nvidia-blacklists-nouveau.conf
lrwxrwxrwx 1 root root 47 Dec 13 05:50 /etc/alternatives/nvidia--nvidia-drm-outputclass.conf -> /etc/nvidia/current/nvidia-drm-outputclass.conf
lrwxrwxrwx 1 root root 36 Dec 13 05:50 /etc/alternatives/nvidia--nvidia-load.conf -> /etc/nvidia/current/nvidia-load.conf
lrwxrwxrwx 1 root root 40 Dec 13 05:50 /etc/alternatives/nvidia--nvidia-modprobe.conf -> /etc/nvidia/current/nvidia-modprobe.conf
lrwxrwxrwx 1 root root 34 Dec 13 05:50 /etc/alternatives/nvidia--nvidia-smi -> /usr/lib/nvidia/current/nvidia-smi
lrwxrwxrwx 1 root root 39 Dec 13 05:50 /etc/alternatives/nvidia--nvidia-smi.1.gz -> /usr/lib/nvidia/current/nvidia-smi.1.gz
Jip-Hop commented 1 year ago

Ah, progress :) Yes, now nvidia-smi picks it up in the jail itself, however it fails inside containers running in the jail. Looks like /usr/lib/nvidia/current needs to be in the system path, imagine that would be better to do with the script?

Let's focus on this first. If we can get nvidia drivers working inside the jail with the current approach, it should not be far away to make it work inside a docker container in the jail as well.

I agree the old approach is tempting at this point. But I'd prefer a solution which doesn't involve downloading, unpacking and installing drivers each time a jail needs to be created (all the required files are already present on the host after all).

Please try one more time with the latest script and report the output of nvidia-smi in the jail. Would be good to verify if using GPU acceleration works inside a jail directly.

Ixian commented 1 year ago

Ah, progress :) Yes, now nvidia-smi picks it up in the jail itself, however it fails inside containers running in the jail. Looks like /usr/lib/nvidia/current needs to be in the system path, imagine that would be better to do with the script?

Let's focus on this first. If we can get nvidia drivers working inside the jail with the current approach, it should not be far away to make it work inside a docker container in the jail as well.

I agree the old approach is tempting at this point. But I'd prefer a solution which doesn't involve downloading, unpacking and installing drivers each time a jail needs to be created (all the required files are already present on the host after all).

Please try one more time with the latest script and report the output of nvidia-smi in the jail. Would be good to verify if using GPU acceleration works inside a jail directly.

Different error now (btw I am doing this with clean jails so I can start fresh each testing round):

Preparing to unpack .../libnvidia-container1_1.12.0-1_amd64.deb ...
Unpacking libnvidia-container1:amd64 (1.12.0-1) ...
dpkg: error processing archive /var/cache/apt/archives/libnvidia-container1_1.12.0-1_amd64.deb (--unpack):
 unable to make backup link of './usr/lib/x86_64-linux-gnu/libnvidia-container-go.so.1' before installing new version: Invalid cross-device link
Errors were encountered while processing:
 /var/cache/apt/archives/libnvidia-container1_1.12.0-1_amd64.deb
E: Sub-process /usr/bin/dpkg returned an error code (1)

When I try to install nvidia-container-toolkit per the Nvidia instructions.

All the Googling I've done so far on the error returns some variation of "re-install the drivers", so I'm guessing some link is still missing or broken.

Ixian commented 1 year ago

Update: it's possible we can now skip the step of installing the container toolkit, since we're pulling that in from the host too. That may be why installing the container toolkit in the jail fails now (I noticed the directory for that file is read-only inside the jail, for understandable reasons, given we don't want to mess with the host).

With a fresh jail install using the latest script, I can open a shell into the jail and successfully run nvidia-smi. I also updated the docker daemon.json file to use the nvidia runtime. However, I still get the same error when I try to have a container inside the jail run nvidia-smi:

root@jaildock:~# sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
NVIDIA-SMI couldn't find libnvidia-ml.so library in your system. Please make sure that the NVIDIA Display Driver is properly installed and present in your system.
Please also try adding directory that contains libnvidia-ml.so to your system PATH.
root@jaildock:~#
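"Updated the docker daemon.json file to use the nvidia runtime" usually means registering the runtime as in the sketch below (written from Python here for illustration; running nvidia-ctk runtime configure --runtime=docker, as mentioned later in the thread, produces an equivalent result):

import json
from pathlib import Path

# Sketch: add the nvidia runtime to Docker's daemon configuration inside the jail.
daemon_json = Path("/etc/docker/daemon.json")
config = json.loads(daemon_json.read_text()) if daemon_json.exists() else {}
config.setdefault("runtimes", {})["nvidia"] = {
    "path": "nvidia-container-runtime",
    "runtimeArgs": [],
}
daemon_json.write_text(json.dumps(config, indent=4) + "\n")
# Restart dockerd afterwards (systemctl restart docker) for the change to take effect.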
CompyENG commented 1 year ago

I've been poking at this tonight. I've been trying for a while to upgrade my TrueNAS, found this script, but didn't realize that Nvidia support still had some problems.

And... I got something working!

root@docker:~# docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Thu Mar  2 01:42:09 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 515.65.01    Driver Version: 515.65.01    CUDA Version: 11.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:0A:00.0 Off |                  N/A |
|  0%   42C    P0    23W / 150W |      0MiB /  6144MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Modifications I made on top of the latest version of the script:

Edit: Seems like only ldconfig is necessary from the above. However, even with this I can't get Plex to use HW transcoding. nvidia-smi works in my Plex container, but it refuses to use HW transcoding.

I was able to get /dev/nvidia-modeset to pop into existence by starting the nvidia-persistenced service on TrueNAS. Not sure if this is important or not.
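A small sketch of that workaround on the host, assuming the persistence daemon is exposed as the usual nvidia-persistenced systemd unit (the unit name is an assumption based on the Debian package, not something stated above):

import os
import subprocess

# Sketch: /dev/nvidia-modeset is created on demand; starting the persistence
# daemon on the TrueNAS host (as reported above) is one way to trigger its
# creation before the jail tries to bind mount the device node.
if not os.path.exists("/dev/nvidia-modeset"):
    subprocess.run(["systemctl", "start", "nvidia-persistenced"], check=False)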

Jip-Hop commented 1 year ago

Thanks for all the input, both of you! Glad to see it starts working in the jail, and somewhat inside docker too already :') My intention wasn't to also mount the nvidia-container-toolkit from the host. Probably need to narrow down again to a minimal list of required driver files. Then the driver should work in the jail, and installing nvidia-container-toolkit manually should work. Once nvidia-container-toolkit is installed manually I expect the GPU to work properly inside docker too. Should also look into nvidia-persistenced... interesting finding!

Jip-Hop commented 1 year ago

I've limited the list of files being mounted again. Hopefully driver still works and allows installing nvidia-container-toolkit manually.

Ixian commented 1 year ago

Well well, look what we have here:

[screenshot: Plex dashboard showing hardware (hw) transcoding]

Running in a container in the jail :)

I used the latest script and installed docker and the Nvidia toolkit as normal (no errors). The last change I had to make was one of the ones @TrueJournals suggested: created /etc/ld.so.conf.d/nvidia.conf containing /usr/lib/x86_64-linux-gnu/nvidia/current and ran ldconfig. That cleared up the last path issue (and I learned something more about how ldconfig works). From there I just mounted my external directories (docker apps and media, the same ones I use to run docker directly on the host today), pulled the container images down, and brought my compose stack up.
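A minimal sketch of that last step as it could be scripted inside the jail (paths taken from the comment above; later in the thread the script is said to automate something equivalent):

import subprocess
from pathlib import Path

# Sketch: tell the dynamic linker where the bind-mounted Nvidia libraries live,
# then rebuild the cache so libnvidia-ml.so.1 and friends resolve.
Path("/etc/ld.so.conf.d/nvidia.conf").write_text(
    "/usr/lib/x86_64-linux-gnu/nvidia/current\n")
subprocess.run(["ldconfig"], check=True)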

I'm still testing various things but so far, so good. Plex hw transcoding works :) I need to go through all my apps (I have a couple dozen in my Compose stacks) so fingers crossed!

Ixian commented 1 year ago

Update: minor error with the script now:

sudo ./start-jail.sh
Config loaded!
nvidia-container-cli: initialization error: nvml error: driver not loaded

Failed to run nvidia-container-cli.
Unable to detect which nvidia driver files to mount.
Falling back to hard-coded list of nvidia files...

Attempting to run nvidia-container-cli inside the jail produces the following:

nvidia-container-cli list
nvidia-container-cli: initialization error: nvml error: driver not loaded

However, nvidia-smi still works, as does HW transcoding in Plex. Still, we should chase this down because there are probably going to be other problems due to this error.

Jip-Hop commented 1 year ago

I've updated the script again. We are making progress! Thanks @Ixian and @TrueJournals!

Jip-Hop commented 1 year ago

The new script already calls ldconfig with an appropriate config file, so you should only need to install docker and the nvidia container toolkit (hopefully no further action needed).

Ixian commented 1 year ago

I just tried your latest version @Jip-Hop but the error I mentioned with nvidia-container-cli is still present.

Maybe this will help: Here's the output of dpkg -l '*nvidia*' inside the jail:

dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                          Version      Architecture Description
+++-=============================-============-============-=====================================================
ii  libnvidia-container-tools     1.12.0-1     amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64    1.12.0-1     amd64        NVIDIA container runtime library
un  nvidia-container-runtime      <none>       <none>       (no description available)
un  nvidia-container-runtime-hook <none>       <none>       (no description available)
ii  nvidia-container-toolkit      1.12.0-1     amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base 1.12.0-1     amd64        NVIDIA Container Toolkit Base

And here is the output of the same from the Scale host:

dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                   Version       Architecture Description
+++-======================================-=============-============-=================================================================
un  bumblebee-nvidia                       <none>        <none>       (no description available)
ii  glx-alternative-nvidia                 1.2.1~deb11u1 amd64        allows the selection of NVIDIA as GLX provider
un  libegl1-glvnd-nvidia                   <none>        <none>       (no description available)
un  libgl1-glvnd-nvidia-glx                <none>        <none>       (no description available)
un  libgl1-nvidia-glx                      <none>        <none>       (no description available)
un  libgl1-nvidia-legacy-390xx-glx         <none>        <none>       (no description available)
un  libgl1-nvidia-tesla-418-glx            <none>        <none>       (no description available)
un  libgldispatch0-nvidia                  <none>        <none>       (no description available)
un  libgles1-glvnd-nvidia                  <none>        <none>       (no description available)
un  libgles2-glvnd-nvidia                  <none>        <none>       (no description available)
un  libglvnd0-nvidia                       <none>        <none>       (no description available)
ii  libglx-nvidia0:amd64                   515.65.01-1   amd64        NVIDIA binary GLX library
un  libglx0-glvnd-nvidia                   <none>        <none>       (no description available)
ii  libnvidia-cfg1:amd64                   515.65.01-1   amd64        NVIDIA binary OpenGL/GLX configuration library
un  libnvidia-cfg1-any                     <none>        <none>       (no description available)
ii  libnvidia-container-tools              1.11.0-1      amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64             1.11.0-1      amd64        NVIDIA container runtime library
ii  libnvidia-eglcore:amd64                515.65.01-1   amd64        NVIDIA binary EGL core libraries
un  libnvidia-eglcore-515.65.01            <none>        <none>       (no description available)
ii  libnvidia-encode1:amd64                515.65.01-1   amd64        NVENC Video Encoding runtime library
ii  libnvidia-glcore:amd64                 515.65.01-1   amd64        NVIDIA binary OpenGL/GLX core libraries
un  libnvidia-glcore-515.65.01             <none>        <none>       (no description available)
ii  libnvidia-glvkspirv:amd64              515.65.01-1   amd64        NVIDIA binary Vulkan Spir-V compiler library
un  libnvidia-glvkspirv-515.65.01          <none>        <none>       (no description available)
un  libnvidia-legacy-340xx-cfg1            <none>        <none>       (no description available)
un  libnvidia-legacy-390xx-cfg1            <none>        <none>       (no description available)
un  libnvidia-ml.so.1                      <none>        <none>       (no description available)
ii  libnvidia-ml1:amd64                    515.65.01-1   amd64        NVIDIA Management Library (NVML) runtime library
ii  libnvidia-nvvm4:amd64                  515.65.01-1   amd64        NVIDIA NVVM
ii  libnvidia-ptxjitcompiler1:amd64        515.65.01-1   amd64        NVIDIA PTX JIT Compiler
ii  libnvidia-rtcore:amd64                 515.65.01-1   amd64        NVIDIA binary Vulkan ray tracing (rtcore) library
un  libnvidia-rtcore-515.65.01             <none>        <none>       (no description available)
un  libnvidia-tesla-cfg1                   <none>        <none>       (no description available)
un  libopengl0-glvnd-nvidia                <none>        <none>       (no description available)
ii  nvidia-alternative                     515.65.01-1   amd64        allows the selection of NVIDIA as GLX provider
un  nvidia-alternative--kmod-alias         <none>        <none>       (no description available)
un  nvidia-alternative-legacy-173xx        <none>        <none>       (no description available)
un  nvidia-alternative-legacy-71xx         <none>        <none>       (no description available)
un  nvidia-alternative-legacy-96xx         <none>        <none>       (no description available)
ii  nvidia-container-runtime               3.11.0-1      all          NVIDIA container runtime
un  nvidia-container-runtime-hook          <none>        <none>       (no description available)
ii  nvidia-container-toolkit               1.11.0-1      amd64        NVIDIA Container toolkit
ii  nvidia-container-toolkit-base          1.11.0-1      amd64        NVIDIA Container Toolkit Base
un  nvidia-cuda-mps                        <none>        <none>       (no description available)
un  nvidia-current                         <none>        <none>       (no description available)
un  nvidia-current-updates                 <none>        <none>       (no description available)
un  nvidia-driver                          <none>        <none>       (no description available)
un  nvidia-driver-any                      <none>        <none>       (no description available)
un  nvidia-driver-binary                   <none>        <none>       (no description available)
ii  nvidia-installer-cleanup               20151021+13   amd64        cleanup after driver installation with the nvidia-installer
un  nvidia-kernel-515.65.01                <none>        <none>       (no description available)
ii  nvidia-kernel-common                   20151021+13   amd64        NVIDIA binary kernel module support files
ii  nvidia-kernel-dkms                     515.65.01-1   amd64        NVIDIA binary kernel module DKMS source
un  nvidia-kernel-open-dkms-515.65.01      <none>        <none>       (no description available)
un  nvidia-kernel-source                   <none>        <none>       (no description available)
ii  nvidia-kernel-support                  515.65.01-1   amd64        NVIDIA binary kernel module support files
un  nvidia-kernel-support--v1              <none>        <none>       (no description available)
un  nvidia-kernel-support-any              <none>        <none>       (no description available)
un  nvidia-legacy-304xx-alternative        <none>        <none>       (no description available)
un  nvidia-legacy-304xx-driver             <none>        <none>       (no description available)
un  nvidia-legacy-340xx-alternative        <none>        <none>       (no description available)
un  nvidia-legacy-390xx-vulkan-icd         <none>        <none>       (no description available)
ii  nvidia-legacy-check                    515.65.01-1   amd64        check for NVIDIA GPUs requiring a legacy driver
ii  nvidia-modprobe                        515.65.01-1   amd64        utility to load NVIDIA kernel modules and create device nodes
un  nvidia-nonglvnd-vulkan-common          <none>        <none>       (no description available)
un  nvidia-nonglvnd-vulkan-icd             <none>        <none>       (no description available)
ii  nvidia-persistenced                    515.65.01-1   amd64        daemon to maintain persistent software state in the NVIDIA driver
un  nvidia-settings                        <none>        <none>       (no description available)
ii  nvidia-smi                             515.65.01-1   amd64        NVIDIA System Management Interface
ii  nvidia-support                         20151021+13   amd64        NVIDIA binary graphics driver support files
un  nvidia-tesla-418-vulkan-icd            <none>        <none>       (no description available)
un  nvidia-tesla-440-vulkan-icd            <none>        <none>       (no description available)
un  nvidia-tesla-alternative               <none>        <none>       (no description available)
un  nvidia-vdpau-driver                    <none>        <none>       (no description available)
ii  nvidia-vulkan-common                   515.65.01-1   amd64        NVIDIA Vulkan driver - common files
ii  nvidia-vulkan-icd:amd64                515.65.01-1   amd64        NVIDIA Vulkan installable client driver (ICD)
un  nvidia-vulkan-icd-any                  <none>        <none>       (no description available)
rc  xserver-xorg-video-nvidia              515.65.01-1   amd64        NVIDIA binary Xorg driver
un  xserver-xorg-video-nvidia-any          <none>        <none>       (no description available)
un  xserver-xorg-video-nvidia-legacy-304xx <none>        <none>       (no description available)
Jip-Hop commented 1 year ago

Maybe try a reboot?

nvidia-container-cli: initialization error: nvml error: driver not loaded

That's from the host (TrueNAS). That's not supposed to fail on a system with an Nvidia card... And I think it was working for you before? Hope we didn't break your TrueNAS installation.

Ixian commented 1 year ago

It works if I run it (the command from the CLI) on the host, which is odd.

I noticed something interesting - it fails when the jail starts, and fails running inside the jail from the CLI, until I start my compose stack. As soon as Plex etc. come up it works, so something is obviously being initialized.

Jip-Hop commented 1 year ago

And you didn't notice this behavior with the previous method (downloading and installing the drivers from the .run file)?

Ixian commented 1 year ago

And you didn't notice this behavior with the previous method (downloading and installing the drivers from the .run file)?

Correct.

Jip-Hop commented 1 year ago

Here's a list of all files that were created (L) or changed (X) since installing the nvidia driver with the .run file (previously working method).

LEFT_DIR=rootfs
RIGHT_DIR=rootfs_before
# Files only present in rootfs (created since installing the driver):
rsync -rinl --ignore-existing "$LEFT_DIR"/ "$RIGHT_DIR"/|sed -e 's/^[^ ]* /L             /'
# Files only present in rootfs_before (i.e. removed):
rsync -rinl --ignore-existing "$RIGHT_DIR"/ "$LEFT_DIR"/|sed -e 's/^[^ ]* /R             /'
# Files present in both but changed:
rsync -rinl --existing "$LEFT_DIR"/ "$RIGHT_DIR"/|sed -e 's/^/X /'
L             etc/OpenCL/
L             etc/OpenCL/vendors/
L             etc/OpenCL/vendors/nvidia.icd
L             etc/systemd/system/systemd-hibernate.service.wants/
L             etc/systemd/system/systemd-hibernate.service.wants/nvidia-hibernate.service -> /usr/lib/systemd/system/nvidia-hibernate.service
L             etc/systemd/system/systemd-hibernate.service.wants/nvidia-resume.service -> /usr/lib/systemd/system/nvidia-resume.service
L             etc/systemd/system/systemd-suspend.service.wants/
L             etc/systemd/system/systemd-suspend.service.wants/nvidia-resume.service -> /usr/lib/systemd/system/nvidia-resume.service
L             etc/systemd/system/systemd-suspend.service.wants/nvidia-suspend.service -> /usr/lib/systemd/system/nvidia-suspend.service
L             etc/vulkan/
L             etc/vulkan/icd.d/
L             etc/vulkan/icd.d/nvidia_icd.json
L             etc/vulkan/implicit_layer.d/
L             etc/vulkan/implicit_layer.d/nvidia_layers.json
L             usr/bin/nvidia-bug-report.sh
L             usr/bin/nvidia-cuda-mps-control
L             usr/bin/nvidia-cuda-mps-server
L             usr/bin/nvidia-debugdump
L             usr/bin/nvidia-installer
L             usr/bin/nvidia-modprobe
L             usr/bin/nvidia-ngx-updater
L             usr/bin/nvidia-persistenced
L             usr/bin/nvidia-powerd
L             usr/bin/nvidia-settings
L             usr/bin/nvidia-sleep.sh
L             usr/bin/nvidia-smi
L             usr/bin/nvidia-uninstall -> nvidia-installer
L             usr/bin/nvidia-xconfig
L             usr/lib/libGL.so.1 -> /usr/lib/x86_64-linux-gnu/libGL.so.1
L             usr/lib/firmware/
L             usr/lib/firmware/nvidia/
L             usr/lib/firmware/nvidia/515.65.01/
L             usr/lib/firmware/nvidia/515.65.01/gsp.bin
L             usr/lib/nvidia/
L             usr/lib/nvidia/egl_dummy_vendor.json
L             usr/lib/nvidia/glvnd_check
L             usr/lib/nvidia/libGLX_installcheck.so.0
L             usr/lib/systemd/system-sleep/nvidia
L             usr/lib/systemd/system/nvidia-hibernate.service
L             usr/lib/systemd/system/nvidia-powerd.service
L             usr/lib/systemd/system/nvidia-resume.service
L             usr/lib/systemd/system/nvidia-suspend.service
L             usr/lib/x86_64-linux-gnu/libEGL.so -> libEGL.so.1
L             usr/lib/x86_64-linux-gnu/libEGL.so.1 -> libEGL.so.1.1.0
L             usr/lib/x86_64-linux-gnu/libEGL.so.1.1.0
L             usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.0 -> libEGL_nvidia.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libGL.so -> libGL.so.1
L             usr/lib/x86_64-linux-gnu/libGL.so.1 -> libGL.so.1.7.0
L             usr/lib/x86_64-linux-gnu/libGL.so.1.7.0
L             usr/lib/x86_64-linux-gnu/libGLESv1_CM.so -> libGLESv1_CM.so.1
L             usr/lib/x86_64-linux-gnu/libGLESv1_CM.so.1 -> libGLESv1_CM.so.1.2.0
L             usr/lib/x86_64-linux-gnu/libGLESv1_CM.so.1.2.0
L             usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.1 -> libGLESv1_CM_nvidia.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libGLESv1_CM_nvidia.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libGLESv2.so -> libGLESv2.so.2
L             usr/lib/x86_64-linux-gnu/libGLESv2.so.2 -> libGLESv2.so.2.1.0
L             usr/lib/x86_64-linux-gnu/libGLESv2.so.2.1.0
L             usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.2 -> libGLESv2_nvidia.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libGLESv2_nvidia.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libGLX.so -> libGLX.so.0
L             usr/lib/x86_64-linux-gnu/libGLX.so.0
L             usr/lib/x86_64-linux-gnu/libGLX_indirect.so.0 -> libGLX_nvidia.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.0 -> libGLX_nvidia.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libGLX_nvidia.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libGLdispatch.so.0
L             usr/lib/x86_64-linux-gnu/libOpenCL.so -> libOpenCL.so.1
L             usr/lib/x86_64-linux-gnu/libOpenCL.so.1 -> libOpenCL.so.1.0
L             usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0 -> libOpenCL.so.1.0.0
L             usr/lib/x86_64-linux-gnu/libOpenCL.so.1.0.0
L             usr/lib/x86_64-linux-gnu/libOpenGL.so -> libOpenGL.so.0
L             usr/lib/x86_64-linux-gnu/libOpenGL.so.0
L             usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
L             usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libcuda.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvcuvid.so -> libnvcuvid.so.1
L             usr/lib/x86_64-linux-gnu/libnvcuvid.so.1 -> libnvcuvid.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvcuvid.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-allocator.so -> libnvidia-allocator.so.1
L             usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.1 -> libnvidia-allocator.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-cfg.so -> libnvidia-cfg.so.1
L             usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.1 -> libnvidia-cfg.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1 -> libnvidia-egl-gbm.so.1.1.0
L             usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.0
L             usr/lib/x86_64-linux-gnu/libnvidia-egl-wayland.so.1 -> libnvidia-egl-wayland.so.1.1.9
L             usr/lib/x86_64-linux-gnu/libnvidia-egl-wayland.so.1.1.9
L             usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-encode.so -> libnvidia-encode.so.1
L             usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1 -> libnvidia-encode.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-encode.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-fbc.so -> libnvidia-fbc.so.1
L             usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.1 -> libnvidia-fbc.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-gtk2.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-gtk3.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-ml.so -> libnvidia-ml.so.1
L             usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 -> libnvidia-ml.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-ml.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.1 -> libnvidia-ngx.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so -> libnvidia-nvvm.so.4
L             usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.4 -> libnvidia-nvvm.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1 -> libnvidia-opencl.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so -> libnvidia-opticalflow.so.1
L             usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.1 -> libnvidia-opticalflow.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so -> libnvidia-ptxjitcompiler.so.1
L             usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1 -> libnvidia-ptxjitcompiler.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-tls.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-vulkan-producer.so -> libnvidia-vulkan-producer.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-vulkan-producer.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvidia-wayland-client.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvoptix.so.1 -> libnvoptix.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libnvoptix.so.515.65.01
L             usr/lib/x86_64-linux-gnu/libvdpau_nvidia.so -> vdpau/libvdpau_nvidia.so.515.65.01
L             usr/lib/x86_64-linux-gnu/gbm/
L             usr/lib/x86_64-linux-gnu/gbm/nvidia-drm_gbm.so -> /usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.1
L             usr/lib/x86_64-linux-gnu/nvidia/
L             usr/lib/x86_64-linux-gnu/nvidia/wine/
L             usr/lib/x86_64-linux-gnu/nvidia/wine/_nvngx.dll
L             usr/lib/x86_64-linux-gnu/nvidia/wine/nvngx.dll
L             usr/lib/x86_64-linux-gnu/vdpau/
L             usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.1 -> libvdpau_nvidia.so.515.65.01
L             usr/lib/x86_64-linux-gnu/vdpau/libvdpau_nvidia.so.515.65.01
L             usr/lib64/xorg/
L             usr/lib64/xorg/modules/
L             usr/lib64/xorg/modules/drivers/
L             usr/lib64/xorg/modules/drivers/nvidia_drv.so
L             usr/lib64/xorg/modules/extensions/
L             usr/lib64/xorg/modules/extensions/libglxserver_nvidia.so -> libglxserver_nvidia.so.515.65.01
L             usr/lib64/xorg/modules/extensions/libglxserver_nvidia.so.515.65.01
L             usr/share/applications/nvidia-settings.desktop
L             usr/share/doc/NVIDIA_GLX-1.0/
L             usr/share/doc/NVIDIA_GLX-1.0/LICENSE
L             usr/share/doc/NVIDIA_GLX-1.0/NVIDIA_Changelog
L             usr/share/doc/NVIDIA_GLX-1.0/README.txt
L             usr/share/doc/NVIDIA_GLX-1.0/nvidia-dbus.conf
L             usr/share/doc/NVIDIA_GLX-1.0/nvidia-settings.png
L             usr/share/doc/NVIDIA_GLX-1.0/html/
L             usr/share/doc/NVIDIA_GLX-1.0/html/acknowledgements.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/addressingcapabilities.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/addtlresources.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/appendices.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/audiosupport.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/commonproblems.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/configlaptop.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/configmultxscreens.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/configtwinview.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/depth30.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/displaydevicenames.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/dma_issues.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/dpi.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/dynamicboost.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/dynamicpowermanagement.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/editxconfig.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/egpu.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/faq.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/flippingubb.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/framelock.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/gbm.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/glxsupport.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/gpunames.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/gsp.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/i2c.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/index.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/installationandconfiguration.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/installdriver.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/installedcomponents.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/introduction.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/kernel_open.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/kms.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/knownissues.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/minimumrequirements.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/newusertips.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/ngx.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/nvidia-debugdump.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/nvidia-ml.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/nvidia-peermem.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/nvidia-persistenced.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/nvidia-smi.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/nvidiasettings.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/openglenvvariables.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/optimus.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/powermanagement.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/primerenderoffload.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/procinterface.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/profiles.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/programmingmodes.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/randr14.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/retpoline.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/selectdriver.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/sli.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/supportedchips.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/vdpausupport.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/wayland-issues.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/xcompositeextension.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/xconfigoptions.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/xineramaglx.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/xrandrextension.html
L             usr/share/doc/NVIDIA_GLX-1.0/html/xwayland.html
L             usr/share/doc/NVIDIA_GLX-1.0/samples/
L             usr/share/doc/NVIDIA_GLX-1.0/samples/nvidia-persistenced-init.tar.bz2
L             usr/share/doc/NVIDIA_GLX-1.0/supported-gpus/
L             usr/share/doc/NVIDIA_GLX-1.0/supported-gpus/LICENSE
L             usr/share/doc/NVIDIA_GLX-1.0/supported-gpus/supported-gpus.json
L             usr/share/egl/
L             usr/share/egl/egl_external_platform.d/
L             usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json
L             usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json
L             usr/share/glvnd/
L             usr/share/glvnd/egl_vendor.d/
L             usr/share/glvnd/egl_vendor.d/10_nvidia.json
L             usr/share/man/man1/nvidia-cuda-mps-control.1.gz
L             usr/share/man/man1/nvidia-installer.1.gz
L             usr/share/man/man1/nvidia-modprobe.1.gz
L             usr/share/man/man1/nvidia-persistenced.1.gz
L             usr/share/man/man1/nvidia-settings.1.gz
L             usr/share/man/man1/nvidia-smi.1.gz
L             usr/share/man/man1/nvidia-xconfig.1.gz
L             usr/share/nvidia/
L             usr/share/nvidia/nvidia-application-profiles-515.65.01-key-documentation
L             usr/share/nvidia/nvidia-application-profiles-515.65.01-rc
L             var/lib/nvidia/
L             var/lib/nvidia/dirs
L             var/lib/nvidia/log
X >f.sT...... etc/ld.so.cache
X >f.sT...... root/.bash_history
X >f.sT...... var/cache/ldconfig/aux-cache
X >f..T...... var/log/lastlog
X >f.sT...... var/log/nvidia-installer.log
X >f.sT...... var/log/wtmp
X >f..T...... var/log/journal/f0db7addd78847dfb4ed5576d9813374/system.journal
CompyENG commented 1 year ago

I'm not exactly sure what the difference is between now and what I was doing yesterday, but Plex transcoding is now working for me as well!

Used the latest version of the script, created a container, and installed docker and the nvidia container toolkit inside. Started Plex and it's HW transcoding now. Awesome!

Jip-Hop commented 1 year ago

Glad to hear it's working for you now!

Did you run into the same issue as @Ixian?

Would be great if you could try to run nvidia-container-cli list (on the host as well as in the jail) while your Plex container is running and see if you run into "initialization error: nvml error: driver not loaded".

Perhaps even try to make a few jails, all with GPU passthrough, to test if the GPU can be properly accessed simultaneously.

I just opened this issue to see if we're still missing something in our setup and ask how to solve the initialization error. Additional data points would be very helpful.

Talung commented 1 year ago

Tried the new script and had a crazy time trying to get Debian 11 to work; failed big time. Switched over to Ubuntu Jammy and had no issues. nvidia-smi worked right off, but obviously no Nvidia docker runtime.

This was my process, in case anybody wants to repeat it:

Install Docker

curl https://get.docker.com | sh \
  && sudo systemctl --now enable docker

test nvidia

docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi

Install nvidia-container-toolkit

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
  && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

apt update

apt install -y nvidia-container-toolkit

Configure daemon to recognise nvidia

nvidia-ctk runtime configure --runtime=docker
systemctl restart docker

test again

docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi