games-on-whales / wolf

Stream virtual desktops and games running in Docker
https://games-on-whales.github.io/wolf/stable/
MIT License

Integrate the NVIDIA container toolkit #52

Open ehfd opened 7 months ago

ehfd commented 7 months ago

This is an issue that has been spun off from the Discord channel.

@Murazaki : It might be good to find a better workflow for providing drivers to Wolf. On Debian, drivers are pretty old in the main stable repo, and updated ones can be found on CUDA drivers, but do not exactly match manual installation ones.

@ABeltramo : I guess I should go back to looking into the Nvidia Docker Toolkit for people that would like to use it. I agree, though, it's a bit of a pain point at the moment.

@Murazaki : Cuda drivers repo : https://developer.download.nvidia.com/compute/cuda/repos/

Linux manual installer : https://download.nvidia.com/XFree86/Linux-x86_64/

Right now, the latest in the CUDA packaged installs is 545.23.08, which doesn't exist as a manual installer. That breaks the Dockerfile and renders Wolf unusable. I wanted to make a Docker image for the Debian package install, but it uses apt-add-repository, which installs a bunch of supplementary stuff. Here it is for Debian Bookworm: https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=Debian&target_version=12&target_type=deb_network

More thorough installation steps here : https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

@juliosueiras : There is one problem though, the nvidia driver toolkit doesn't inject the driver into the container and still requires a driver installed in the container image itself.


Here, I'll start with the interventions I have made with NVIDIA over the last three years so that running Wayland inside the NVIDIA container toolkit does not require installing the NVIDIA drivers.

What the NVIDIA container toolkit does is pretty simple: it injects (1) kernel devices and (2) userspace libraries into a container. Together, (1) and (2) make up a subset of the driver.

(1) kernel devices: /dev/nvidiaN, /dev/nvidiactl, /dev/nvidia-modeset, /dev/nvidia-uvm, and /dev/nvidia-uvm-tools. In addition, /dev/dri/cardX and /dev/dri/renderDY, where N, X, and Y depend on the GPU the container toolkit provisions. The /dev/dri devices were added with https://github.com/NVIDIA/libnvidia-container/issues/118.
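
A quick way to verify which of these device nodes the toolkit actually injected is to list them from inside the container (a minimal check; the exact node names and numbers depend on your GPUs):

# Run inside the container: list the injected NVIDIA and DRM device nodes
ls -l /dev/nvidia* /dev/dri/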

(2) userspace libraries: OpenGL libraries including EGL: '/usr/lib/libGL.so.1', '/usr/lib/libEGL.so.1', '/usr/lib/libGLESv1_CM.so.525.78.01', '/usr/lib/libGLESv2.so.525.78.01', '/usr/lib/libEGL_nvidia.so.0', '/usr/lib/libOpenGL.so.0', '/usr/lib/libGLX.so.0', and '/usr/lib/libGLdispatch.so.0', '/usr/lib/libnvidia-tls.so.525.78.01'

Vulkan libraries: '/usr/lib/libGLX_nvidia.so.0' and the configuration '/etc/vulkan/icd.d/nvidia_icd.json'

EGLStreams-Wayland and GBM-Wayland libraries: '/usr/lib/libnvidia-egl-wayland.so.1' with the config '/usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json', and '/usr/lib/libnvidia-egl-gbm.so.1' with the config '/usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json'

NVENC libraries: /usr/lib/libnvidia-encode.so.525.78.01, which depends on /usr/lib/libnvcuvid.so.525.78.01, which depends on /usr/lib/x86_64-linux-gnu/libcuda.so.1

VDPAU libraries: /usr/lib/vdpau/libvdpau_nvidia.so.525.78.01
NVFBC libraries: /usr/lib/libnvidia-fbc.so.525.78.01
OPTIX libraries: /usr/lib/libnvoptix.so.1

Not very relevant but of note, perhaps for XWayland: NVIDIA X.Org driver: /usr/lib/xorg/modules/drivers/nvidia_drv.so, NVIDIA X.org GLX driver: /usr/lib/xorg/modules/extensions/libglxserver_nvidia.so.525.78.01

In many cases, things don't work because the configuration files below are absent inside the container. Without them, applications inside the container don't know which library to call (what each file does is self-explanatory):

The contents of /usr/share/glvnd/egl_vendor.d/10_nvidia.json:

{
    "file_format_version" : "1.0.0",
    "ICD" : {
        "library_path" : "libEGL_nvidia.so.0"
    }
}

The contents of /etc/vulkan/icd.d/nvidia_icd.json (note that api_version varies with the driver version):

{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "libGLX_nvidia.so.0",
        "api_version" : "1.3.205"
    }
}

The contents of /etc/OpenCL/vendors/nvidia.icd:

libnvidia-opencl.so.1

The contents of /usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json:

{
        "file_format_version" : "1.0.0",
        "ICD" : {
                "library_path" : "libnvidia-egl-gbm.so.1"
        }
}

The contents of /usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json:

{
    "file_format_version" : "1.0.0",
    "ICD" : {
        "library_path" : "libnvidia-egl-wayland.so.1"
    }
}

I'm pretty sure that now (it was different a few months ago) the newest NVIDIA container toolkit provisions all of the required libraries plus the json configurations for Wayland (not for X11, but you don't have to care). If only the json configurations are absent, it's trivial to manually add the above templates, as sketched below.
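
If they are missing, a small setup step along these lines can write them; this is only a sketch using the templates above, and the Vulkan api_version must be adjusted to match the installed driver:

# Sketch: write the glvnd EGL vendor config if the toolkit did not provide it
mkdir -p /usr/share/glvnd/egl_vendor.d
cat > /usr/share/glvnd/egl_vendor.d/10_nvidia.json <<'EOF'
{
    "file_format_version" : "1.0.0",
    "ICD" : {
        "library_path" : "libEGL_nvidia.so.0"
    }
}
EOF

# Sketch: write the Vulkan ICD config; replace 1.3.205 with the api_version matching your driver
mkdir -p /etc/vulkan/icd.d
cat > /etc/vulkan/icd.d/nvidia_icd.json <<'EOF'
{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "libGLX_nvidia.so.0",
        "api_version" : "1.3.205"
    }
}
EOF

The EGL external platform configs (10_nvidia_wayland.json and 15_nvidia_gbm.json) can be written the same way.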

About GStreamer: https://gitlab.freedesktop.org/gstreamer/gstreamer/-/issues/3108

Now, it is correct that NVENC does require CUDA. But that doesn't mean that it requires the whole CUDA Toolkit (which is separate from the CUDA drivers). The CUDA drivers are the following four libraries, installed with the display drivers and independent of the CUDA Toolkit: libcuda.so, libnvidia-ptxjitcompiler.so, libnvidia-nvvm.so, libcudadebugger.so

These versions go with the display drivers, and are all injected into the container by the NVIDIA container toolkit.
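
To verify that these were actually injected, a quick check from inside the container (a minimal sketch; it assumes the linker cache was refreshed, otherwise list the library directory directly):

# Inside the container: confirm the driver-side CUDA libraries are visible to the dynamic linker
ldconfig -p | grep -E 'libcuda\.so|libnvidia-ptxjitcompiler|libnvidia-nvvm'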

GStreamer 1.22 and earlier in nvcodec requires just two files from the CUDA Toolkit: libnvrtc.so and libnvrtc-builtins.so. These can be installed from the network repository like the current approach, or extracted from a PyPI package:

# Extract NVRTC dependency, https://developer.download.nvidia.com/compute/cuda/redist/cuda_nvrtc/LICENSE.txt
cd /tmp && \
  curl -fsSL -o nvidia_cuda_nvrtc_linux_x86_64.whl "https://developer.download.nvidia.com/compute/redist/nvidia-cuda-nvrtc/nvidia_cuda_nvrtc-11.0.221-cp36-cp36m-linux_x86_64.whl" && \
  unzip -joq -d ./nvrtc nvidia_cuda_nvrtc_linux_x86_64.whl && cd nvrtc && \
  chmod 755 libnvrtc* && \
  find . -maxdepth 1 -type f -name "*libnvrtc.so.*" -exec sh -c 'ln -snf $(basename {}) libnvrtc.so' \; && \
  mv -f libnvrtc* /opt/gstreamer/lib/x86_64-linux-gnu/ && \
  cd /tmp && rm -rf /tmp/*

One thing to note here is that libnvrtc.so is not minor-version compatible with CUDA. Thus, it will error on any display driver older than the driver that corresponds to its CUDA version. However, backwards compatibility always works (newer drivers can run older libnvrtc versions), so it is a good idea to use the oldest possible libnvrtc.so version.

Display driver - CUDA version
545 - 12.3
535 - 12.2
530 - 12.1
525 - 12.0
520 - 11.8
515 - 11.7
(and so on...)

https://docs.nvidia.com/deploy/cuda-compatibility/
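
To decide how old your libnvrtc can be, check the display driver version on the host (a quick sketch; either command works, depending on what is available):

# Query the host display driver version; the chosen NVRTC/CUDA version must not be newer than what this driver supports
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Alternative without nvidia-smi:
cat /sys/module/nvidia/version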

So, I have moderate to high confidence that if you guys try the newest NVIDIA container toolkit again, you won't need to install the drivers, assuming that you ensure the json files are present or written.

Environment variables that currently work for me:

RUN echo "/usr/local/nvidia/lib" >> /etc/ld.so.conf.d/nvidia.conf && \
    echo "/usr/local/nvidia/lib64" >> /etc/ld.so.conf.d/nvidia.conf
# Expose NVIDIA libraries and paths
ENV PATH /usr/local/nvidia/bin:${PATH}
ENV LD_LIBRARY_PATH /usr/lib/x86_64-linux-gnu:/usr/lib/i386-linux-gnu${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
# Make all NVIDIA GPUs visible by default
ENV NVIDIA_VISIBLE_DEVICES all
# All NVIDIA driver capabilities should preferably be used, check `NVIDIA_DRIVER_CAPABILITIES` inside the container if things do not work
ENV NVIDIA_DRIVER_CAPABILITIES all
# Disable VSYNC for NVIDIA GPUs
ENV __GL_SYNC_TO_VBLANK 0
ABeltramo commented 7 months ago

TLDR: as of the latest Nvidia Container Toolkit (1.14.3-1) unfortunately this is still not possible.

What's the issue?

With the latest versions I can run both Wolf and the Gstreamer pipeline just by running the container with --gpus=all. Unfortunately, for some apps this is still missing some important libraries.
To run, Gamescope seems to require the following additional libraries that aren't provided by the toolkit:

libnvidia-egl-gbm.so.1
libnvidia-egl-wayland.so.1

libnvidia-vulkan-producer.so

gbm/nvidia-drm_gbm.so

The last one seems to just be a symlink to libnvidia-allocator.so.1, which is already present, so that might be fine.

Now, this is running from an X11 host, and I can see that those additional libraries aren't present on my host system:

ls -la /usr/lib/x86_64-linux-gnu/libnv
libnvcuvid.so@                         libnvidia-container.so.1@              libnvidia-glvkspirv.so.530.30.02       libnvidia-opticalflow.so.1@
libnvcuvid.so.1@                       libnvidia-container.so.1.14.3*         libnvidia-ml.so.1@                     libnvidia-ptxjitcompiler.so.1@
libnvidia-allocator.so.1@              libnvidia-eglcore.so.530.30.02         libnvidia-ngx.so.1@                    libnvidia-rtcore.so.530.30.02
libnvidia-cfg.so.1@                    libnvidia-encode.so.1@                 libnvidia-ngx.so.530.30.02             libnvidia-tls.so.530.30.02
libnvidia-compiler.so.530.30.02        libnvidia-fbc.so.1@                    libnvidia-nvvm.so@                     libnvidia-wayland-client.so.530.30.02
libnvidia-container-go.so.1@           libnvidia-glcore.so.530.30.02          libnvidia-nvvm.so.4@                   libnvoptix.so.1@
libnvidia-container-go.so.1.14.3       libnvidia-glsi.so.530.30.02            libnvidia-opencl.so.1@

Can anyone confirm the output of

docker run --rm -it --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all  -e NVIDIA_DRIVER_CAPABILITIES=all ubuntu ls /usr/lib/x86_64-linux-gnu/libnvidia*
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.1     /usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.530.30.02     /usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.4
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.1           /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.530.30.02       /usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-compiler.so.530.30.02  /usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.530.30.02  /usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.530.30.02   /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1             /usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1        /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.1            /usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.530.30.02
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.1           /usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.530.30.02        /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.530.30.02

on an Nvidia Wayland host?

What can we do better?

I think we should keep manually downloading and linking the drivers like we are doing at the moment. We should probably add a proper check for a mismatch between the downloaded drivers and the host-installed drivers, either on startup of the containers (somewhere in the base-app) or on startup of Wolf.
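
A minimal sketch of such a check, assuming the image records the bundled driver version in a marker file (the path /usr/local/nvidia/.driver-version is hypothetical and only used for illustration):

#!/bin/bash
# Sketch: warn if the driver libraries bundled in the image do not match the host kernel driver
host_version="$(cat /sys/module/nvidia/version 2>/dev/null)"
image_version="$(cat /usr/local/nvidia/.driver-version 2>/dev/null)"  # hypothetical marker written when the drivers are downloaded
if [ -n "$host_version" ] && [ "$host_version" != "$image_version" ]; then
  echo "WARNING: host NVIDIA driver ($host_version) does not match the bundled driver ($image_version)" >&2
fi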

ehfd commented 5 months ago

@ABeltramo One comment I have here is that there isn't really a concept of an "Xorg host" and a "Wayland host". It depends on what the desktop environment and login manager use, and by default, all drivers bundle both sets of libraries.

We will discuss more in https://github.com/NVIDIA/libnvidia-container/issues/118.

ohayak commented 4 months ago

Hi, I've been working on a project using Unreal Engine Pixel Streaming to stream games at scale with Kubernetes. Packaging Docker images with the right nvidia docker drivers is a pain. I invite you to take a look at the work done by Adam. He used the nvidia/cuda image as a base to create a set of images with various configurations:

22.04-vulkan: Ubuntu 22.04 + OpenGL + Vulkan + PulseAudio Client + PulseAudio Server
22.04-cudagl11: Ubuntu 22.04 + OpenGL + Vulkan + CUDA 11.8.0 + PulseAudio Client + PulseAudio Server
22.04-cudagl12: Ubuntu 22.04 + OpenGL + Vulkan + CUDA 12.2.0 + PulseAudio Client + PulseAudio Server
22.04-vulkan-noaudio: Ubuntu 22.04 + OpenGL + Vulkan (no audio support)
22.04-cudagl11-noaudio: Ubuntu 22.04 + OpenGL + Vulkan + CUDA 11.8.0 (no audio support)
22.04-cudagl12-noaudio: Ubuntu 22.04 + OpenGL + Vulkan + CUDA 12.2.0 (no audio support)
22.04-vulkan-hostaudio: Ubuntu 22.04 + OpenGL + Vulkan + PulseAudio Client (uses host PulseAudio Server)
22.04-cudagl11-hostaudio: Ubuntu 22.04 + OpenGL + Vulkan + CUDA 11.8.0 + PulseAudio Client (uses host PulseAudio Server)
22.04-cudagl12-hostaudio: Ubuntu 22.04 + OpenGL + Vulkan + CUDA 12.2.0 + PulseAudio Client (uses host PulseAudio Server)
22.04-vulkan-x11: Ubuntu 22.04 + OpenGL + Vulkan + PulseAudio Client (uses host PulseAudio Server) + X11
22.04-cudagl11-x11: Ubuntu 22.04 + OpenGL + Vulkan + CUDA 11.8.0 + PulseAudio Client (uses host PulseAudio Server) + X11
22.04-cudagl12-x11: Ubuntu 22.04 + OpenGL + Vulkan + CUDA 12.2.0 + PulseAudio Client (uses host PulseAudio Server) + X11

I think you should check the docker files.

Murazaki commented 4 months ago

Hi, I've been working on a project using Unreal Engine Pixel Streaming to stream games at scale with Kubernetes. Packaging Docker images with the right nvidia docker drivers is a pain. I invite you to take a look at the work done by Adam. He used the nvidia/cuda image as a base to create a set of images with various configurations [...] I think you should check the docker files

Could you provide a link to that/those Dockerfiles, maybe?

Murazaki commented 4 months ago

Oh sorry, I actually know what they're talking about: it's adamrehn/ue4-runtime. I think I wrote an answer and forgot to send it ^^

https://github.com/adamrehn/ue4-runtime

ABeltramo commented 4 months ago

If those are the images that @ohayak was talking about, unfortunately, there's nothing there that can help us.
As I explained in a comment above, the nvidia toolkit is not mounting all libraries that are needed in order to spawn our custom Wayland compositor. There are only a few possible solutions that I can think of:

I'm very open to suggestions or alternative solutions!

ehfd commented 4 months ago

modprobe -r nvidia_drm ; modprobe nvidia_drm modeset=1

Our experience was related to the above kernel module. Also, we reinstalled the OS and installed everything from scratch, and then things started to work again. One more possibility is the lack of DKMS (required to build the relevant NVIDIA kernel modules) or a kernel version upgrade without rebuilding the NVIDIA modules.
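
For reference, enabling nvidia-drm modesetting persistently is usually done with a modprobe config rather than reloading the module by hand (a sketch; exact steps vary by distro, and the initramfs may need regenerating afterwards):

# Enable DRM KMS for the NVIDIA driver persistently
echo "options nvidia-drm modeset=1" | sudo tee /etc/modprobe.d/nvidia-drm-modeset.conf
# Verify after a reboot or module reload (should print Y)
cat /sys/module/nvidia_drm/parameters/modeset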

ehfd commented 3 months ago

The core issue is likely whether the EGL Wayland library is installed or not, not the container toolkit itself.

This is available if you use the .run file. https://download.nvidia.com/XFree86/Linux-x86_64/550.67/README/installedcomponents.html

I don't think the Debian/Ubuntu PPA repositories install the Wayland components automatically. If that's the case, the following components need to be installed; this is sufficient and contains all the missing Wayland files (an install example follows the list).

libnvidia-egl-gbm1
libnvidia-egl-wayland1
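
For example, on Debian/Ubuntu with the relevant repositories enabled, these should be installable as packages (a sketch; package availability depends on the distro and driver packaging):

# Install the NVIDIA EGL external platform libraries for Wayland and GBM
sudo apt-get install -y libnvidia-egl-wayland1 libnvidia-egl-gbm1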

But the overall solution is: install the .run file.

https://community.kde.org/Plasma/Wayland/Nvidia

tux-rampage commented 3 months ago

Hi,

I am currently trying to get this flying with CRI-O and the nvidia-ctk by using the runtime and CDI config. AFAIR this can be used in Docker as well. So far, all configs and libs are injected/mounted into the container as far as I can see. I can double-check for the libs reported here later.

Currently I'm facing the vblank resource unavailable issue. (Driver version 550.something - cannot look it up atm)

tux-rampage commented 3 months ago

Here are some short steps for what I've done so far (Nvidia Driver version 550.54.14):

nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
nvidia-ctk runtime configure --runtime=crio

For docker I guess it's enough to use --runtime=docker

I used the nvidia runtime when starting the container(s) and set the following env vars:

NVIDIA_DRIVER_CAPABILITIES=all
NVIDIA_VISIBLE_DEVICES="nvidia.com/gpu=all"

As documented, this enables the CDI integration which will mount the host libs and binaries.
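
For Docker, the equivalent invocation would look roughly like this (a sketch based on the env vars above; the ubuntu image is just a placeholder):

# Sketch: run a container with the nvidia runtime and CDI-style device selection
docker run --rm -it --runtime=nvidia \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -e NVIDIA_VISIBLE_DEVICES="nvidia.com/gpu=all" \
  ubuntu ls /usr/lib/x86_64-linux-gnu/libnvidia*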

What is working so far:

What is not working:

tux-rampage commented 2 months ago

TLDR: as of the latest Nvidia Container Toolkit (1.14.3-1) unfortunately this is still not possible. [...] Can anyone confirm the output of docker run --rm -it --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all -e NVIDIA_DRIVER_CAPABILITIES=all ubuntu ls /usr/lib/x86_64-linux-gnu/libnvidia* on an Nvidia Wayland host? [...]

ls -1 /usr/lib/x86_64-linux-gnu/libnvidia*
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-allocator.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-cfg.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1.1.1
/usr/lib/x86_64-linux-gnu/libnvidia-eglcore.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-encode.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-fbc.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-glcore.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-glvkspirv.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-gpucomp.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ngx.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.4
/usr/lib/x86_64-linux-gnu/libnvidia-nvvm.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-opencl.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-opticalflow.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-pkcs11-openssl3.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.1
/usr/lib/x86_64-linux-gnu/libnvidia-ptxjitcompiler.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-rtcore.so.550.54.14
/usr/lib/x86_64-linux-gnu/libnvidia-tls.so.550.54.14

It seems that libnvidia-egl-wayland.so.1 and libnvidia-vulkan-producer.so are missing. Driver version is 550.54.14.

tux-rampage commented 2 months ago

libnvidia-egl-wayland seems to be present in the package libnvidia-egl-wayland1. libnvidia-vulkan-producer.so seems to have been dropped from the driver. I need to confirm by building the driver volume for the current release.

Edit: Is it possible that libnvidia-egl-wayland1 is not part of Nvidia's driver but an OSS component?

Edit 2: Yes, libnvidia-vulkan-producer was removed recently: https://www.nvidia.com/Download/driverResults.aspx/214102/en-us/

Removed libnvidia-vulkan-producer.so from the driver package. This helper library is no longer needed by the Wayland WSI.

ehfd commented 2 months ago

Those libraries are for the EGLStreams backend. I believe compositors have now stopped supporting them.

https://github.com/NVIDIA/egl-wayland

tux-rampage commented 2 months ago

Maybe. The latest Nvidia driver comes with a GBM backend. Maybe that's something useful. I'll give GBM_BACKEND=nvidia-gbm a try. Maybe this will help: https://download.nvidia.com/XFree86/Linux-x86_64/510.39.01/README/gbm.html

From my attempt yesterday evening, glxgears is running without any error messages so far, but the output in Moonlight stays black.

ehfd commented 1 month ago

I think NVIDIA Container Toolkit 1.15.0 (released not long ago) fixes most of the problems.

Please check it out.

I am trying to fix the remaining issues with https://github.com/NVIDIA/nvidia-container-toolkit/pull/490. Feedback is welcome.

I've written about the situation in more detail in: https://github.com/NVIDIA/nvidia-container-toolkit/pull/490#issuecomment-2104836490

Within the scope of Wolf, the libnvidia-egl-wayland1 APT package installs the EGLStreams interface (if it can be used instead of GBM), and libnvidia-egl-gbm is installed with the Graphics Drivers PPA. Both are installed by the .run installer, and the above PR will also inject libnvidia-egl-wayland.

tux-rampage commented 1 month ago

I've finally been successful in running Steam and Cyberpunk with Wolf using the nvidia container toolkit instead of the drivers image. As @ehfd mentioned, the symlink nvidia-drm_gbm.so is missing, so I had to create it manually before running gamescope/steam:

mkdir -p /usr/lib/x86_64-linux-gnu/gbm;
ln -sv ../libnvidia-allocator.so.1 /usr/lib/x86_64-linux-gnu/gbm/nvidia-drm_gbm.so;

After that, launching Steam and the game was successful. This all works without the use of libnvidia-egl-wayland.

ABeltramo commented 1 month ago

That sounds really good, thanks for reporting back!
We could easily add that to the images; my only concern is that this only works with a specific version of the toolkit. We should probably add a check on startup for the required libraries and print a proper error message if any are missing.
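
A minimal sketch of such a startup check (the library names are taken from this thread; the paths and message are only illustrative):

#!/bin/bash
# Sketch: verify the NVIDIA libraries needed by the Wayland compositor/Gamescope are present before starting
libdir=/usr/lib/x86_64-linux-gnu
for lib in libnvidia-egl-gbm.so.1 libnvidia-egl-wayland.so.1 gbm/nvidia-drm_gbm.so; do
  if [ ! -e "$libdir/$lib" ]; then
    echo "ERROR: $libdir/$lib not found; check the NVIDIA container toolkit version or mount/link it manually" >&2
  fi
done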

ehfd commented 1 month ago

If someone has knowledge of Go, could they contribute fixes for the unsolved aspects of the PR for https://github.com/NVIDIA/nvidia-container-toolkit/pull/490?

I will give write access to https://github.com/ehfd/nvidia-container-toolkit/tree/main if they ask, in order to keep it in one PR.

tux-rampage commented 4 weeks ago

If someone has knowledge of Go, could they contribute fixes for the unsolved aspects of the PR for NVIDIA/nvidia-container-toolkit#490?

I will give write access to https://github.com/ehfd/nvidia-container-toolkit/tree/main if they ask, in order to keep it in one PR.

I can take a look at it later

tux-rampage commented 4 weeks ago

@ehfd Thanks for trusting me with access to your fork. I have addressed the pending issues on the code side and requested some feedback.

ehfd commented 4 weeks ago

Thanks for trusting me with access to your fork. I have addressed the pending issues on the code side and requested some feedback.

No sweat. You know Go better than me, and it seems like you did a great job with it.

kayakyakr commented 6 days ago

Congrats on getting this merged! This is going to substantially simplify getting the drivers up and running

ABeltramo commented 4 days ago

I've tried upgrading Gstreamer to 1.24.5; unfortunately, it's now failing to use Cuda with:

0:00:00.102290171   172 0x560d70ab1190 ERROR              cudanvrtc gstcudanvrtc.cpp:165:gst_cuda_nvrtc_load_library_once: Failed to load 'nvrtcGetCUBINSize', 'nvrtcGetCUBINSize': /usr/local/nvidia/lib/libnvrtc.so: undefined symbol: nvrtcGetCUBINSize

It looks like nvrtcGetCUBINSize was recently added, so I guess there's a mismatch between libnvrtc.so and what Gstreamer expects. It seems that this was added in this commit, which references @ehfd's issue.
I could successfully run this Gstreamer version on my host, which has Cuda 12 installed. I guess this would break compatibility with older cards, so I'm going to revert Gstreamer in Wolf for now.

ehfd commented 4 days ago

@ABeltramo This should not be an issue as long as the NVRTC CUDA version is kept at around 11.3; yes, 11.0 will not work.

ABeltramo commented 4 days ago

Thanks for the very quick reply! What's the compatibility matrix for NVRTC? Would upgrading to 11.3 still work for older Cuda installations?

ehfd commented 4 days ago

Mostly the internal ABI of NVIDIA. CUBIN was designed so that there can be forward compatibility for the NVRTC files, but since CUBIN support didn't exist in older versions, there's a problem here.

You can probably fix the issue yourself in GStreamer and then backport it to GStreamer 1.24.6 with some simple error handling in the C code. An #ifdef can probably work.

ABeltramo commented 4 days ago

Thanks, I've got enough on my plate already. This is definitely lower priority compared to the rest; looks like we are going to stay on 1.22.7 for a bit longer.