NVIDIA / libnvidia-container

NVIDIA container runtime library
Apache License 2.0

Automatically provisioning X11 and Wayland devices of GPU inside container? #118

Open ehfd opened 3 years ago

ehfd commented 3 years ago

Redirected from https://github.com/NVIDIA/k8s-device-plugin/issues/206 to a more suitable repository.

1. Issue or feature description

In Docker and Kubernetes, people have had to perform manual host setup to provision the X server via a hostPath mount of /tmp/.X11-unix. This is quite tedious for sysadmins and also a security risk, since users can spoof the host X server.
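
The manual setup being criticized looks roughly like this (a sketch; the image tag comes from this thread, the socket path and DISPLAY value from the host; `glxinfo` is just a placeholder workload). The command is echoed rather than executed so it can be inspected without a running X server or docker daemon:

```shell
# Manual host-path sharing of the X11 socket: the tedious, insecure approach
# described above. Assumes an X server is already running on the host.
x11_share_cmd() {
  echo docker run --gpus all \
    -v /tmp/.X11-unix:/tmp/.X11-unix:ro \
    -e DISPLAY="${DISPLAY:-:0}" \
    nvidia/cudagl:11.0-devel-ubuntu20.04 glxinfo
}
x11_share_cmd
```

Every container sharing the socket this way can talk to the host X server, which is exactly the spoofing concern raised above.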

To mitigate this, there have been attempts (https://github.com/ehfd/docker-nvidia-glx-desktop, based on https://github.com/ryought/glx-docker-headless-gpu) to run an X server inside the container and use GLX after the GPU has been provisioned by libnvidia-container.

An alternative comes from the developers of VirtualGL (widely used in HPC to enable GPU-based rendering in VNC virtual display environments), who developed a feature that uses the EGL API to enable 3D OpenGL rendering for applications such as Blender, MATLAB, and Unity, previously possible only with GLX and thus an X server. As you know well, nvidia-docker does not support GLX but introduced the EGL API just under two years ago. See the EGL configuration section of https://github.com/VirtualGL/virtualgl/issues/113#issuecomment-693127236

EGL is also required to start a Wayland compositor inside a container with the EGLStreams specification on NVIDIA GPUs, which is the way forward now that X11 development has stopped.

These use cases require access to the /dev/dri/cardX device corresponding to each GPU provisioned by libnvidia-container. However, libnvidia-container does not appear to provision these devices automatically. Is this possible, and if so, how can it be configured?

2. Steps to reproduce the issue

Provision one GPU inside container nvidia/cudagl:11.0-devel-ubuntu20.04 or nvidia/opengl:1.2-glvnd-devel-ubuntu20.04 in Docker CE 19.03 (or using one nvidia.com/gpu: 1 with k8s-device-plugin v0.7.0 with default configurations in Kubernetes v1.18.6).

Do: ls /dev

Result: inside the container you see /dev/nvidiaX, /dev/nvidia-modeset, /dev/nvidia-uvm, and /dev/nvidia-uvm-tools; however, the directory /dev/dri does not exist. Wayland compositors are unlikely to start inside a container without DRM devices, and VirtualGL does not work through any device other than /dev/dri/cardX either.

3. Information to attach (optional if deemed irrelevant)

Other issues and repositories: Example of VirtualGL EGL configuration that requires /dev/dri/cardX: https://github.com/ehfd/docker-nvidia-egl-desktop

Implementation of an unprivileged remote desktop bundling an X server with many hacks: https://github.com/ehfd/docker-nvidia-glx-desktop

klueska commented 3 years ago

I have added this feature request to our backlog. At present we have a big backlog, so it's unclear exactly when we will be able to look at this in detail.

That said, it feels like it could be added as a new NVIDIA_DRIVER_CAPABILITY that tries to look for these devices if they exist and inject them. You would set this capability either in the container image or the command line via an environment variable (which would work in the k8s context as well).
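
The suggestion above can be sketched as follows. Note this is an illustration, not the implemented behavior: at the time of this comment the `display` capability was only proposed, and the image tag is reused from the issue description. The command is echoed so the sketch runs without a docker daemon:

```shell
# Opting in via an environment variable works identically from the docker CLI
# and from a Kubernetes pod's env section, as klueska notes above.
caps_cmd() {
  echo docker run --rm --gpus all \
    -e NVIDIA_DRIVER_CAPABILITIES="$1" \
    nvidia/opengl:1.2-glvnd-devel-ubuntu20.04 ls /dev/dri
}
caps_cmd "graphics,display"
```

Setting the variable in the container image's ENV would have the same effect without per-invocation flags.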

ehfd commented 3 years ago

As the thumbs-up reactions show, this feature is in high demand, so it would be great to see it implemented soon. Thank you.

xkszltl commented 3 years ago

If you get a chance to do that, maybe add /dev/gdrdrv for nvidia gdrcopy as well.

ehfd commented 3 years ago

https://github.com/mviereck/x11docker/wiki/Hardware-acceleration#share-nvidia-device-files-with-container

To use a custom base image, share all files matching /dev/nvidia*, /dev/nvhost* and /dev/nvmap with docker run option --device. Share /dev/dri and /dev/vga_arbiter, too. Add container user to groups video and render with --group-add video --group-add render.
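
The wiki advice above can be sketched as a helper that expands the device globs on the host and turns each match into a `--device` flag (`myimage` is a placeholder image name; globs that match nothing are skipped):

```shell
# Build --device flags for the NVIDIA/DRI device nodes named in the x11docker
# wiki, plus the video/render group additions. Run on the host.
nvidia_device_args() {
  for dev in /dev/nvidia* /dev/nvhost* /dev/nvmap /dev/dri/* /dev/vga_arbiter; do
    [ -e "$dev" ] && printf '%s %s ' --device "$dev"
  done
  printf '%s %s %s %s\n' --group-add video --group-add render
}
echo docker run $(nvidia_device_args) myimage
```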

In addition to the initial feature request, these are all the devices that need to be provisioned automatically for NVIDIA to officially support displays (e.g. X11, Wayland) in Docker. If the container toolkit can provision these devices automatically, the nvidia/opengl container (nvidia-docker) can properly support the NVIDIA version of XWayland (currently being upstreamed by NVIDIA developers) and thus support displays.

Many people are waiting for display support in Docker and Kubernetes, especially because NVIDIA is set to support XWayland in the near future. Please implement this feature to streamline this.

ehfd commented 3 years ago

Any updates? @klueska I was able to start an unprivileged X server inside an OCI Docker container with nvidia-docker in https://github.com/ehfd/docker-nvidia-glx-desktop, but thinking ahead to Wayland support (now that the 470 driver is out), we likely need this.

ehfd commented 2 years ago

Please use https://gitlab.com/arm-research/smarter/smarter-device-manager for /dev/dri/card and /dev/dri/render if you stumble upon this issue.

ehfd commented 1 year ago

EGL does not require /dev/dri for NVIDIA devices. VirtualGL has merged support for GLX over EGL without such devices.

ehfd commented 1 year ago

Still likely needed for Wayland with GBM.

elezar commented 1 year ago

Thanks @ehfd. We are working on improving the injection of these devices in an upcoming release. Note that the current plan is to do so using the nvidia-container-runtime at an OCI runtime specification level instead of relying on the NVIDIA Container CLI.

Do you have sample containers / test cases that you would be able to provide to ensure that we meet the requirements?

ehfd commented 1 year ago

@elezar https://github.com/ehfd/docker-nvidia-glx-desktop/blob/main/entrypoint.sh https://github.com/ehfd/docker-nvidia-egl-desktop/blob/main/entrypoint.sh

These two repositories involve a series of hacks to make NVIDIA GPUs work reliably, unprivileged, inside a container with a properly accelerated GUI.

docker-nvidia-glx-desktop must install the userspace driver components at startup, mostly following your examples but after reading the driver version from /proc/driver/nvidia/version, because the libraries are not injected into the container.

In the current state, the same userspace driver installation must be done for Wayland, also by reading /proc/driver/nvidia/version. This is undesirable.

Also, in docker-nvidia-egl-desktop, where the userspace drivers are not installed at startup, an annoying situation arises: Vulkan requires the display capability to be included in NVIDIA_DRIVER_CAPABILITIES, because nvidia_icd.json requires libGLX_nvidia.so.0 (and probably other libraries as well), even when not using Xorg with the NVIDIA driver.

Vulkan should work with only the graphics capability, as intended, but it currently requires display as well. https://github.com/NVIDIA/nvidia-container-toolkit/issues/140 Thankfully, it does work without major modifications to libnvidia-container.
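
The dependency described above is visible in the Vulkan ICD manifest itself: the loader reads the JSON manifest, which points at libGLX_nvidia.so.0. A small helper (a sketch; the manifest path varies by distribution) can extract the `library_path` field without requiring jq:

```shell
# Print the driver library the Vulkan loader will dlopen, according to an
# ICD manifest file passed as $1.
icd_library() {
  sed -n 's/.*"library_path"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1"
}
# Typical location inside the container; prints nothing if the file is absent.
icd_library /etc/vulkan/icd.d/nvidia_icd.json 2>/dev/null || true
```

For the NVIDIA manifest this prints libGLX_nvidia.so.0, which is why the library must be present for Vulkan even in headless use.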

Also, the display capability currently does not enable starting an Xorg server with the NVIDIA driver, because the required libraries are not injected into the container. Hence the hacks applied by these two containers are required.

Please also consider injecting the libraries necessary for NVFBC with the video capability, even if the SDK must be installed inside the container.

We really hope that NVIDIA_DRIVER_CAPABILITIES starts working properly so that the hacks my containers apply are no longer needed. This can all likely be done at the OCI runtime specification level.

Note that we currently use https://gitlab.com/arm-research/smarter/smarter-device-manager for provisioning /dev/dri devices, but there is no way to expose only the devices for the GPU allocated to the container.

Thanks a lot!

elezar commented 1 year ago

Thanks for all the information. I will comb through it while working on the feature. Hopefully we can improve things significantly!

Zubnix commented 1 year ago

Hi @elezar @ehfd,

I'm writing a remote Wayland compositor and am currently busy integrating it with k8s. I can independently confirm everything @ehfd has stated so far, as I've hit all of these issues in the last couple of weeks. Being able to access /dev/dri/renderDX and /dev/dri/cardX while limiting, and preferably eliminating, a container's startup actions and driver dependencies is an absolute must.

@elezar I'm happy to assist and answer any questions you might have to help move this forward!

elezar commented 1 year ago

Thanks @Zubnix. We have started work on injecting the /dev/dri/cardX devices as part of https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/219

I think that, in all cases, having a list of the specific devices, libraries, and environment variables required in a container for things to work as expected would be quite useful. We will be sure to update this issue as soon as there is something out for testing and early feedback.

ehfd commented 1 year ago

@Zubnix Hi! I've been interested in Greenfield for a long time. Nice to meet you here! I also agree that eliminating a container's driver dependencies is very important. Thanks for your feedback! By the way, do you have any interest in using WebTransport over WebSockets in your project?

Zubnix commented 1 year ago

Hi @ehfd, I've written my answer here so as not to hijack this thread :)

ehfd commented 1 year ago

@elezar Hi! I saw that the /dev/dri component got merged. https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/commit/f7021d84b555b00857640681136b9b9b08fd067f

I believe that should make Wayland fundamentally work in Kubernetes.

Would it be possible to pass the below library components for enhanced X11/Wayland support? https://download.nvidia.com/XFree86/Linux-x86_64/525.78.01/README/installedcomponents.html

elezar commented 1 year ago

Thanks @ehfd I will have a look at the link you suggested.

ehfd commented 1 year ago

@elezar Specifically, I feel the below are necessary for a full X11/Wayland + OpenGL EGL/GLX + Vulkan stack without downloading the driver inside the container.

Anything listed with AND should be injected in either case. And as you know well, the generic symlinks to the .so.525.78.01 files should be provided as well.

I also believe that, for practical use, everything in graphics should be injected whenever display is specified without graphics; otherwise I suspect it won't work.

Configuration .json files should be added to the container like the base images do now.

(should be injected to display)
'/usr/lib/xorg/modules/drivers/nvidia_drv.so'
'/usr/lib/xorg/modules/extensions/libglxserver_nvidia.so.525.78.01'
'/usr/bin/nvidia-xconfig'
'/usr/bin/nvidia-settings' + /usr/lib/libnvidia-gtk2.so.525.78.01 and on some platforms /usr/lib/libnvidia-gtk3.so.525.78.01

(should be injected to graphics AND display, probably already injected)
'/usr/lib/libGL.so.1', '/usr/lib/libEGL.so.1', '/usr/lib/libGLESv1_CM.so.525.78.01', '/usr/lib/libGLESv2.so.525.78.01', '/usr/lib/libEGL_nvidia.so.0'

(should be injected to graphics AND display)
'/usr/lib/libOpenGL.so.0', '/usr/lib/libGLX.so.0', and '/usr/lib/libGLdispatch.so.0', '/usr/lib/libnvidia-tls.so.525.78.01'

(currently injected to display only, must be injected for graphics too in order to use Vulkan)
'/usr/lib/libGLX_nvidia.so.0' and the configuration /etc/vulkan/icd.d/nvidia_icd.json

(should be injected to display AND egl, else eglinfo segfaults)
'/usr/lib/libnvidia-egl-wayland.so.1' and the config '/usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json'
'/usr/lib/libnvidia-egl-gbm.so.1' and the config '/usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json'

(should be injected to video AND display)
/usr/lib/libnvidia-fbc.so.525.78.01

(should be injected to graphics AND video)
/usr/lib/libnvoptix.so.1

(should be injected to compute as there is a CUDA and CUVID dependency)
/usr/lib/libnvidia-opticalflow.so.525.78.01

(should be injected to video, not currently injected)
/usr/lib/vdpau/libvdpau_nvidia.so.525.78.01

(should be injected to video)
/usr/lib/libnvidia-encode.so.525.78.01

(should be injected to both compute AND video)
/usr/lib/libnvcuvid.so.525.78.01

(should be injected to compute if not already there)
Two OpenCL libraries (/usr/lib/libOpenCL.so.1.0.0, /usr/lib/libnvidia-opencl.so.525.78.01); the former is a vendor-independent Installable Client Driver (ICD) loader, and the latter is the NVIDIA Vendor ICD. A config file /etc/OpenCL/vendors/nvidia.icd is also installed, to advertise the NVIDIA Vendor ICD to the ICD Loader.

(should be injected to utility)
/usr/lib/libnvidia-ml.so.525.78.01

(should be injected to ngx)
/usr/lib/libnvidia-ngx.so.525.78.01
/usr/bin/nvidia-ngx-updater
/usr/lib/nvidia/wine/nvngx.dll
/usr/lib/nvidia/wine/_nvngx.dll

Various libraries that are used internally by other driver components. These include /usr/lib/libnvidia-cfg.so.525.78.01, /usr/lib/libnvidia-compiler.so.525.78.01, /usr/lib/libnvidia-eglcore.so.525.78.01, /usr/lib/libnvidia-glcore.so.525.78.01, /usr/lib/libnvidia-glsi.so.525.78.01, /usr/lib/libnvidia-glvkspirv.so.525.78.01, /usr/lib/libnvidia-rtcore.so.525.78.01, and /usr/lib/libnvidia-allocator.so.525.78.01.
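
A quick audit (a sketch; the library names below are a sample from the list above) can be run inside the container to see which libraries were actually injected, relying only on `ldconfig`:

```shell
# Report which of the given libraries are visible to the dynamic linker
# inside the container. Run after the container starts.
check_libs() {
  cache="$(ldconfig -p 2>/dev/null || true)"
  for lib in "$@"; do
    case "$cache" in
      *"$lib"*) echo "present: $lib" ;;
      *)        echo "MISSING: $lib" ;;
    esac
  done
}
check_libs libGLX_nvidia.so.0 libnvidia-egl-wayland.so.1 libnvidia-egl-gbm.so.1 \
           libvdpau_nvidia.so libnvidia-fbc.so.1
```
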

ehfd commented 9 months ago

As of libnvidia-container 1.14.3-1:

/usr/lib/xorg/modules/drivers/nvidia_drv.so
/usr/lib/xorg/modules/extensions/libglxserver_nvidia.so.525.78.01

libnvidia-egl-gbm.so.1
libnvidia-egl-wayland.so.1

libnvidia-vulkan-producer.so

gbm/nvidia-drm_gbm.so

These important libraries are still not provisioned.

@elezar

ehfd commented 8 months ago

@klueska @elezar A reminder for you guys... The below are the only libraries left before I can finally close this three-year-old issue with both X11 and Wayland working inside a container.

This is likely about 30 minutes of work for you guys.

Things mostly work now, but only after downloading .run userspace driver library files inside the container.

/usr/lib/xorg/modules/drivers/nvidia_drv.so
/usr/lib/xorg/modules/extensions/libglxserver_nvidia.so.525.78.01

libnvidia-egl-gbm.so.1
libnvidia-egl-wayland.so.1

libnvidia-vulkan-producer.so

gbm/nvidia-drm_gbm.so

If you can't include some of these in the container toolkit, please tell us why.

elezar commented 8 months ago

@ehfd thanks for the reminder here.

Some of the libraries are already handled by the NVIDIA Container Toolkit -- with the caveat that their detection may be distribution-dependent at the moment. The main thing to change here is where we search for the libraries. There is no technical reason why we haven't done this; the delay is largely caused by resource constraints.

Note that, in theory, if you mount these missing libraries from the host, it should not be necessary to use the .run file to install the userspace libraries in the container.
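
The bind-mount workaround suggested here can be sketched as below. The host paths are assumptions (a typical Ubuntu x86_64 layout); locate yours with `ldconfig -p` on the host first. `myimage` is a placeholder:

```shell
# Turn a list of host library paths into read-only docker bind-mount flags.
lib_mount_args() {
  for lib in "$@"; do
    printf '%s %s:%s:ro ' -v "$lib" "$lib"
  done
}
echo docker run --gpus all \
  $(lib_mount_args /usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1 \
                   /usr/lib/x86_64-linux-gnu/libnvidia-egl-wayland.so.1) \
  myimage
```

The mounted library versions must match the kernel driver on the host, which is the same constraint the toolkit's own injection satisfies.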

If you have capacity to contribute the changes, I would be happy to review these. Note that I would recommend making these against the NVIDIA Container Toolkit where we already inject some of the libraries that you mentioned.

ehfd commented 8 months ago

Thank you @elezar I will assess this within the NVIDIA GitLab repositories and possibly contribute code to inject these packages. Thanks!

ehfd commented 8 months ago

https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/blob/main/src/nvc_info.c https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/blob/main/internal/discover/graphics.go

These appear to be the responsible code paths.

CC @elezar @Zubnix @ABeltramo

ehfd commented 8 months ago

@elezar

The core issue seems to be that https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/blob/main/internal/discover/graphics.go is somehow not invoked with Docker. Perhaps this is because the Docker runner is not based on CDI?

elezar commented 8 months ago

To trigger the logic as linked you need to:

  1. Use the nvidia runtime
  2. Ensure that NVIDIA_DRIVER_CAPABILITIES includes graphics or display.

To configure the nvidia runtime for docker follow the steps described here.

Then we can run a container:

docker run --rm -ti --runtime=nvidia --gpus=all -e NVIDIA_DRIVER_CAPABILITIES=all ubuntu 

This does not require CDI support explicitly.
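
Both conditions can be checked in one go with a sketch like the following (the runtime is assumed to have been configured, e.g. with `sudo nvidia-ctk runtime configure --runtime=docker` followed by a docker restart; the command is echoed so it can be inspected without a docker daemon):

```shell
# Use the nvidia runtime, request graphics/display, then confirm that the DRI
# nodes and libGLX_nvidia show up inside the container.
verify_cmd() {
  echo docker run --rm --runtime=nvidia --gpus=all \
    -e NVIDIA_DRIVER_CAPABILITIES=graphics,display \
    ubuntu sh -c "'ls /dev/dri; ldconfig -p | grep libGLX_nvidia'"
}
verify_cmd
```
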

ehfd commented 5 months ago

Most of the above issues were probably because the PPA for graphics drivers did not install:

libnvidia-egl-gbm1
libnvidia-egl-wayland1
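
On Ubuntu/Debian hosts, whether these two packages made it onto the system can be checked with a small sketch (dpkg availability is assumed; package names are as given above and depend on the driver series):

```shell
# Report whether each named package is installed, and hint at the fix if not.
pkg_check() {
  for pkg in "$@"; do
    if dpkg -s "$pkg" >/dev/null 2>&1; then
      echo "installed: $pkg"
    else
      echo "missing:   $pkg (try: sudo apt-get install $pkg)"
    fi
  done
}
pkg_check libnvidia-egl-gbm1 libnvidia-egl-wayland1
```
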

ehfd commented 4 months ago

@elezar I have a contribution.

https://github.com/NVIDIA/nvidia-container-toolkit/pull/490

ehfd commented 4 months ago

https://github.com/NVIDIA/nvidia-container-toolkit/pull/490#issuecomment-2104836490

More detailed situation and requirements to close this issue conclusively.

ehfd commented 2 months ago

PR to fix Wayland: https://github.com/NVIDIA/nvidia-container-toolkit/pull/548 - Merged.

New issue for X11: https://github.com/NVIDIA/nvidia-container-toolkit/issues/563