ehfd opened this issue 3 years ago
I have added this feature request to our backlog. At present we have a big backlog, so it's unclear exactly when we will be able to look at this in detail.
That said, it feels like it could be added as a new NVIDIA_DRIVER_CAPABILITY that looks for these devices if they exist and injects them. You would set this capability either in the container image or on the command line via an environment variable (which would work in the k8s context as well).
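For illustration only, a minimal sketch of how a capability is requested via that environment variable today (the capability values shown are the existing ones; the proposed capability does not exist yet):
# Sketch: setting driver capabilities via an environment variable on the CLI.
# A hypothetical future capability for display devices would be added to this list.
docker run --rm --gpus all \
  -e NVIDIA_DRIVER_CAPABILITIES=compute,utility,graphics,display \
  nvidia/opengl:1.2-glvnd-devel-ubuntu20.04 ls /dev
# In Kubernetes the same variable can be set in the pod spec's container env.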
As you can see from the thumbs up, this feature is in quite high demand, so it would be great if it could be implemented quickly. Thank you.
If you get a chance to do that, maybe add /dev/gdrdrv for NVIDIA GDRCopy as well.
To use a custom base image, share all files matching /dev/nvidia*, /dev/nvhost*, and /dev/nvmap with the docker run option --device. Share /dev/dri and /dev/vga_arbiter, too. Add the container user to the video and render groups with --group-add video --group-add render.
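A rough sketch of what that manual setup amounts to (device paths and the image name are illustrative only; the exact /dev/nvidia* nodes differ per machine, and on Tegra /dev/nvhost* and /dev/nvmap would be added as well):
# Manual device sharing with a custom base image (paths illustrative only):
docker run --rm -it \
  --device /dev/nvidia0 --device /dev/nvidiactl --device /dev/nvidia-modeset \
  --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools \
  --device /dev/dri --device /dev/vga_arbiter \
  --group-add video --group-add render \
  my-custom-base-image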
In addition to the initial feature request, these are all the devices that need to be provisioned automatically for NVIDIA to officially support display (e.g. X11, Wayland) in Docker. If these devices can be provisioned automatically by the container toolkit, the nvidia/opengl container (nvidia-docker) can properly support the NVIDIA version of XWayland (support is currently being added by NVIDIA developers) and thus support displays.
There are a lot of people waiting for display support in Docker and Kubernetes, especially because NVIDIA is going to support XWayland in the near future. Please implement this feature to streamline this.
Any updates? @klueska I was able to start up an unprivileged X server inside an OCI Docker container with nvidia-docker in https://github.com/ehfd/docker-nvidia-glx-desktop, but thinking ahead to Wayland support (since the 470 driver is out), we likely need this.
Please use https://gitlab.com/arm-research/smarter/smarter-device-manager for /dev/dri/card and /dev/dri/render if you stumble upon this issue.
EGL does not require /dev/dri for NVIDIA devices. VirtualGL has merged support for GLX over EGL without such devices.
Still likely needed for Wayland with GBM.
Thanks @ehfd. We are working on improving the injection of these devices in an upcoming release. Note that the current plan is to do so using the nvidia-container-runtime at an OCI runtime specification level instead of relying on the NVIDIA Container CLI.
Do you have samples containers / test cases that you would be able to provide to ensure that we meet the requirements?
@elezar https://github.com/ehfd/docker-nvidia-glx-desktop/blob/main/entrypoint.sh https://github.com/ehfd/docker-nvidia-egl-desktop/blob/main/entrypoint.sh
These two repositories involve a series of hacks to make NVIDIA GPUs work reliably inside an unprivileged container with a properly accelerated GUI.
docker-nvidia-glx-desktop must install the userspace driver components at startup, mostly following your examples, but only after reading the driver version from /proc/driver/nvidia/version, because the libraries aren't injected into the container.
In the current state, the same userspace driver installation must be done for Wayland by reading /proc/driver/nvidia/version as well. This is undesirable.
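For context, a minimal sketch of the kind of startup hack this requires (the parsing and installer flags are assumptions about how such entrypoints generally work, not an exact copy of the repositories):
# Detect the host kernel driver version from inside the container, then fetch
# the matching userspace installer (illustrative only; flags may differ):
DRIVER_VERSION="$(grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' /proc/driver/nvidia/version | head -n1)"
curl -fsSL -o /tmp/nvidia.run \
  "https://download.nvidia.com/XFree86/Linux-x86_64/${DRIVER_VERSION}/NVIDIA-Linux-x86_64-${DRIVER_VERSION}.run"
sh /tmp/nvidia.run --silent --no-kernel-module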
Also, in docker-nvidia-egl-desktop, where the userspace drivers aren't installed at startup, an annoying situation arises: the display capability of NVIDIA_DRIVER_CAPABILITIES must be included for Vulkan to work, because nvidia_icd.json requires libGLX_nvidia.so.0 (and probably other libraries), even when not using Xorg with the NVIDIA driver.
Vulkan should be possible with only the graphics capability as intended, but it requires display as well. https://github.com/NVIDIA/nvidia-container-toolkit/issues/140 Thankfully, it does work without major modifications to libnvidia-container.
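For reference, a quick way to see why display ends up being required is to inspect the Vulkan ICD manifest and check whether the library it points at resolves inside the container (the manifest path can vary by distribution; /etc/vulkan/icd.d/nvidia_icd.json is the one mentioned here):
# Inspect which library the NVIDIA Vulkan ICD points at:
cat /etc/vulkan/icd.d/nvidia_icd.json      # typically references libGLX_nvidia.so.0
# Check whether that library is actually visible to the dynamic linker:
ldconfig -p | grep libGLX_nvidia.so.0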
And the display capability currently does not enable starting an Xorg server with the NVIDIA driver, because the required libraries are not injected into the container. Hence the hacks applied by these two containers are required.
Please also consider injecting the necessary libraries for NVFBC with the video capability as well, even if the SDK must be installed inside the container.
We really hope that NVIDIA_DRIVER_CAPABILITIES starts working properly and that the hacks my containers apply won't be needed anymore. These can all likely be done at the OCI runtime specification level.
Note that we currently use https://gitlab.com/arm-research/smarter/smarter-device-manager for provisioning /dev/dri devices, but there is no mechanism to expose only the devices belonging to the GPU allocated to the container.
Thanks a lot!
Thanks for all the information. I will comb through it while working on the feature. Hopefully we can improve things significantly!
Hi @elezar @ehfd,
I'm writing a remote Wayland compositor and am currently busy integrating it with k8s, and I can independently confirm everything @ehfd has stated so far, as I've hit all of these issues in the last couple of weeks. Being able to access the /dev/dri/renderD12x and /dev/dri/cardX devices while limiting, and preferably eliminating, a container's startup actions and driver dependencies is an absolute must.
@elezar I'm happy to assist and answer any questions you might have to help move this forward!
Thanks @Zubnix. We have started work on injecting the /dev/dri/cardX devices as part of https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/merge_requests/219
I think in all cases, having a list of the specific devices, libraries, and environment variables that are required in a container for things to work as expected would be quite useful. We will be sure to update this issue as soon as there is something out for testing and early feedback.
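As a rough way to collect such a list from a working (hacked-up) container, something along these lines can be run inside it (purely illustrative; the grep patterns are assumptions about what is relevant):
# Dump the GPU-related devices, libraries, and environment visible in the container:
ls -l /dev/nvidia* /dev/dri 2>/dev/null
ldconfig -p | grep -iE 'nvidia|libgl|libegl|vulkan'
env | grep -i nvidia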
@Zubnix Hi! I've been interested in Greenfield for a long time. Nice to meet you here! I also agree that eliminating a container's driver dependencies is very important. Thanks for your feedback! By the way, do you have any interest in using WebTransport over WebSockets in your project?
@elezar Hi! I saw that the /dev/dri component got merged. https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/commit/f7021d84b555b00857640681136b9b9b08fd067f
I believe that should make Wayland fundamentally work in Kubernetes.
Would it be possible to pass the below library components for enhanced X11/Wayland support? https://download.nvidia.com/XFree86/Linux-x86_64/525.78.01/README/installedcomponents.html
Thanks @ehfd I will have a look at the link you suggested.
@elezar Specifically, I feel the below are necessary for a full X11/Wayland + OpenGL EGL/GLX + Vulkan stack without downloading the driver inside the container.
Anything marked with AND should be injected for either of the capabilities. And as you know well, the generic symlinks to the .so.525.78.01 files should be passed as well.
I also believe that, for practical use, everything in graphics should be injected anyway if display is specified without graphics; otherwise I feel it won't work.
Configuration .json files should be added to the container like the base images do now.
(should be injected to display)
'/usr/lib/xorg/modules/drivers/nvidia_drv.so'
'/usr/lib/xorg/modules/extensions/libglxserver_nvidia.so.525.78.01'
'/usr/bin/nvidia-xconfig'
'/usr/bin/nvidia-settings' + /usr/lib/libnvidia-gtk2.so.525.78.01 and on some platforms /usr/lib/libnvidia-gtk3.so.525.78.01
(should be injected to graphics AND display, probably already injected)
'/usr/lib/libGL.so.1', '/usr/lib/libEGL.so.1', '/usr/lib/libGLESv1_CM.so.525.78.01', '/usr/lib/libGLESv2.so.525.78.01', '/usr/lib/libEGL_nvidia.so.0'
(should be injected to graphics AND display)
'/usr/lib/libOpenGL.so.0', '/usr/lib/libGLX.so.0', and '/usr/lib/libGLdispatch.so.0', '/usr/lib/libnvidia-tls.so.525.78.01'
(currently injected to display only, must be injected for graphics too in order to use Vulkan)
'/usr/lib/libGLX_nvidia.so.0' and the configuration /etc/vulkan/icd.d/nvidia_icd.json
(should be injected to display AND egl, else eglinfo segfaults)
'/usr/lib/libnvidia-egl-wayland.so.1' and the config '/usr/share/egl/egl_external_platform.d/10_nvidia_wayland.json'
'/usr/lib/libnvidia-egl-gbm.so.1' and the config '/usr/share/egl/egl_external_platform.d/15_nvidia_gbm.json'
(should be injected to video AND display)
/usr/lib/libnvidia-fbc.so.525.78.01
(should be injected to graphics AND video)
/usr/lib/libnvoptix.so.1
(should be injected to compute as there is a CUDA and CUVID dependency)
/usr/lib/libnvidia-opticalflow.so.525.78.01
(should be injected to video, not currently injected)
/usr/lib/vdpau/libvdpau_nvidia.so.525.78.01
(should be injected to video)
/usr/lib/libnvidia-encode.so.525.78.01
(should be injected to both compute AND video)
/usr/lib/libnvcuvid.so.525.78.01
(should be injected to compute if not already there)
Two OpenCL libraries (/usr/lib/libOpenCL.so.1.0.0, /usr/lib/libnvidia-opencl.so.525.78.01); the former is a vendor-independent Installable Client Driver (ICD) loader, and the latter is the NVIDIA Vendor ICD. A config file /etc/OpenCL/vendors/nvidia.icd is also installed, to advertise the NVIDIA Vendor ICD to the ICD Loader.
(should be injected to utility)
/usr/lib/libnvidia-ml.so.525.78.01
(should be injected to ngx)
/usr/lib/libnvidia-ngx.so.525.78.01
/usr/bin/nvidia-ngx-updater
/usr/lib/nvidia/wine/nvngx.dll
/usr/lib/nvidia/wine/_nvngx.dll
Various libraries that are used internally by other driver components. These include /usr/lib/libnvidia-cfg.so.525.78.01, /usr/lib/libnvidia-compiler.so.525.78.01, /usr/lib/libnvidia-eglcore.so.525.78.01, /usr/lib/libnvidia-glcore.so.525.78.01, /usr/lib/libnvidia-glsi.so.525.78.01, /usr/lib/libnvidia-glvkspirv.so.525.78.01, /usr/lib/libnvidia-rtcore.so.525.78.01, and /usr/lib/libnvidia-allocator.so.525.78.01.
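As a quick, purely illustrative way to check which of the libraries listed above actually made it into a running container (the sonames below are just examples from the list; add or remove as needed):
# Check a few of the listed libraries against the container's linker cache:
for lib in libGLX_nvidia.so.0 libnvidia-egl-wayland.so.1 libnvidia-egl-gbm.so.1 \
           libnvidia-encode.so libnvcuvid.so libvdpau_nvidia.so; do
  ldconfig -p | grep -q "$lib" && echo "present: $lib" || echo "MISSING: $lib"
done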
As of libnvidia-container 1.14.3-1:
/usr/lib/xorg/modules/drivers/nvidia_drv.so
/usr/lib/xorg/modules/extensions/libglxserver_nvidia.so.525.78.01
libnvidia-egl-gbm.so.1
libnvidia-egl-wayland.so.1
libnvidia-vulkan-producer.so
gbm/nvidia-drm_gbm.so
These important libraries are still not provisioned.
@elezar
@klueska @elezar A reminder for you guys... The below are the only libraries left before I can finally close this three-year-old issue and both X11 and Wayland work inside a container.
This is likely 30 minutes of work for you guys.
Things mostly work now, but only after downloading the .run userspace driver library files inside the container.
/usr/lib/xorg/modules/drivers/nvidia_drv.so
/usr/lib/xorg/modules/extensions/libglxserver_nvidia.so.525.78.01
libnvidia-egl-gbm.so.1
libnvidia-egl-wayland.so.1
libnvidia-vulkan-producer.so
gbm/nvidia-drm_gbm.so
If you can't include some of these into the container toolkit, please tell us why.
@ehfd thanks for the reminder here.
Some of the libraries are already handled by the NVIDIA Container Toolkit -- with the caveat that their detection may be distribution-dependent at the moment. The main thing to change here is where we search for the libraries. There is no technical reason why we haven't done this; the delay is largely caused by resource constraints.
Note that, in theory, if you mount these missing libraries from the host, it should not be necessary to use the .run file to install the userspace libraries in the container.
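A minimal sketch of what such a host mount could look like (the host paths are assumptions and vary by distribution and driver version; the image is a placeholder):
# Bind-mount the missing driver libraries from the host instead of running the .run installer
# (host paths are illustrative; adjust to where your distribution installs them):
docker run --rm -it --runtime=nvidia --gpus all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -v /usr/lib/x86_64-linux-gnu/nvidia/xorg/nvidia_drv.so:/usr/lib/xorg/modules/drivers/nvidia_drv.so:ro \
  -v /usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-egl-gbm.so.1:ro \
  -v /usr/lib/x86_64-linux-gnu/libnvidia-egl-wayland.so.1:/usr/lib/x86_64-linux-gnu/libnvidia-egl-wayland.so.1:ro \
  ubuntu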
If you have capacity to contribute the changes, I would be happy to review these. Note that I would recommend making these against the NVIDIA Container Toolkit where we already inject some of the libraries that you mentioned.
Thank you @elezar. I will assess this within the NVIDIA GitLab repositories and possibly contribute code to inject these packages.
https://gitlab.com/nvidia/container-toolkit/libnvidia-container/-/blob/main/src/nvc_info.c https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/blob/main/internal/discover/graphics.go
These look like the code responsible.
CC @elezar @Zubnix @ABeltramo
@elezar
The core issue seems to be that https://gitlab.com/nvidia/container-toolkit/container-toolkit/-/blob/main/internal/discover/graphics.go is somehow not invoked with Docker. Perhaps this is because the Docker runner is not based on CDI?
To trigger the logic as linked, you need to use the nvidia runtime and ensure that NVIDIA_DRIVER_CAPABILITIES includes graphics or display.
To configure the nvidia runtime for Docker, follow the steps described here.
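For completeness, that configuration step typically boils down to something like the following (assuming a recent NVIDIA Container Toolkit that ships nvidia-ctk; the authoritative steps are in the linked documentation):
# Register the nvidia runtime with Docker and restart the daemon:
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker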
Then we can run a container:
docker run --rm -ti --runtime=nvidia --gpus=all -e NVIDIA_DRIVER_CAPABILITIES=all ubuntu
This does not require CDI support explicitly.
Most of the above issues were probably because the PPA for graphics drivers did not install:
libnvidia-egl-gbm1
libnvidia-egl-wayland1
@elezar I have a contribution.
https://github.com/NVIDIA/nvidia-container-toolkit/pull/490#issuecomment-2104836490
More detailed situation and requirements to close this issue conclusively.
PR to fix Wayland: https://github.com/NVIDIA/nvidia-container-toolkit/pull/548 - Merged.
New issue for X11: https://github.com/NVIDIA/nvidia-container-toolkit/issues/563
Redirected from https://github.com/NVIDIA/k8s-device-plugin/issues/206 to a more suitable repository.
1. Issue or feature description
In Docker and Kubernetes, people have had to set up the host manually to provision the X server using the host path /tmp/.X11-unix. This is quite tedious for sysadmins and at the same time a security threat, as people can spoof the host. To mitigate this, there have been attempts (https://github.com/ehfd/docker-nvidia-glx-desktop, which is based on https://github.com/ryought/glx-docker-headless-gpu) to execute an X server and use GLX inside the container after the GPU has been provisioned using libnvidia-container.
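For context, the tedious host-dependent setup this refers to is essentially sharing the host's X socket into the container, along these lines (illustrative only; the image name is a placeholder):
# Manual host X11 setup that this feature request wants to avoid:
docker run --rm -it \
  -e DISPLAY=$DISPLAY \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  --gpus all my-gui-image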
As an alternative, the developers of VirtualGL (used widely in HPC to enable GPU-based rendering in VNC virtual display environments) have developed a feature that uses the EGL API to enable 3D GL rendering (such as Blender, MATLAB, and Unity), previously only possible with GLX and thus an X server. As you know well, nvidia-docker does not support GLX but introduced the EGL API just under two years ago. See the EGL config section of https://github.com/VirtualGL/virtualgl/issues/113#issuecomment-693127236
EGL is also required to start a Wayland compositor inside a container with the EGLStreams specification on NVIDIA GPUs, which is the way forward now that X11 development has stopped.
These use cases require access to the /dev/dri/cardX devices corresponding to each GPU provisioned using libnvidia-container. However, libnvidia-container does not seem to provision these automatically. I would like to ask whether this is possible and how it can be configured.
2. Steps to reproduce the issue
Provision one GPU inside the container nvidia/cudagl:11.0-devel-ubuntu20.04 or nvidia/opengl:1.2-glvnd-devel-ubuntu20.04 in Docker CE 19.03 (or using one nvidia.com/gpu: 1 with k8s-device-plugin v0.7.0 with default configurations in Kubernetes v1.18.6). Do:
ls /dev
Result: Inside the container you see /dev/nvidiaX, /dev/nvidia-modeset, /dev/nvidia-uvm, and /dev/nvidia-uvm-tools; HOWEVER, the directory /dev/dri does not exist. Wayland compositors are unlikely to start inside a container without DRM devices, and VirtualGL does not work through any devices other than /dev/dri/cardX either.
3. Information to attach (optional if deemed irrelevant)
Other issues and repositories:
Example of a VirtualGL EGL configuration that requires /dev/dri/cardX: https://github.com/ehfd/docker-nvidia-egl-desktop
Implementation of an unprivileged remote desktop bundling an X server with many hacks: https://github.com/ehfd/docker-nvidia-glx-desktop