NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

Dynamically expose NVIDIA X.Org X11 display server libraries and configure the container correctly #563

Open ehfd opened 3 months ago

ehfd commented 3 months ago

See also: https://github.com/NVIDIA/libnvidia-container/issues/118

This issue arises because the X11 graphical libraries are not pushed into the Windows Subsystem for Linux (WSL), so a full-fledged X11 server cannot be deployed there.

Because RM_VERSION tends to differ between WSL and regular Linux, it is also not possible to download the driver libraries inside the container and unpack them.

Solving this would also benefit regular Linux container environments.


On top of PR #548, it would be ideal if the container toolkit could push nvidia_drv.so and libglxserver_nvidia.so.* into /usr/lib/x86_64-linux-gnu/nvidia/xorg/ (in other words, libRoot + /nvidia/xorg/) inside the container, regardless of where they were found on the host. The toolkit should also generate a /usr/share/X11/xorg.conf.d/10-nvidia.conf file with the following content, regardless of whether 10-nvidia.conf exists on the host (mind that the module path differs between distributions and architectures):

Section "OutputClass"
    Identifier "nvidia"
    MatchDriver "nvidia-drm"
    Driver "nvidia"
    Option "AllowEmptyInitialConfiguration"
    ModulePath "/usr/lib/x86_64-linux-gnu/nvidia/xorg"
EndSection

The above is the default behavior of the nvidia-driver-550 package from Ubuntu APT, and I generally like this approach a lot.
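To make the request concrete, here is a minimal sketch of how the configuration text could be generated from a library root. It is only an illustration in Python, not the toolkit's actual code (the toolkit itself is not written this way); the function names `xorg_module_path` and `render_nvidia_output_class` are hypothetical.

```python
def xorg_module_path(lib_root: str) -> str:
    """Build the in-container module path as libRoot + /nvidia/xorg."""
    return lib_root.rstrip("/") + "/nvidia/xorg"


def render_nvidia_output_class(module_path: str) -> str:
    """Return the text of a 10-nvidia.conf OutputClass section.

    Only ModulePath varies; the other lines are copied from the
    section quoted in this issue.
    """
    lines = [
        'Section "OutputClass"',
        '    Identifier "nvidia"',
        '    MatchDriver "nvidia-drm"',
        '    Driver "nvidia"',
        '    Option "AllowEmptyInitialConfiguration"',
        f'    ModulePath "{module_path}"',
        'EndSection',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical usage: the library root would come from the toolkit's
    # own detection of the container's distribution and architecture.
    print(render_nvidia_output_class(xorg_module_path("/usr/lib/x86_64-linux-gnu")), end="")
```

Passing a different library root (for example an aarch64 multiarch directory) changes only the ModulePath line, which is the part that varies between distributions and architectures.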


nvidia-xconfig (and possibly also nvidia-config, together with its dependencies, the NVIDIA GTK libraries and libnvidia-wayland-client.so) should also be pushed into the container for a full X11 experience.

Then, the X11 aspect will be solved and we developers can call it a day.

Also, injecting 32-bit libraries is definitely desirable for usage with Wine/Proton/etc.


Notes:

  --x-prefix=X-PREFIX
      The prefix under which the X components of the NVIDIA driver will be installed; the default is '/usr/X11R6'
      unless nvidia-installer detects that X.Org >= 7.0 is installed, in which case the default is '/usr'.  Only under
      rare circumstances should this option be used.

  --xfree86-prefix=XFREE86-PREFIX
      This is a deprecated synonym for --x-prefix.

  --x-module-path=X-MODULE-PATH
      The path under which the NVIDIA X server modules will be installed.  If this option is not specified,
      nvidia-installer uses the following search order and selects the first valid directory it finds: 1) `X
      -showDefaultModulePath`, 2) `pkg-config --variable=moduledir xorg-server`, or 3) the X library path (see the
      '--x-library-path' option) plus either 'modules' (for X servers older than X.Org 7.0) or 'xorg/modules' (for
      X.Org 7.0 or later).

  --x-library-path=X-LIBRARY-PATH
      The path under which the NVIDIA X libraries will be installed.  If this option is not specified, nvidia-installer
      uses the following search order and selects the first valid directory it finds: 1) `X -showDefaultLibPath`, 2)
      `pkg-config --variable=libdir xorg-server`, or 3) the X prefix (see the '--x-prefix' option) plus 'lib' on 32bit
      systems, and either 'lib64' or 'lib' on 64bit systems, depending on the installed Linux distribution.

  --x-sysconfig-path=X-SYSCONFIG-PATH
      The path under which X system configuration files will be installed.  If this option is not specified,
      nvidia-installer uses the following search order and selects the first valid directory it finds: 1) `pkg-config
      --variable=sysconfigdir xorg-server`, or 2) /usr/share/X11/xorg.conf.d.

For example, the behavior described above (visible through ./nvidia-installer -A after running sh NVIDIA-Linux-x86_64-550.78.run -x) is what causes the issue here when someone installs the NVIDIA driver on a host that is meant to be a K8s node and has no X-related libraries installed.

WARNING: nvidia-installer was forced to guess the X library path '/usr/lib64' and X module path
           '/usr/lib64/xorg/modules'; these paths were not queryable from the system.  If X fails to find the NVIDIA X
           driver module, please install the `pkg-config` utility and the X.Org SDK/development package for your
           distribution and reinstall the driver.

Leading to:

/usr/lib64/xorg/modules/extensions/libglxserver_nvidia.so
/usr/lib64/xorg/modules/extensions/libglxserver_nvidia.so.550.78
/usr/lib64/xorg/modules/drivers/nvidia_drv.so

On container hosts, neither X (provided by xserver-xorg in Ubuntu) nor the pkg-config file for xorg-server (provided by xserver-xorg-dev in Ubuntu) is typically installed (especially on K8s clusters).
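The module path search order from the installer notes can be sketched as follows. This is an approximation written from the help text quoted above, not nvidia-installer's real implementation: it covers only the X.Org 7.0 or later fallback ('xorg/modules'), treats "valid" as "is an existing directory", and reads stderr when stdout is empty, which is a defensive assumption. The names `run_query` and `guess_x_module_path` are hypothetical.

```python
import os
import posixpath
import shlex
import subprocess
from typing import Callable, Optional, Tuple


def run_query(command: str) -> Optional[str]:
    """Run a query command and return its trimmed output, or None on failure."""
    try:
        result = subprocess.run(
            shlex.split(command),
            capture_output=True,
            text=True,
            timeout=10,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # The binary is missing (typical on a K8s node) or it hung.
        return None
    if result.returncode != 0:
        return None
    # Assumption: accept stderr if stdout is empty.
    output = result.stdout.strip() or result.stderr.strip()
    return output or None


def guess_x_module_path(
    x_library_path: str,
    run: Callable[[str], Optional[str]] = run_query,
    is_dir: Callable[[str], bool] = os.path.isdir,
) -> Tuple[str, bool]:
    """Return (module_path, was_guessed) following the documented order."""
    queries = (
        "X -showDefaultModulePath",
        "pkg-config --variable=moduledir xorg-server",
    )
    for command in queries:
        candidate = run(command)
        if candidate and is_dir(candidate):
            return candidate, False
    # Fallback 3: X library path plus 'xorg/modules' (X.Org 7.0 or later).
    return posixpath.join(x_library_path, "xorg", "modules"), True


if __name__ == "__main__":
    path, guessed = guess_x_module_path("/usr/lib64")
    print(path, "(guessed)" if guessed else "(queried)")
```

On a host with neither X nor the xorg-server pkg-config file, both queries fail and the function falls through to the guessed path, which corresponds to the situation the installer warns about.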

If they exist in Ubuntu:

  X -showDefaultModulePath: /usr/lib/xorg/modules
  pkg-config --variable=moduledir xorg-server: /usr/lib/xorg/modules
  X -showDefaultLibPath: /usr/lib/x86_64-linux-gnu
  pkg-config --variable=libdir xorg-server: /usr/lib/x86_64-linux-gnu
  pkg-config --variable=sysconfigdir xorg-server: /usr/share/X11/xorg.conf.d

Similarly for libglvnd:

  pkg-config --variable=datadir libglvnd: /usr/share

These are the candidate "environment variables" we are looking for, at least for .run installers. The same paths are not guaranteed when a distro-provided NVIDIA driver package is installed instead of the .run file.
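Since the files can end up in different places depending on how the driver was installed, one possible approach is to search several candidate roots on the host and map whatever is found onto the fixed in-container directory. The sketch below only illustrates that idea in Python; it is not how the toolkit discovers files today, and `find_x_modules`, `plan_mounts`, and the list of search roots are hypothetical.

```python
import fnmatch
import os
import posixpath
from typing import Dict, Iterable, List

# File name patterns taken from this issue.
X_MODULE_PATTERNS = ("nvidia_drv.so", "libglxserver_nvidia.so.*")


def find_x_modules(search_roots: Iterable[str]) -> List[str]:
    """Walk each root and collect files matching the X module patterns."""
    found = []
    for root in search_roots:
        # os.walk yields nothing for a root that does not exist.
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                if any(fnmatch.fnmatch(name, pat) for pat in X_MODULE_PATTERNS):
                    found.append(os.path.join(dirpath, name))
    return found


def plan_mounts(host_paths: Iterable[str], lib_root: str) -> Dict[str, str]:
    """Map each host file to libRoot + /nvidia/xorg/<file name> in the container."""
    target_dir = posixpath.join(lib_root, "nvidia", "xorg")
    return {
        path: posixpath.join(target_dir, os.path.basename(path))
        for path in host_paths
    }


if __name__ == "__main__":
    # Hypothetical candidate roots: the guessed .run installer path and
    # the Ubuntu package locations mentioned in this issue.
    roots = [
        "/usr/lib64/xorg/modules",
        "/usr/lib/xorg/modules",
        "/usr/lib/x86_64-linux-gnu/nvidia/xorg",
    ]
    for src, dst in plan_mounts(find_x_modules(roots), "/usr/lib/x86_64-linux-gnu").items():
        print(f"{src} -> {dst}")
```

Flattening everything into one target directory matches the ModulePath approach in the 10-nvidia.conf section above, where the X server is pointed at a single NVIDIA-specific module directory.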