NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

The NVIDIA ICD JSON occasionally goes missing from 'nvidia-ctk cdi generate' #767

Open debarshiray opened 3 weeks ago

debarshiray commented 3 weeks ago

I have been playing with the NVIDIA Container Toolkit on Fedora 39 Workstation with the proprietary NVIDIA driver from RPM Fusion. I have noticed that the NVIDIA installable client driver (or ICD) JSON for Vulkan occasionally goes missing from the output of nvidia-ctk cdi generate:

$ nvidia-ctk cdi generate --format yaml 2>/dev/null | grep vulkan
 - containerPath: /etc/vulkan/implicit_layer.d/nvidia_layers.json
   hostPath: /usr/share/vulkan/implicit_layer.d/nvidia_layers.json

... even though the file is present on the host operating system at /usr/share/vulkan/icd.d/nvidia_icd.x86_64.json and Vulkan support on the host is confirmed by:

$ vulkaninfo --summary
...
...
Devices:
========
GPU0:
    apiVersion         = 1.3.280
    driverVersion      = 560.35.3.0
    vendorID           = 0x10de
    deviceID           = 0x1cbc
    deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
    deviceName         = Quadro P600
    driverID           = DRIVER_ID_NVIDIA_PROPRIETARY
    driverName         = NVIDIA
    driverInfo         = 560.35.3.0
    conformanceVersion = 1.3.8.2
    deviceUUID         = 2efa4848-ba99-ccd3-0a19-f497b31331ca
    driverUUID         = c3ca0510-c7e6-5f1c-86a1-dc0ed4ea4e21
...
...

This means that Podman containers don't have Vulkan support through the proprietary NVIDIA driver, and can only use LLVMpipe.
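
A quick way to tell the two states apart is to test the generated spec for an ICD entry directly. The following is a minimal sketch, not part of the toolkit: the function name cdi_lists_vulkan_icd is hypothetical, and it assumes that ICD manifests show up in the spec as unquoted hostPath entries under a vulkan/icd.d/ directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads a CDI spec in YAML on stdin and succeeds
# only if some hostPath entry points at a JSON file under vulkan/icd.d/.
# Assumption: paths are unquoted, as in the grep output shown above.
cdi_lists_vulkan_icd() {
    grep -Eq 'hostPath:[[:space:]]*[^[:space:]]*/vulkan/icd\.d/[^[:space:]/]+\.json[[:space:]]*$'
}

# Usage (commented out so that the snippet has no side effects):
# if nvidia-ctk cdi generate --format yaml 2>/dev/null | cdi_lists_vulkan_icd; then
#     echo "ICD JSON is listed"
# else
#     echo "ICD JSON is missing from the generated spec"
# fi
```

With the output shown above, where only nvidia_layers.json is listed, the check would fail, because implicit_layer.d does not match the icd.d pattern.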

Right now, I am observing this problem with:

$ uname --kernel-release
6.11.4-101.fc39.x86_64
$ rpm -q kernel
kernel-6.5.6-300.fc39.x86_64
kernel-6.11.4-101.fc39.x86_64
$ rpm -q kmod-nvidia
kmod-nvidia-560.35.03-1.fc39.x86_64

debarshiray commented 3 weeks ago

I forgot to mention the NVIDIA Container Toolkit version:

$ nvidia-ctk --version
NVIDIA Container Toolkit CLI version 1.16.1
$ rpm -qf $(which nvidia-ctk)
golang-github-nvidia-container-toolkit-1.16.1-1.fc39.x86_64

Note that the NVIDIA Container Toolkit version didn't change between the runs where the NVIDIA ICD JSON for Vulkan was listed and the runs where it wasn't. What changed was that I pulled in the RPM updates for the rest of the Fedora host.

elezar commented 3 weeks ago

@debarshiray the host path you mention, /usr/share/vulkan/icd.d/nvidia_icd.x86_64.json, is not one that we explicitly search for. Could you please confirm which package provides that file? It could be that the 560.35.3.0 driver you're using now ships the file with the architecture string in its name.

(Looking at some older internal documentation, it seems that this has been the case for a while.)
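
To see which file names are actually present on a given host, something like the following could help. It is only a sketch: the function name list_nvidia_icd_manifests is hypothetical, and the two default directories are assumed to be the relevant Vulkan manifest locations on the host; it says nothing about which names the toolkit itself searches for.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints every nvidia_icd*.json found in the given
# directories. With no arguments it falls back to two common Vulkan
# manifest directories (an assumption about the host layout).
list_nvidia_icd_manifests() {
    local dir file
    if [ "$#" -eq 0 ]; then
        set -- /etc/vulkan/icd.d /usr/share/vulkan/icd.d
    fi
    for dir in "$@"; do
        for file in "$dir"/nvidia_icd*.json; do
            # An unmatched glob stays literal, so skip paths that do not exist.
            [ -e "$file" ] && printf '%s\n' "$file"
        done
    done
    return 0
}

# Usage:
# list_nvidia_icd_manifests
```

Each printed path can then be passed to rpm --query --file to find the package that owns it.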

debarshiray commented 3 weeks ago

Thanks for looking into it, @elezar !

Meanwhile, I reinstalled different versions of Fedora a few times to see if the problem is specific to a particular combination of package versions. I could reproduce it reliably on Fedora 40 and 41, which was surprising because this used to work. :)

Now with Fedora 41 Workstation and the proprietary NVIDIA driver from RPM Fusion, I see:

$ rpm --query --file /usr/share/vulkan/icd.d/nvidia_icd.x86_64.json
xorg-x11-drv-nvidia-libs-560.35.03-5.fc41.x86_64

If I force /usr/share/vulkan/icd.d/nvidia_icd.x86_64.json to be present inside the container through an explicit bind mount then I do get Vulkan support through the proprietary NVIDIA driver.
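
The workaround could be sketched roughly as follows. This is not the exact command used here: the function name icd_volume_arg is hypothetical, mounting the file at the same path inside the container is an assumption, and the usage line assumes that a CDI spec exposing nvidia.com/gpu=all has already been generated and installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints a read-only --volume argument for podman
# that maps the host ICD manifest to the same path in the container.
# Using the same path on both sides is an assumption.
icd_volume_arg() {
    local icd="${1:-/usr/share/vulkan/icd.d/nvidia_icd.x86_64.json}"
    if [ ! -f "$icd" ]; then
        echo "ICD manifest not found: $icd" >&2
        return 1
    fi
    printf -- '--volume=%s:%s:ro\n' "$icd" "$icd"
}

# Usage (IMAGE is a placeholder for an image that ships vulkaninfo):
# podman run --rm --device nvidia.com/gpu=all \
#     "$(icd_volume_arg)" IMAGE vulkaninfo --summary
```

The helper returns a non-zero status when the manifest is absent, so a missing file on the host is reported before the container is started.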

In all cases, Vulkan support is available through the proprietary driver on the host operating system, as shown in the vulkaninfo --summary snippet above.

elezar commented 2 weeks ago

Who is the publisher of the xorg-x11-drv-nvidia-libs-560.35.03-5.fc41.x86_64 package above?

debarshiray commented 2 weeks ago

> Who is the publisher of the xorg-x11-drv-nvidia-libs-560.35.03-5.fc41.x86_64 package above?

It's RPM Fusion. That's where I got the proprietary NVIDIA driver from.