NVIDIA / nvidia-container-toolkit

Build and run containers leveraging NVIDIA GPUs
Apache License 2.0

v0.5.0 Create CDI Spec `Could not find`, `Could not locate` on COS Triage #473

Open Dragoncell opened 6 months ago

Dragoncell commented 6 months ago

Setup:

With custom changes to the GPU Operator: https://github.com/NVIDIA/gpu-operator/compare/master...Dragoncell:gpu-operator:master-gke

The GPU Operator is installed with the command below, with CDI enabled and the GPU driver installed by COS:

helm upgrade -i --create-namespace --namespace gpu-operator noperator deployments/gpu-operator --set driver.enabled=false --set cdi.enabled=true --set cdi.default=true --set operator.runtimeClass=nvidia-cdi --set hostRoot=/ --set driverRoot=/home/kubernetes/bin/nvidia --set devRoot=/ --set operator.repository=gcr.io/jiamingxu-gke-dev --set operator.version=v0422_04 --set toolkit.installDir=/home/kubernetes/bin/nvidia --set toolkit.repository=gcr.io/jiamingxu-gke-dev  --set toolkit.version=v4 --set validator.repository=gcr.io/jiamingxu-gke-dev --set validator.version=v0417_1 --set devicePlugin.version=v0422_4 --set devicePlugin.repository=gcr.io/jiamingxu-gke-dev

During CDI spec creation, both in the toolkit container (for the management CDI spec) and in the k8s device plugin (for the workload CDI spec), a few warning-level log messages appear.

Both:

  1. Could not find ld.so.cache

    time="2024-04-22T19:37:03Z" level=warning msg="Could not find ld.so.cache at /host/home/kubernetes/bin/nvidia/etc/ld.so.cache; creating empty cache"
    time="2024-04-22T19:37:03Z" level=info msg="Using driver version 535.129.03"
    time="2024-04-22T19:37:03Z" level=warning msg="Could not find ld.so.cache at /host/home/kubernetes/bin/nvidia/etc/ld.so.cache; creating empty cache"
  2. Feature related stuff

    time="2024-04-22T19:37:03Z" level=warning msg="Could not locate /nvidia-persistenced/socket: pattern /nvidia-persistenced/socket not found"
    time="2024-04-22T19:37:03Z" level=warning msg="Could not locate /nvidia-fabricmanager/socket: pattern /nvidia-fabricmanager/socket not found"
    time="2024-04-22T19:37:03Z" level=warning msg="Could not locate /tmp/nvidia-mps: pattern /tmp/nvidia-mps not found"
    time="2024-04-22T19:37:03Z" level=warning msg="Could not locate nvidia/535.129.03/gsp*.bin: pattern nvidia/535.129.03/gsp*.bin not found"

k8s device plugin only

time="2024-04-22T19:37:22Z" level=warning msg="Could not locate glvnd/egl_vendor.d/10_nvidia.json: pattern glvnd/egl_vendor.d/10_nvidia.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate vulkan/icd.d/nvidia_icd.json: pattern vulkan/icd.d/nvidia_icd.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate vulkan/icd.d/nvidia_layers.json: pattern vulkan/icd.d/nvidia_layers.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate vulkan/implicit_layer.d/nvidia_layers.json: pattern vulkan/implicit_layer.d/nvidia_layers.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate egl/egl_external_platform.d/15_nvidia_gbm.json: pattern egl/egl_external_platform.d/15_nvidia_gbm.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate egl/egl_external_platform.d/10_nvidia_wayland.json: pattern egl/egl_external_platform.d/10_nvidia_wayland.json not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/nvoptix.bin: pattern nvidia/nvoptix.bin not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/xorg/nvidia_drv.so: pattern nvidia/xorg/nvidia_drv.so not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/xorg/libglxserver_nvidia.so.535.129.03: pattern nvidia/xorg/libglxserver_nvidia.so.535.129.03 not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate X11/xorg.conf.d/10-nvidia.conf: pattern X11/xorg.conf.d/10-nvidia.conf not found"
....
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/xorg/nvidia_drv.so: pattern nvidia/xorg/nvidia_drv.so not found"
time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/xorg/libglxserver_nvidia.so.535.129.03: pattern nvidia/xorg/libglxserver_nvidia.so.535.129.03 not found"
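Since several of these warnings repeat, it can help to collapse them into one entry per missing pattern before triaging. The following is a minimal sketch, assuming the log lines keep the `time=... level=... msg="..."` shape shown above; the function name `summarize_missing` and the regular expressions are illustrative and not part of the toolkit.

```python
import re
from collections import Counter

# Assumed line shape (taken from the logs above):
#   time="..." level=warning msg="Could not locate X: pattern X not found"
LINE_RE = re.compile(r'level=(?P<level>\w+)\s+msg="(?P<msg>.*)"\s*$')
LOCATE_RE = re.compile(r'^Could not locate (?P<pattern>.+?): pattern ')


def summarize_missing(lines):
    """Count how often each 'Could not locate' pattern appears in warning lines."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if not m or m.group("level") != "warning":
            continue  # skip info lines and anything that does not parse
        located = LOCATE_RE.match(m.group("msg"))
        if located:
            counts[located.group("pattern")] += 1
    return counts


# Sample lines copied from the logs above.
sample = [
    'time="2024-04-22T19:37:03Z" level=info msg="Using driver version 535.129.03"',
    'time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/xorg/nvidia_drv.so: pattern nvidia/xorg/nvidia_drv.so not found"',
    'time="2024-04-22T19:37:22Z" level=warning msg="Could not locate vulkan/icd.d/nvidia_icd.json: pattern vulkan/icd.d/nvidia_icd.json not found"',
    'time="2024-04-22T19:37:22Z" level=warning msg="Could not locate nvidia/xorg/nvidia_drv.so: pattern nvidia/xorg/nvidia_drv.so not found"',
]

summary = summarize_missing(sample)
for pattern, count in sorted(summary.items()):
    print(f"{count}x {pattern}")
```

The `Could not find ld.so.cache` warnings use a different message and are deliberately not counted here.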

Are any of these warnings worth further investigation? For example, vulkan/icd.d/nvidia_icd.json is actually present on the host, under:

/home/kubernetes/bin/nvidia/vulkan/icd.d $ ls
nvidia_icd.json

elezar commented 6 months ago

Could not find ld.so.cache

These warnings can safely be ignored. Using the ldcache is intended for interactive use cases where the driver libraries have already been loaded. I will check whether we can make the logging more useful though.

Feature related stuff

k8s device plugin only

For the graphics libraries, we would have to confirm that these are actually present in the driver installation rooted at /host/home/kubernetes. For nvidia_icd.json specifically, we search /etc, /usr/local/share, and /usr/share by default for vulkan/icd.d/nvidia_icd.json. This means that we miss the /home/kubernetes/bin/nvidia/vulkan/icd.d/nvidia_icd.json file, since it is not under any of those directories. Could you provide a list of the files installed by your driver container?
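To illustrate why the file is missed, here is a minimal sketch of a prefix-based search. It is not the toolkit's implementation (which is written in Go). The function name `locate`, the assumption that each prefix is joined onto the driver root, and the idea of adding the driver root itself as an extra prefix are all assumptions made for illustration.

```python
import os
import tempfile

# Default prefixes named in the comment above. Assumption: each one is
# resolved relative to the driver root.
DEFAULT_PREFIXES = ["/etc", "/usr/local/share", "/usr/share"]


def locate(driver_root, relative_path, prefixes):
    """Return the first existing candidate path, or None if no prefix matches."""
    for prefix in prefixes:
        candidate = os.path.join(driver_root, prefix.lstrip("/"), relative_path)
        if os.path.isfile(candidate):
            return candidate
    return None


# Build a throwaway tree that mimics the reported layout:
#   <driver root>/vulkan/icd.d/nvidia_icd.json
with tempfile.TemporaryDirectory() as root:
    icd_dir = os.path.join(root, "vulkan", "icd.d")
    os.makedirs(icd_dir)
    open(os.path.join(icd_dir, "nvidia_icd.json"), "w").close()

    rel = "vulkan/icd.d/nvidia_icd.json"
    # With only the default prefixes, the file is not found.
    missed = locate(root, rel, DEFAULT_PREFIXES)
    # Hypothetical fix: also search directly under the driver root ("/").
    found = locate(root, rel, DEFAULT_PREFIXES + ["/"])

print("default prefixes:", missed)
print("with driver root:", found)
```

In this sketch the default prefixes return nothing, while the extra `/` prefix resolves the file directly under the driver root.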

Dragoncell commented 6 months ago

Here is the list of files installed under /home/kubernetes/bin/nvidia. It seems that COS only installs vulkan for now:

 /home/kubernetes/bin/nvidia $ ls
NVIDIA-Linux-x86_64-535.129.03.run  bin-workdir  drivers-workdir  lib64          nvidia-drivers-535.129.03.tgz  share    vulkan
bin                                 drivers      firmware         lib64-workdir  nvidia-installer.log           toolkit

So from this triage, we have two action items:

  1. [Optional] gsp*.bin: investigate why this is not working as expected
  2. Add a driverRoot path search for the file vulkan/icd.d/nvidia_icd.json

elezar commented 6 months ago

The gsp firmware warnings could be addressed by https://github.com/NVIDIA/nvidia-container-toolkit/pull/317 (although it would have to be rebased and reworked).
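As a diagnostic aid while investigating the gsp*.bin warning, a recursive glob can show where matching files actually live under a driver root. This is a sketch only: the helper `find_matches` is hypothetical, and the demo tree below (including the `firmware/nvidia/535.129.03/` layout and the file name `gsp_example.bin`) is made up for illustration and is not taken from the COS installation.

```python
import tempfile
from pathlib import Path


def find_matches(driver_root, name_glob):
    """Recursively list files under driver_root whose file name matches name_glob."""
    root = Path(driver_root)
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob(name_glob)
        if p.is_file()
    )


# Hypothetical demo tree; on a real node, pass the actual driver root instead,
# e.g. find_matches("/home/kubernetes/bin/nvidia", "gsp*.bin").
with tempfile.TemporaryDirectory() as tmp:
    fw_dir = Path(tmp) / "firmware" / "nvidia" / "535.129.03"
    fw_dir.mkdir(parents=True)
    (fw_dir / "gsp_example.bin").touch()
    (Path(tmp) / "bin").mkdir()

    demo_matches = find_matches(tmp, "gsp*.bin")

print(demo_matches)
```

Comparing the reported relative paths with the pattern in the warning (`nvidia/535.129.03/gsp*.bin`) shows which parent directory a search would need to include.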

elezar commented 4 months ago

@Dragoncell I have created https://github.com/NVIDIA/nvidia-container-toolkit/pull/529 with a proposed change to locate the Vulkan ICD files as available in the GKE driver.