**Dragoncell** opened 6 months ago

**Setup:**

With a custom change to the GPU Operator (https://github.com/NVIDIA/gpu-operator/compare/master...Dragoncell:gpu-operator:master-gke), I installed the GPU Operator with CDI enabled, on top of the COS-installed GPU driver.

During CDI spec creation, either in the toolkit container for the management CDI spec or in the k8s device plugin for the workload CDI spec, there are a few warning-level logs.

Both:

- `Could not find ld.so.cache`
- Feature related stuff

k8s device plugin only:

- `vulkan/icd.d/nvidia_icd.json`

Wondering: is there any warning worth further investigation? For example, `vulkan/icd.d/nvidia_icd.json` is actually under `/home/kubernetes/bin/nvidia/vulkan/icd.d/` and not at the expected locations under `/host/home/kubernetes`. The detection logic would have to be updated to properly locate these.
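For context, a minimal sketch of such an install, assuming the standard GPU Operator Helm chart options (`cdi.enabled` to turn on CDI, `driver.enabled=false` because COS ships its own driver); this is hypothetical and not necessarily the exact command used here:

```sh
# Hypothetical install sketch; flags assume the standard NVIDIA Helm chart.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

# driver.enabled=false: the driver is pre-installed by COS, not by the operator.
# cdi.enabled=true: have the toolkit generate and use CDI specs.
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set cdi.enabled=true
```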
> Could not find ld.so.cache

These warnings can safely be ignored. Using the ldcache is intended for interactive use cases where the driver libraries have already been loaded. I will check whether we can make the logging more useful, though.
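To see what the warning is about, you can check whether a cache exists under the driver root at all. The `<driver-root>/etc/ld.so.cache` location below is an assumption, not confirmed in this thread; `ldconfig -p` with `-C` prints the contents of an arbitrary cache file:

```sh
# Assumed cache location relative to the driver root (not confirmed in the thread).
CACHE=/host/home/kubernetes/etc/ld.so.cache

if [ -f "$CACHE" ]; then
  # -C selects an alternative cache file, -p prints its entries.
  ldconfig -p -C "$CACHE" | grep -i nvidia
else
  echo "no ld.so.cache under the driver root, hence the warning"
fi
```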
> gsp*.bin

We do support custom firmware paths, and I would have to check why this is not working as expected. I assume it is because we're assuming a `hostDriverRoot` prefix here. It should be noted that it has been pointed out internally that injecting the firmware should not be needed, and that we can ignore this warning.
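To test the prefix theory, one can compare where the firmware actually lives against a doubled-prefix path; both the layout and the doubled prefix below are guesses based on this thread, not confirmed toolkit behaviour:

```sh
# Firmware as reported later in this thread, seen from the toolkit container:
ls /host/home/kubernetes/bin/nvidia/firmware/nvidia/535.129.03/gsp*.bin

# If hostDriverRoot were applied twice, the lookup would land on a
# non-existent doubled path like this one (hypothetical):
ls /host/home/kubernetes/home/kubernetes/bin/nvidia/firmware/nvidia/535.129.03/gsp*.bin 2>/dev/null \
  || echo "nothing at the doubled prefix"
```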
> k8s device plugin only

For the graphics libraries, we would have to confirm that these are actually present in the driver installation rooted at `/host/home/kubernetes`. For `nvidia_icd.json` specifically, we search `/etc`, `/usr/local/share`, and `/usr/share` by default for `vulkan/icd.d/nvidia_icd.json` explicitly. This definitely means that we're missing the `/home/kubernetes/bin/nvidia/vulkan/icd.d/nvidia_icd.json` file. Could you provide a list of the files installed by your driver container?
```
/home/kubernetes/bin/nvidia/firmware/nvidia/535.129.03 $ ls
gsp_ga10x.bin  gsp_tu10x.bin
```
And here is the list of files installed under `/home/kubernetes/bin/nvidia`, so it seems like COS only installs Vulkan for now:
```
/home/kubernetes/bin/nvidia $ ls
NVIDIA-Linux-x86_64-535.129.03.run  bin-workdir  drivers-workdir  lib64          nvidia-drivers-535.129.03.tgz  share    vulkan
bin                                 drivers      firmware         lib64-workdir  nvidia-installer.log           toolkit
```
So from this triage, we have two actions:

- Update the detection logic to locate `vulkan/icd.d/nvidia_icd.json` in the GKE driver installation.
- Address the GSP firmware warnings, which could be done by https://github.com/NVIDIA/nvidia-container-toolkit/pull/317 (although it would have to be rebased and reworked).
@Dragoncell I have created https://github.com/NVIDIA/nvidia-container-toolkit/pull/529 with a proposed change to locate the Vulkan ICD files as available in the GKE driver.
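Once a build with that change is available, a sketch of how one might verify the result (`nvidia-ctk cdi generate` and its `--output` flag exist in current toolkit releases; the output path is just an example):

```sh
# Regenerate the CDI spec against the installed driver and confirm that
# the Vulkan ICD file is now included in the generated edits.
nvidia-ctk cdi generate --output=/var/run/cdi/nvidia.yaml
grep -n "nvidia_icd.json" /var/run/cdi/nvidia.yaml
```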