Closed: ClementGautier closed this issue 3 years ago
Soooo, I fixed this by using the latest driver (470) instead of the 450 I was using. I guess there is a version mismatch between the CUDA samples used in the image and the driver.
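For anyone comparing versions on their own node, here is a minimal sketch (run on the host where the driver is installed) of how to check what the driver reports; the exact output layout depends on the driver release:

# Driver version as reported by the management tool
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# The plain nvidia-smi header also shows the highest CUDA version the driver supports
nvidia-smi

# Kernel module build string
cat /proc/driver/nvidia/version

If the CUDA runtime baked into the validator/sample image is newer than what the driver supports, the sample can fail even though the GPU itself is fine.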
Hi,
I am having exactly the same problem. Everything was fine with the 510 driver, but with 470 I get this exact same error. The problem is that, according to the GPU Operator Component Matrix, I should be fully supported:
The only strange thing I found is that the node labels are wrong:
nvidia.com/cuda.runtime.major=11
nvidia.com/cuda.runtime.minor=7
but according to nvidia-smi, CUDA version 11.4 is installed.
Did you encounter this problem again, or do you have an idea how to fix this?
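For reference, the labels in question can be inspected directly on the node object; a sketch below, where <node-name> is a placeholder for one of the GPU nodes and jq is assumed to be available:

# Show only the CUDA-related labels that GPU Feature Discovery applied
kubectl get node <node-name> -o json \
  | jq '.metadata.labels | with_entries(select(.key | startswith("nvidia.com/cuda")))'

# Or, without jq
kubectl describe node <node-name> | grep 'nvidia.com/cuda'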
I was actually able to fix the validation by overriding the validator container with an old version:
validator:
  version: "v1.9.1"
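Assuming the operator was installed with Helm, an override like that can be applied roughly as below; the release name, repo alias, and namespace are placeholders, and the ClusterPolicy instance name (usually cluster-policy) can be checked with kubectl get clusterpolicies:

# Pin the validator image version through the Helm values
helm upgrade gpu-operator nvidia/gpu-operator \
  -n gpu-operator \
  --set validator.version=v1.9.1

# Or patch the existing ClusterPolicy in place
kubectl patch clusterpolicy cluster-policy --type merge \
  -p '{"spec": {"validator": {"version": "v1.9.1"}}}'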
1. Quick Debug Checklist
Are i2c_core and ipmi_msghandler loaded on the nodes? ipmi_msghandler is loaded, but not i2c_core.
Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)? Yes.

1. Issue or feature description
The nvidia-cuda-validator init container fails, saying that the GPU doesn't support CUDA when it does.
Here are the logs:
2. Steps to reproduce the issue
Fresh Ubuntu 20.04 install with only containerd and the 450 driver installed.
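As a sanity check on such a node before deploying the operator, something like the following can confirm what is actually present (assuming the stock file locations; the operator's container-toolkit normally injects the nvidia runtime into containerd itself):

# Driver is loaded and sees the GPU
nvidia-smi

# containerd is installed, and which version
containerd --version

# Check whether any nvidia runtime entries already exist in the containerd config
grep -A3 'nvidia' /etc/containerd/config.toml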
3. Information to attach (optional if deemed irrelevant)
I also activated the debug flag on the nvidia-container-runtime as suggested here, but I don't see many useful logs in there, even after restarting the pod:
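For reference, a sketch of where that debug output is configured and how to follow it on the node; the paths below are the conventional defaults and may differ on a given install:

# Debug logging is enabled in /etc/nvidia-container-runtime/config.toml by
# uncommenting the "debug = ..." lines in the [nvidia-container-cli] and
# [nvidia-container-runtime] sections, then recreating the affected pod.
# Follow the resulting logs on the node:
sudo tail -f /var/log/nvidia-container-runtime.log /var/log/nvidia-container-toolkit.log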
From a container (image ml-workspace-gpu) based on CUDA, I can successfully run nvcc: