NVIDIA / gpu-operator

NVIDIA GPU Operator creates, configures, and manages GPUs in Kubernetes
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html
Apache License 2.0

nvidia-cuda-validator vectorAdd fails #277

Closed: ClementGautier closed this issue 3 years ago

ClementGautier commented 3 years ago


1. Issue or feature description

The nvidia-cuda-validator init container fails, reporting that the GPU doesn't support CUDA when it does.

Here are the logs:

[Vector addition of 50000 elements]
Failed to allocate device vector A (error code forward compatibility was attempted on non supported HW)!
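
As far as I can tell, this error string means the CUDA runtime in the sample image attempted forward compatibility, which only works on data-center GPUs, so it usually points to the image's CUDA version being newer than what the installed driver supports. A quick way to compare the two sides (pod name taken from the pod listing below; the image tag shows the sample's CUDA version):

kubectl -n gpu-operator-resources describe pod nvidia-cuda-validator-wzvw6 | grep -i 'image:'
nvidia-smi | grep 'CUDA Version'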

2. Steps to reproduce the issue

Fresh Ubuntu 20.04 install with only containerd and the 450 driver installed.
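
To rule out the operator's own validator, the same vectorAdd sample can be run from a bare pod that just requests a GPU. The image tag below is an assumption; any cuda-sample vectoradd tag that matches the driver's supported CUDA version should behave the same way:

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: vectoradd-test
spec:
  restartPolicy: OnFailure
  containers:
  - name: vectoradd
    image: nvidia/samples:vectoradd-cuda11.2.1
    resources:
      limits:
        nvidia.com/gpu: 1
EOF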

3. Information to attach (optional if deemed irrelevant)

k get pods -n gpu-operator-resources
NAME                                       READY   STATUS                  RESTARTS   AGE
gpu-feature-discovery-xnr6v                1/1     Running                 0          43m
nvidia-container-toolkit-daemonset-b56sg   1/1     Running                 0          43m
nvidia-cuda-validator-test                 1/1     Running                 0          12m
nvidia-cuda-validator-wzvw6                0/1     Init:CrashLoopBackOff   6          7m43s
nvidia-dcgm-exporter-pn5kl                 1/1     Running                 3          43m
nvidia-dcgm-kccsh                          1/1     Running                 0          43m
nvidia-device-plugin-daemonset-lm422       1/1     Running                 0          43m
nvidia-operator-validator-mq2f2            0/1     Init:CrashLoopBackOff   6          43m
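
The actual failure details live in the init containers of the two CrashLooping pods; their logs can be pulled directly (the init container name below is what recent operator versions use and may differ slightly between releases):

kubectl -n gpu-operator-resources logs nvidia-cuda-validator-wzvw6 -c cuda-validation
kubectl -n gpu-operator-resources logs nvidia-operator-validator-mq2f2 -c cuda-validation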
# ls -al /run/nvidia
total 4
drwxr-xr-x  3 root root   80 Nov  8 15:19 .
drwxr-xr-x 37 root root 1100 Nov  8 16:03 ..
-rw-r--r--  1 root root    6 Nov  8 15:19 toolkit.pid
drwxr-xr-x  2 root root   80 Nov  8 15:19 validations
# ls -al /run/nvidia/validations/
total 0
drwxr-xr-x 2 root root 80 Nov  8 15:19 .
drwxr-xr-x 3 root root 80 Nov  8 15:19 ..
-rw-r--r-- 1 root root  0 Nov  8 15:19 driver-ready
-rw-r--r-- 1 root root  0 Nov  8 15:19 toolkit-ready
# ls -al /usr/local/nvidia/toolkit/
drwxr-xr-x 3 root root    4096 Nov  8 15:19 .
drwxr-xr-x 3 root root      21 Nov  8 15:19 ..
drwxr-xr-x 3 root root      38 Nov  8 15:19 .config
lrwxrwxrwx 1 root root      28 Nov  8 15:19 libnvidia-container.so.1 -> libnvidia-container.so.1.5.1
-rwxr-xr-x 1 root root  179216 Nov  8 15:19 libnvidia-container.so.1.5.1
-rwxr-xr-x 1 root root     154 Nov  8 15:19 nvidia-container-cli
-rwxr-xr-x 1 root root   43024 Nov  8 15:19 nvidia-container-cli.real
-rwxr-xr-x 1 root root     342 Nov  8 15:19 nvidia-container-runtime
-rwxr-xr-x 1 root root     350 Nov  8 15:19 nvidia-container-runtime-experimental
-rwxr-xr-x 1 root root 3991000 Nov  8 15:19 nvidia-container-runtime.experimental
lrwxrwxrwx 1 root root      24 Nov  8 15:19 nvidia-container-runtime-hook -> nvidia-container-toolkit
-rwxr-xr-x 1 root root 2359384 Nov  8 15:19 nvidia-container-runtime.real
-rwxr-xr-x 1 root root     198 Nov  8 15:19 nvidia-container-toolkit
-rwxr-xr-x 1 root root 2147896 Nov  8 15:19 nvidia-container-toolkit.real
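
Since containerd is the container runtime here, it is also worth confirming that the toolkit actually rewired containerd to use the nvidia runtime binaries listed above (the config path below is containerd's default and may differ on a customized install):

grep -B2 -A5 'nvidia' /etc/containerd/config.toml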

I also enabled the debug flag on the nvidia-container-runtime as suggested here, but I don't see much useful logging there, even after restarting the pod:

# tail -n 50 /var/log/nvidia-container-runtime.log
2021/11/08 16:16:07 No modification required
2021/11/08 16:16:07 Forwarding command to runtime
2021/11/08 16:16:07 Bundle directory path is empty, using working directory.
2021/11/08 16:16:07 Using bundle directory: /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2cdfae12db59fba3edf4d05b3b31caa93db9f0fd312a2a18f2999c6ca71aff55
2021/11/08 16:16:07 Using OCI specification file path: /run/containerd/io.containerd.runtime.v1.linux/k8s.io/2cdfae12db59fba3edf4d05b3b31caa93db9f0fd312a2a18f2999c6ca71aff55/config.json
2021/11/08 16:16:07 Looking for runtime binary 'docker-runc'
2021/11/08 16:16:07 Runtime binary 'docker-runc' not found: exec: "docker-runc": executable file not found in $PATH
2021/11/08 16:16:07 Looking for runtime binary 'runc'
2021/11/08 16:16:07 Found runtime binary '/usr/bin/runc'
2021/11/08 16:16:07 Running /usr/local/nvidia/toolkit/nvidia-container-runtime.real
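
The entries above look like they belong to an unrelated container ("No modification required"); to capture the validator's own attempt, the log can be tailed while the failing pod is recreated (pod name taken from the listing above):

tail -f /var/log/nvidia-container-runtime.log &
kubectl -n gpu-operator-resources delete pod nvidia-operator-validator-mq2f2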
[root@nvidia-cuda-validator-test /]# nvidia-smi
Mon Nov  8 15:13:46 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.142.00   Driver Version: 450.142.00   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  On   | 00000000:02:00.0 Off |                  N/A |
| 20%   45C    P8    18W / 250W |      1MiB / 11176MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  On   | 00000000:04:00.0 Off |                  N/A |
| 20%   30C    P8     8W / 250W |      1MiB / 11178MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

From a container (image ml-workspace-gpu) based on a CUDA base image, I can run nvcc successfully:

▶ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Sun_Feb_14_21:12:58_PST_2021
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
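
Note that nvcc only reports the CUDA toolkit shipped in the image (11.2 here); the driver still caps what the runtime can actually execute, so comparing the two on the same node is a quick sanity check:

▶ nvcc --version | grep release
▶ nvidia-smi --query-gpu=driver_version --format=csv,noheader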
ClementGautier commented 3 years ago

Soooo, I fixed this by using the latest driver (470) instead of the 450 I was using. I guess there is a version mismatch between the CUDA samples used in the image and the driver.
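
For anyone hitting the same thing with a host-managed driver, on Ubuntu 20.04 the newer driver branch can be installed from the standard repositories (the package name below follows Ubuntu's usual naming scheme and is an assumption about your setup):

sudo apt update && sudo apt install -y nvidia-driver-470
sudo reboot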

skirsten commented 2 years ago

Hi,

I am having exactly the same problem. Everything was fine with the 510 driver, but with 470 I get this exact same error. The problem is that, according to the GPU Operator Component Matrix, I should be fully supported.

The only strange thing I found is that the node labels are wrong:

nvidia.com/cuda.runtime.major=11
nvidia.com/cuda.runtime.minor=7

but according to nvidia-smi, version 11.4 is installed.
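
A quick way to dump every CUDA-related label that gpu-feature-discovery applied, for comparison against nvidia-smi (requires jq; adjust the node selection as needed):

kubectl get nodes -o json | jq '.items[].metadata.labels | with_entries(select(.key | startswith("nvidia.com/cuda")))'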

Did you encounter this problem again, or do you have an idea how to fix it?

skirsten commented 2 years ago

I was actually able to fix the validation by overriding the validator container to an old version:

validator:
  version: "v1.9.1"
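
If the operator was installed via Helm, the equivalent override can be applied in place (release name and namespace below are assumptions; adjust them to your install):

helm upgrade gpu-operator nvidia/gpu-operator -n gpu-operator --reuse-values --set validator.version=v1.9.1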