wstarzak opened this issue 3 years ago
@wstarzak can you attach the output of kubectl describe pod <operator-validator-pod-name> -n gpu-operator-resources and also the logs from the failing initContainer: kubectl logs <operator-validator-pod-name> -c <init-container-name> -n gpu-operator-resources. Are all the other pods running fine? Can you paste the output of kubectl get pods -n gpu-operator-resources?
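Concretely, something like this (replace the placeholders with whatever kubectl get pods shows for the validator pod and its failing init container):
kubectl get pods -n gpu-operator-resources
kubectl describe pod <operator-validator-pod-name> -n gpu-operator-resources
# list the init container names, then pull logs from the failing one
kubectl get pod <operator-validator-pod-name> -n gpu-operator-resources -o jsonpath='{.spec.initContainers[*].name}'
kubectl logs <operator-validator-pod-name> -c <init-container-name> -n gpu-operator-resources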
I was "sorta" able to get this to work...
OS: Ubuntu 20.04.3 LTS (5.4.0-91-generic)
Kubernetes: K3s v1.22.3+k3s1
Containerd: 1.5.7-k3s2
GPU: P620
After a lot of frustration, I discovered the GPU wasn't listed in lspci. As a first step, I would suggest making sure the node shows your NVIDIA card by running lspci | grep NVIDIA.
NOTE: In my particular case, using a Lenovo P340 Tiny, I had to have the VGA Mode set to Auto in the BIOS.
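For illustration, a properly detected P620 shows up along these lines (the exact text will vary):
$ lspci | grep -i nvidia
02:00.0 VGA compatible controller: NVIDIA Corporation GP107GL [Quadro P620] (rev a1)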
Disable Nouveau on the GPU node. Follow the docs here.
NOTE: I didn't see this called out in the Operator docs anymore, so this may no longer be needed.
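For reference, the usual Ubuntu approach is to blacklist the module and rebuild the initramfs; a minimal sketch (standard procedure, not taken from the Operator docs):
cat <<'EOF' | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u
sudo reboot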
I could not get the driver-install functionality of the Helm Chart/Operator to work. I suspect it only supports Enterprise/Datacenter GPUs. I imagine this could be customized to install the appropriate drivers for my GPU, but for me it was easier to simply install the drivers manually. To do this, run:
apt-get update && \
apt-get install nvidia-headless-470-server nvidia-utils-470-server
NOTE: 470 was the latest as of this writing. Check for the latest by running
apt-cache search nvidia-driver
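After installing (and rebooting if needed), a quick host-side sanity check that the driver actually loaded:
nvidia-smi            # should list the GPU and driver version
lsmod | grep nvidia   # the kernel modules should be loaded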
Install the Operator via the Helm Chart. You can mostly follow the directions here but use the following Helm values:
dcgm:
  enabled: false
migManager:
  enabled: false
driver:
  enabled: false
toolkit:
  enabled: true
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"
For me, I needed to disable the Datacenter features of dcgm and migManager as I am not using a DC GPU. You need to override the default containerd values because k3s installs these in a non-default location. After doing this, I noticed my /var/lib/rancher/k3s/agent/etc/containerd/config.toml was updated with:
[plugins.cri.containerd.runtimes."nvidia"]
runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia".options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
[plugins.cri.containerd.runtimes."nvidia-experimental"]
runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia-experimental".options]
BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"
This is a bit different from what the docs say, and it's what got me hung up.
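A quick way I'd check that the runtime actually got wired up (hedged: the RuntimeClass object may or may not be created depending on the Operator version):
sudo grep -A 2 'runtimes."nvidia"' /var/lib/rancher/k3s/agent/etc/containerd/config.toml
kubectl get runtimeclass   # look for an entry named nvidia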
After a bit of time, the Pods eventually go all green. Now here's where the "sorta" comes in... I tried running the test Pod described here, but I got the error: Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)!
I then tried running the below Pod:
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: gpu
      image: "nvidia/cuda:11.4.1-base-ubuntu20.04"
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      resources:
        limits:
          nvidia.com/gpu: 1
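For reference, applying it and getting a shell is just the usual kubectl dance (gpu-test.yaml is simply my name for the manifest above):
kubectl apply -f gpu-test.yaml
kubectl exec -it gpu-test -- /bin/bash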
I exec'd into the Pod, installed nvidia-utils (apt install nvidia-utils-470-server), and ran nvidia-smi, which gave me the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01 Driver Version: 470.82.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Quadro P620 Off | 00000000:02:00.0 Off | N/A |
| 34% 35C P8 N/A / N/A | 0MiB / 2000MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Since running this command in the container shows the NVIDIA GPU, I believe it means it's working.
UPDATE: The above does work, but it should be noted that the NVIDIA Operator hijacks /var/lib/rancher/k3s/agent/etc/containerd/config.toml and doesn't allow changes.
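If you do need to carry manual containerd changes on k3s, the documented k3s mechanism is a config.toml.tmpl, roughly as sketched below; I haven't verified how this interacts with the Operator's toolkit container, which also rewrites the file.
# k3s regenerates config.toml on start; persistent edits belong in config.toml.tmpl
sudo cp /var/lib/rancher/k3s/agent/etc/containerd/config.toml \
        /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
# edit config.toml.tmpl as needed, then restart k3s to re-render the config
sudo systemctl restart k3s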
I hope this helps.
This works for me:
$ docker run --rm -it --gpus all nvidia/cuda:12.2.2-base-ubuntu22.04 nvidia-smi
Wed Jun 19 12:20:48 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Quadro RTX 4000 On | 00000000:05:00.0 Off | N/A |
| 30% 31C P8 17W / 125W | 1MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
==========
== CUDA ==
==========
CUDA Version 12.3.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
$ docker run --rm -it --gpus all nvidia/cuda:12.3.1-runtime-ubuntu22.04 nvidia-smi
Wed Jun 19 12:20:49 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Quadro RTX 4000 On | 00000000:05:00.0 Off | N/A |
| 30% 31C P8 17W / 125W | 1MiB / 8192MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
$ mkdir -pv ~/.kube
$ curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --disable=traefik,servicelb" sh -
$ sudo cp -v /etc/rancher/k3s/k3s.yaml ~/.kube/config
$ sudo chown "${USER}":"$(id -gn)" ~/.kube/config
$ sudo chmod og-r ~/.kube/config
$ kubectl get svc -A
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
$ helm repo update
$ helm install --wait --generate-name --create-namespace -n gpu-operator nvidia/gpu-operator --set driver.enabled=false --set toolkit.enabled=false
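Before applying the test pod below, a couple of sanity checks I'd run (the namespace matches the install command above; the grep is just illustrative):
$ kubectl get pods -n gpu-operator
$ kubectl describe node | grep -i 'nvidia.com/gpu'   # the GPU should show up as allocatable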
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-sample-vectoradd
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
    - name: cuda-sample-vectoradd
      image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
$ echo "Expecting:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done"
$ echo "Actual:
$(kubectl logs cuda-sample-vectoradd)"
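Once the actual output matches the expected output, the test pod can be removed (optional cleanup, not part of the original steps):
$ kubectl delete pod cuda-sample-vectoradd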
The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
- Are i2c_core and ipmi_msghandler loaded on the nodes?
- Did you apply the CRD (kubectl describe clusterpolicies --all-namespaces)?
1. Issue or feature description
The containerd support is present, but since k3s is templating it, it looks like it's not starting up. I have created a tmpl according to the k3s documentation for containerd and pointed the sock to the valid path.
I do get:
MountVolume.SetUp failed for volume "nvidia-operator-validator-token-9dbxf" : failed to sync secret cache: timed out waiting for the condition
from the operator validator, and I can't see the present devices in /dev; the rest of the pods are running OK.
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
- ls -la /run/nvidia
- ls -la /usr/local/nvidia/toolkit
- ls -la /run/nvidia/driver
lsfiles.txt