NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

Containerd K3S operator #238

Open wstarzak opened 2 years ago

wstarzak commented 2 years ago


1. Issue or feature description

Containerd support is present, but since k3s templates the containerd config, it looks like it's not starting up. I created a config.toml.tmpl according to the k3s documentation for containerd and pointed the socket to the valid path:

    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"

The config.toml.tmpl:
[plugins.cri]
  enable_selinux = false
  sandbox_image = "docker.io/rancher/pause:3.1"
  stream_server_address = "127.0.0.1"
  stream_server_port = "10010"

  [plugins.cri.cni]
    bin_dir = "{{ .NodeConfig.AgentConfig.CNIBinDir }}"
    conf_dir = "{{ .NodeConfig.AgentConfig.CNIConfDir }}"

  [plugins.cri.containerd]
    disable_snapshot_annotations = true
    snapshotter = "overlayfs"

    [plugins.cri.containerd.runtimes]

      [plugins.cri.containerd.runtimes.nvidia]
        privileged_without_host_devices = false
        runtime_engine = ""
        runtime_root = ""
        runtime_type = "io.containerd.runc.v2"

      [plugins.cri.containerd.runtimes.nvidia.options]
        BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

      [plugins.cri.containerd.runtimes.runc]
        runtime_type = "io.containerd.runc.v2"

[plugins.opt]
  path = "/var/lib/rancher/k3s/agent/containerd"

From the operator validator I get: MountVolume.SetUp failed for volume "nvidia-operator-validator-token-9dbxf" : failed to sync secret cache: timed out waiting for the condition, and I can't see the devices in /dev. The rest of the pods are running OK.

2. Steps to reproduce the issue

  1. Install fresh k3s with containerd
  2. Install the GPU operator with the values provided above.

3. Information to attach (optional if deemed irrelevant)

lsfiles.txt

shivamerla commented 2 years ago

@wstarzak can you attach the output of kubectl describe pod <operator-validator-pod-name> -n gpu-operator-resources and also the logs from the failing initContainer: kubectl logs <operator-validator-pod-name> -c <init-container-name> -n gpu-operator-resources. Are all other pods running fine? Can you paste the output of kubectl get pods -n gpu-operator-resources?
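
For reference, the requested commands can be run as follows (the pod and init-container names are placeholders to fill in):

kubectl get pods -n gpu-operator-resources
kubectl describe pod <operator-validator-pod-name> -n gpu-operator-resources
kubectl logs <operator-validator-pod-name> -c <init-container-name> -n gpu-operator-resources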

maxirus commented 2 years ago

I was "sorta" able to get this to work...

Environment:

OS: Ubuntu 20.04.3 LTS (5.4.0-91-generic)
Kubernetes: K3s v1.22.3+k3s1
Containerd: 1.5.7-k3s2
GPU: P620

Steps taken

1. Verify NVIDIA device

After a lot of frustration, I discovered the GPU wasn't listed in lspci. As a first step, I would suggest making sure the node shows your NVIDIA card by running lspci | grep NVIDIA.

NOTE: In my particular case, using a Lenovo P340 Tiny, I had to have the VGA Mode set to Auto in the BIOS.
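
As a quick sanity check (the exact device string and bus ID will differ per machine; the example line below is only illustrative):

lspci | grep -i nvidia
# expect something like:
# 02:00.0 VGA compatible controller: NVIDIA Corporation GP107GL [Quadro P620]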

2. Disable Nouveau

Disable Nouveau on the GPU node. Follow the docs here.

NOTE: I didn't see this called out in the Operator docs anymore, so this may no longer be needed.
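
For reference, the usual way to disable Nouveau on Ubuntu (a sketch of the standard procedure; adjust the file name as needed) is to blacklist the module and rebuild the initramfs:

cat <<EOF | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
EOF
sudo update-initramfs -u
sudo reboot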

3. Install Drivers

I could not get the driver-install functionality of the Helm Chart/Operator to work. I suspect it only supports the Enterprise/Datacenter GPUs. It could probably be customized to install the appropriate drivers for my GPU, but for me it was easier to simply install the drivers manually. To do this, run:

apt-get update && \
apt-get install nvidia-headless-470-server nvidia-utils-470-server

NOTE: 470 was the latest as of this writing. Check for the latest by running apt search nvidia-driver.
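
Once the packages are installed (a reboot may be needed for the kernel module to load), the driver can be sanity-checked on the host:

nvidia-smi
lsmod | grep nvidia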

4. Install Operator

Install the Operator via the Helm Chart. You can mostly follow the directions here, but use the following Helm values:

dcgm:
  enabled: false
migManager:
  enabled: false
driver:
  enabled: false
toolkit:
  enabled: true
  env:
    - name: CONTAINERD_CONFIG
      value: /var/lib/rancher/k3s/agent/etc/containerd/config.toml
    - name: CONTAINERD_SOCKET
      value: /run/k3s/containerd/containerd.sock
    - name: CONTAINERD_RUNTIME_CLASS
      value: nvidia
    - name: CONTAINERD_SET_AS_DEFAULT
      value: "true"

I needed to disable the Datacenter features (dcgm and migManager) as I am not using a DC GPU. You need to override the default containerd values because k3s installs these in a non-default location. After doing this, I noticed my /var/lib/rancher/k3s/agent/etc/containerd/config.toml was updated with:

[plugins.cri.containerd.runtimes."nvidia"]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"

[plugins.cri.containerd.runtimes."nvidia-experimental"]
  runtime_type = "io.containerd.runc.v2"
[plugins.cri.containerd.runtimes."nvidia-experimental".options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime-experimental"

This is a bit different from what the docs say, and it is what got me hung up.

5. Test

After a bit of time, the Pods eventually all go green. Now here's where the "sorta" comes in... I tried running the test Pod described here, but I got the error: Failed to allocate device vector A (error code CUDA driver version is insufficient for CUDA runtime version)! (that error generally means the CUDA runtime inside the test image is newer than the installed driver supports).

I then tried running the below Pod:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
    - name: gpu
      image: "nvidia/cuda:11.4.1-base-ubuntu20.04"
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
      resources:
        limits:
          nvidia.com/gpu: 1

I then exec'd into the Pod, installed nvidia-utils (apt install nvidia-utils-470-server), and ran nvidia-smi, which gave me the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.82.01    Driver Version: 470.82.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P620         Off  | 00000000:02:00.0 Off |                  N/A |
| 34%   35C    P8    N/A /  N/A |      0MiB /  2000MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Since this command runs inside the container and shows the NVIDIA GPU, I believe this means it's working.
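
For completeness, the apply/exec steps above amount to roughly the following (gpu-test.yaml is just a placeholder name for the manifest above):

kubectl apply -f gpu-test.yaml
kubectl exec -it gpu-test -- /bin/bash
# inside the container:
apt update && apt install -y nvidia-utils-470-server
nvidia-smi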

maxirus commented 2 years ago

UPDATE:

The above does work, but it should be noted that the NVIDIA Operator hijacks /var/lib/rancher/k3s/agent/etc/containerd/config.toml and doesn't allow changes.

mbana commented 2 weeks ago

I hope this helps.

This works for me:

$ docker run --rm -it --gpus all nvidia/cuda:12.2.2-base-ubuntu22.04 nvidia-smi

Wed Jun 19 12:20:48 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro RTX 4000                On  |   00000000:05:00.0 Off |                  N/A |
| 30%   31C    P8             17W /  125W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

==========
== CUDA ==
==========

CUDA Version 12.3.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
$ docker run --rm -it --gpus all nvidia/cuda:12.3.1-runtime-ubuntu22.04 nvidia-smi

Wed Jun 19 12:20:49 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Quadro RTX 4000                On  |   00000000:05:00.0 Off |                  N/A |
| 30%   31C    P8             17W /  125W |       1MiB /   8192MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

$ mkdir -pv ~/.kube
$ curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="server --disable=traefik,servicelb" sh -
$ sudo cp -v /etc/rancher/k3s/k3s.yaml ~/.kube/config
$ sudo chown "${USER}":"$(id -gn)" ~/.kube/config
$ sudo chmod og-r ~/.kube/config
$ kubectl get svc -A
$ helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
$ helm repo update
$ helm install --wait --generate-name --create-namespace -n gpu-operator nvidia/gpu-operator --set driver.enabled=false --set toolkit.enabled=false
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: cuda-sample-vectoradd
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia
  containers:
  - name: cuda-sample-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
$ echo "Expecting:
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done"
$ echo "Actual:
$(kubectl logs cuda-sample-vectoradd)"
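
As an optional follow-up check (the nvidia RuntimeClass is referenced by the test Pod above, and the namespace matches the helm install):

$ kubectl get runtimeclass nvidia
$ kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'
$ kubectl get pods -n gpu-operator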