NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes

Why there is no GPU resource allocatable on a GPU cloud instance #834

Open shizhouhu opened 2 months ago

shizhouhu commented 2 months ago

When I run kubectl describe node, no GPU resource is listed under Capacity or Allocatable. Why?

Capacity:
  cpu:                48
  ephemeral-storage:  574137520Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263603720Ki
  pods:               110
Allocatable:
  cpu:                48
  ephemeral-storage:  529125137556
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             263501320Ki
  pods:               110

(this is the node description)
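For comparison, a node whose device plugin has registered successfully lists an nvidia.com/gpu entry under both Capacity and Allocatable. A minimal sketch of a check for that entry follows; the function name node_advertises_gpu is hypothetical, and the node name in the usage comment is a placeholder rather than something taken from this thread:

```shell
# Hypothetical helper: reads `kubectl describe node` output on stdin and
# succeeds only if an nvidia.com/gpu entry with a non-zero count is present.
node_advertises_gpu() {
  grep -Eq '^[[:space:]]*nvidia\.com/gpu:[[:space:]]*[1-9][0-9]*[[:space:]]*$'
}

# Usage (replace <node-name> with the real node name):
#   kubectl describe node <node-name> | node_advertises_gpu && echo "GPU advertised"
```

If the check fails even though the plugin pod is Running, the plugin's own logs are usually the next place to look.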

  1. I have installed the NVIDIA driver:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P4                       Off | 00000000:86:00.0 Off |                    0 |
| N/A   28C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P4                       Off | 00000000:87:00.0 Off |                    0 |
| N/A   29C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla P4                       Off | 00000000:AF:00.0 Off |                    0 |
| N/A   32C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla P4                       Off | 00000000:D8:00.0 Off |                    0 |
| N/A   31C    P8               6W /  75W |      4MiB /  7680MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

(this is the nvidia-smi output for the Tesla P4 GPUs)

  2. I have installed the NVIDIA Container Toolkit and configured containerd to use the nvidia runtime:
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"

    (this is the containerd config for nvidia container runtime)

  3. I have installed the NVIDIA Kubernetes device plugin (nvidia-device-plugin):

NAMESPACE      NAME                                      READY   STATUS    RESTARTS      AGE
kube-flannel   kube-flannel-ds-x2pzs                     1/1     Running   2 (16h ago)   7d18h
kube-system    coredns-66f779496c-2k9mg                  1/1     Running   2 (16h ago)   7d18h
kube-system    coredns-66f779496c-nr6tz                  1/1     Running   2 (16h ago)   7d18h
kube-system    etcd-ubuntu-2288h-v5                      1/1     Running   3 (16h ago)   7d18h
kube-system    kube-apiserver-ubuntu-2288h-v5            1/1     Running   3 (16h ago)   7d18h
kube-system    kube-controller-manager-ubuntu-2288h-v5   1/1     Running   3 (16h ago)   7d18h
kube-system    kube-proxy-p6gk9                          1/1     Running   2 (16h ago)   7d18h
kube-system    kube-scheduler-ubuntu-2288h-v5            1/1     Running   3 (16h ago)   7d18h
kube-system    metrics-server-6875467c8d-k6sd6           1/1     Running   2 (16h ago)   2d15h
kube-system    nvidia-device-plugin-daemonset-57kxg      1/1     Running   0             10h

(this is the pod list, showing the NVIDIA device plugin pod running)

Does anyone know what the problem is? Thanks.

jaffe-fly commented 2 months ago

Having the same problem

jaffe-fly commented 1 month ago

You need to install GFD (GPU Feature Discovery) or label your node.

Bugaoxingxx commented 3 weeks ago

Add the --set-as-default parameter when generating the containerd config:

nvidia-ctk runtime configure --runtime=containerd --set-as-default
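As a hedged follow-up to the command above: with --set-as-default, nvidia-ctk is expected to write default_runtime_name = "nvidia" into the containerd config, and containerd has to be restarted before the change takes effect. A minimal sketch for checking the result follows; the function name has_nvidia_default_runtime is hypothetical, and /etc/containerd/config.toml is an assumed default location that may differ on your system:

```shell
# Hypothetical helper: succeeds if the given containerd config file sets
# nvidia as the default runtime. The default path below is an assumption.
has_nvidia_default_runtime() {
  config_file="${1:-/etc/containerd/config.toml}"
  # Bail out with a distinct status if the file cannot be read.
  [ -r "$config_file" ] || return 2
  grep -Eq '^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"nvidia"' "$config_file"
}

# Usage:
#   has_nvidia_default_runtime && echo "nvidia is the default runtime"
#   sudo systemctl restart containerd   # apply the new config
```

After the restart, the device plugin pod may also need to be recreated so that it starts under the new default runtime.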

shizhouhu commented 5 days ago

> You need to install GFD or label your node.

Thanks, will try.

shizhouhu commented 5 days ago

> Add the --set-as-default parameter when generating the containerd config:
>
> nvidia-ctk runtime configure --runtime=containerd --set-as-default

Thanks.