NVIDIA / k8s-device-plugin

NVIDIA device plugin for Kubernetes

Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod. #348

Open somethingwentwell opened 1 year ago

somethingwentwell commented 1 year ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

1. Issue or feature description

 kubectl describe po gpu-pod
Name:             gpu-pod
Namespace:        default
Priority:         0
Service Account:  default
Node:             <none>
Labels:           <none>
Annotations:      <none>
Status:           Pending
IP:
IPs:              <none>
Containers:
  cuda-container:
    Image:      nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
    Port:       <none>
    Host Port:  <none>
    Limits:
      nvidia.com/gpu:  1
    Requests:
      nvidia.com/gpu:  1
    Environment:       <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9cp5g (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  kube-api-access-9cp5g:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
                             nvidia.com/gpu:NoSchedule op=Exists
Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  27s   default-scheduler  0/1 nodes are available: 1 Insufficient nvidia.com/gpu. preemption: 0/1 nodes are available: 1 No preemption victims found for incoming pod.
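This event means the scheduler sees no allocatable nvidia.com/gpu on the node, which usually points at the device plugin not registering the resource rather than at a preemption problem. A quick diagnostic sketch (the escaped-dot custom-columns expression is one common way to surface extended resources):

    kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"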

2. Steps to reproduce the issue

The VM runs Ubuntu 20.04.

  1. Install the 470.141.03 NVIDIA driver. Output of nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GRID A100D-20C      On   | 00000000:06:00.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |   1589MiB / 20475MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


2. Deploy a single-node Kubernetes cluster using kubespray.
Kubernetes version:

Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-11T02:46:24Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.4", GitCommit:"872a965c6c6526caa949f0c6ac028ef7aff3fb78", GitTreeState:"clean", BuildDate:"2022-11-09T13:29:58Z", GoVersion:"go1.19.3", Compiler:"gc", Platform:"linux/amd64"}


3. Install the NVIDIA Container Toolkit

nvidia-container-toolkit --version
NVIDIA Container Runtime Hook version 1.11.0
commit: d9de4a0
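As a quick sanity check (paths and commands are the toolkit defaults), confirm the runtime wrapper and library CLI that containerd will rely on are installed:

    which nvidia-container-runtime
    nvidia-container-cli -V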


4. Edit containerd config

version = 2
root = "/var/lib/containerd"
state = "/run/containerd"
oom_score = 0

[grpc]
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[debug]
  level = "info"

[metrics]
  address = ""
  grpc_histogram = false

[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    sandbox_image = "registry.k8s.io/pause:3.7"
    max_container_log_line_size = -1
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"
      snapshotter = "overlayfs"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
          runtime_type = "io.containerd.runc.v2"
          runtime_engine = ""
          runtime_root = ""
          base_runtime_spec = "/etc/containerd/cri-base.json"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
            systemdCgroup = true
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
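After writing the config, containerd must be restarted before the nvidia runtime takes effect. A minimal check, assuming a systemd host:

    sudo systemctl restart containerd
    # the merged configuration should show nvidia as the default runtime
    sudo containerd config dump | grep default_runtime_name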

5.  Enabling GPU Support in Kubernetes

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.12.3/nvidia-device-plugin.yml
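A short verification sketch for this step (the daemonset name matches the plugin pod shown in the node description later in this thread):

    kubectl -n kube-system get daemonset nvidia-device-plugin-daemonset
    # once the plugin has registered, the node should list nvidia.com/gpu
    kubectl describe node server1 | grep -i "nvidia.com/gpu"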


6. Running GPU Jobs

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1
  tolerations:
    - key: nvidia.com/gpu
      operator: Exists
      effect: NoSchedule
EOF
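If scheduling succeeds, the vectoradd sample runs to completion and prints "Test PASSED"; a quick check:

    kubectl get pod gpu-pod
    kubectl logs gpu-pod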

3. Information to attach (optional if deemed irrelevant)

  1. containerd version

    containerd --version
    containerd github.com/containerd/containerd v1.6.10 770bd0108c32f3fb5c73ae1264f7e503fe7b2661
  2. KVM config

    
    <domain type='kvm'>
    <name>test-vm1</name>
    <uuid>695b8bef-a78a-443a-950c-66a055df670a</uuid>
    <metadata>
    <libosinfo:libosinfo xmlns:libosinfo="http://libosinfo.org/xmlns/libvirt/doma>
      <libosinfo:os id="http://ubuntu.com/ubuntu/20.04"/>
    </libosinfo:libosinfo>
    </metadata>
    <memory unit='KiB'>4194304</memory>
    <currentMemory unit='KiB'>4194304</currentMemory>
    <vcpu placement='static'>4</vcpu>
    <os>
    <type arch='x86_64' machine='pc-q35-4.2'>hvm</type>
    <boot dev='hd'/>
    </os>
    <features>
    <acpi/>
    <apic/>
    </features>
    <cpu mode='host-model' check='partial'/>
    <clock offset='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
    </clock>
    <on_poweroff>destroy</on_poweroff>
    <on_reboot>restart</on_reboot>
    <on_crash>destroy</on_crash>
    <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
    </pm>
    <devices>
    <emulator>/usr/bin/qemu-system-x86_64</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='raw'/>
      <source file='/var/lib/libvirt/images/test-disk1.qcow2'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='ich9-ehci1'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1d' function='0x7'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci1'>
      <master startport='0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1d' function='0x0' m>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci2'>
      <master startport='2'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1d' function='0x1'/>
    </controller>
    <controller type='usb' index='0' model='ich9-uhci3'>
      <master startport='4'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1d' function='0x2'/>
    </controller>
    <controller type='sata' index='0'>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x1f' function='0x2'/>
    </controller>
    <controller type='pci' index='0' model='pcie-root'/>
    <controller type='pci' index='1' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='1' port='0x8'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0' m>
    </controller>
    <controller type='pci' index='2' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='2' port='0x9'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x1'/>
    </controller>
    <controller type='pci' index='3' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='3' port='0xa'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
    </controller>
    <controller type='pci' index='4' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='4' port='0xb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x3'/>
    </controller>
    <controller type='pci' index='5' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='5' port='0xc'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x4'/>
    </controller>
    <controller type='pci' index='6' model='pcie-root-port'>
      <model name='pcie-root-port'/>
      <target chassis='6' port='0xd'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x5'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <address type='pci' domain='0x0000' bus='0x02' slot='0x00' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:6e:b1:69'/>
      <source bridge='virbr0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </interface>
    <serial type='pty'>
      <target type='isa-serial' port='0'>
        <model name='isa-serial'/>
      </target>
    </serial>
    <console type='pty'>
      <target type='serial' port='0'/>
    </console>
    <channel type='unix'>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci' display='>
      <source>
        <address uuid='b06ebd67-f9eb-4ab3-b62d-f5f3762b9011'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x06' slot='0x00' function='0x0'/>
    </hostdev>
    <memballoon model='virtio'>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </memballoon>
    <rng model='virtio'>
      <backend model='random'>/dev/urandom</backend>
      <address type='pci' domain='0x0000' bus='0x05' slot='0x00' function='0x0'/>
    </rng>
    </devices>
    </domain>


Common error checking:
 - [ ] The output of `nvidia-smi -a` on your host
 - [ ] Your docker configuration file (e.g: `/etc/docker/daemon.json`)
 - [ ] The k8s-device-plugin container logs
 - [ ] The kubelet logs on the node (e.g: `sudo journalctl -r -u kubelet`)

Additional information that might help better understand your environment and reproduce the bug:
 - [ ] Docker version from `docker version`
 - [ ] Docker command, image and tag used
 - [ ] Kernel version from `uname -a`
 - [ ] Any relevant kernel output lines from `dmesg`
 - [ ] NVIDIA packages version from `dpkg -l '*nvidia*'` _or_ `rpm -qa '*nvidia*'`
 - [ ] NVIDIA container library version from `nvidia-container-cli -V`
 - [ ] NVIDIA container library logs (see [troubleshooting](https://github.com/NVIDIA/nvidia-docker/wiki/Troubleshooting))
klueska commented 1 year ago

What do the plugin logs look like, and what resources does your node say it has under Capacity and Allocatable when running kubectl get node?
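For reference, the requested information can be gathered with something like the following (the plugin pod name is taken from the node description below):

    kubectl -n kube-system logs nvidia-device-plugin-daemonset-7989w
    kubectl get node server1 -o jsonpath='{.status.capacity}{"\n"}{.status.allocatable}{"\n"}'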

somethingwentwell commented 1 year ago

Here is the output of kubectl describe node

kubectl describe node server1
Name:               server1
Roles:              control-plane
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=server1
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/control-plane=
                    node.kubernetes.io/exclude-from-external-load-balancers=
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
                    node.alpha.kubernetes.io/ttl: 0
                    projectcalico.org/IPv4Address: 192.168.122.148/24
                    projectcalico.org/IPv4VXLANTunnelAddr: 10.233.79.64
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Mon, 28 Nov 2022 09:16:27 +0000
Taints:             <none>
Unschedulable:      false
Lease:
  HolderIdentity:  server1
  AcquireTime:     <unset>
  RenewTime:       Mon, 28 Nov 2022 10:22:21 +0000
Conditions:
  Type                 Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----                 ------  -----------------                 ------------------                ------                       -------
  NetworkUnavailable   False   Mon, 28 Nov 2022 09:17:19 +0000   Mon, 28 Nov 2022 09:17:19 +0000   CalicoIsUp                   Calico is running on this node
  MemoryPressure       False   Mon, 28 Nov 2022 10:22:16 +0000   Mon, 28 Nov 2022 09:16:26 +0000   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure         False   Mon, 28 Nov 2022 10:22:16 +0000   Mon, 28 Nov 2022 09:16:26 +0000   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure          False   Mon, 28 Nov 2022 10:22:16 +0000   Mon, 28 Nov 2022 09:16:26 +0000   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready                True    Mon, 28 Nov 2022 10:22:16 +0000   Mon, 28 Nov 2022 09:18:05 +0000   KubeletReady                 kubelet is posting ready status. AppArmor enabled
Addresses:
  InternalIP:  192.168.122.148
  Hostname:    server1
Capacity:
  cpu:                4
  ephemeral-storage:  204794888Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             4025584Ki
  pods:               110
Allocatable:
  cpu:                3800m
  ephemeral-storage:  188738968469
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3398896Ki
  pods:               110
System Info:
  Machine ID:                 695b8befa78a443a950c66a055df670a
  System UUID:                695b8bef-a78a-443a-950c-66a055df670a
  Boot ID:                    e70f1479-1827-4387-b052-7e9a1a0d7211
  Kernel Version:             5.4.0-132-generic
  OS Image:                   Ubuntu 20.04.5 LTS
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  containerd://1.6.10
  Kubelet Version:            v1.25.4
  Kube-Proxy Version:         v1.25.4
PodCIDR:                      10.233.64.0/24
PodCIDRs:                     10.233.64.0/24
Non-terminated Pods:          (14 in total)
  Namespace                   Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                        ------------  ----------  ---------------  -------------  ---
  cert-manager                cert-manager-55b8b5b94f-bxxbw               0 (0%)        0 (0%)      0 (0%)           0 (0%)         65m
  cert-manager                cert-manager-cainjector-655669b754-dd7qr    0 (0%)        0 (0%)      0 (0%)           0 (0%)         65m
  cert-manager                cert-manager-webhook-77d689b6df-xq25h       0 (0%)        0 (0%)      0 (0%)           0 (0%)         65m
  kube-system                 calico-kube-controllers-d6484b75c-b2d6v     30m (0%)      1 (26%)     64M (1%)         256M (7%)      65m
  kube-system                 calico-node-ndjpn                           150m (3%)     300m (7%)   64M (1%)         500M (14%)     65m
  kube-system                 coredns-588bb58b94-bhs45                    100m (2%)     0 (0%)      70Mi (2%)        300Mi (9%)     64m
  kube-system                 dns-autoscaler-d8bd87bcc-65cdd              20m (0%)      0 (0%)      10Mi (0%)        0 (0%)         64m
  kube-system                 kube-apiserver-server1                      250m (6%)     0 (0%)      0 (0%)           0 (0%)         65m
  kube-system                 kube-controller-manager-server1             200m (5%)     0 (0%)      0 (0%)           0 (0%)         65m
  kube-system                 kube-proxy-d8std                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         65m
  kube-system                 kube-scheduler-server1                      100m (2%)     0 (0%)      0 (0%)           0 (0%)         65m
  kube-system                 local-volume-provisioner-8q86f              0 (0%)        0 (0%)      0 (0%)           0 (0%)         64m
  kube-system                 nodelocaldns-xlx8j                          100m (2%)     0 (0%)      70Mi (2%)        200Mi (6%)     64m
  kube-system                 nvidia-device-plugin-daemonset-7989w        0 (0%)        0 (0%)      0 (0%)           0 (0%)         54m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests        Limits
  --------           --------        ------
  cpu                950m (25%)      1300m (34%)
  memory             285286400 (8%)  1280288k (36%)
  ephemeral-storage  0 (0%)          0 (0%)
  hugepages-1Gi      0 (0%)          0 (0%)
  hugepages-2Mi      0 (0%)          0 (0%)
Events:              <none>
somethingwentwell commented 1 year ago

Any update?

klueska commented 1 year ago

It seems that the plugin is not advertising any GPUs. Can you post the logs of the plugin?

Todoroki02 commented 1 year ago

Hi there! Was this error solved? I am facing the same error and have not been able to resolve it. It would be a huge help if you could help me out here.

github-actions[bot] commented 8 months ago

This issue is stale because it has been open 90 days with no activity. This issue will be closed in 30 days unless new comments are made or the stale label is removed.

xlcbingo1999 commented 8 months ago

Was the error solved?

elezar commented 8 months ago

The plugin logs requested in https://github.com/NVIDIA/k8s-device-plugin/issues/348#issuecomment-1369699003 were never supplied. @xlcbingo1999, if you are seeing similar behaviour, please provide a description of your setup as well as the plugin logs.

Kkkassini commented 4 months ago

I0619 14:39:57.345606       1 main.go:178] Starting FS watcher.
I0619 14:39:57.345911       1 main.go:185] Starting OS watcher.
I0619 14:39:57.346248       1 main.go:200] Starting Plugins.
I0619 14:39:57.346272       1 main.go:257] Loading configuration.
I0619 14:39:57.346836       1 main.go:265] Updating config with default resource matching patterns.
I0619 14:39:57.347470       1 main.go:276] Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0619 14:39:57.347485       1 main.go:279] Retrieving plugins.
W0619 14:39:57.347555       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0619 14:39:57.347606       1 factory.go:104] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0619 14:39:57.347638       1 factory.go:104] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0619 14:39:57.347646       1 factory.go:112] Incompatible platform detected
E0619 14:39:57.347650       1 factory.go:113] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0619 14:39:57.347654       1 factory.go:114] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0619 14:39:57.347659       1 factory.go:115] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0619 14:39:57.347663       1 factory.go:116] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0619 14:39:57.347670       1 main.go:308] No devices found. Waiting indefinitely.
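The "could not load NVML library" lines above are the usual signature of the plugin container being started with the default runc runtime instead of the NVIDIA runtime, so no driver libraries get injected into it. A minimal remediation sketch for a containerd host (assumes the NVIDIA Container Toolkit, which provides nvidia-ctk, is installed, and reuses the daemonset name seen earlier in this thread):

    # point containerd at the nvidia runtime and make it the default
    sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
    sudo systemctl restart containerd
    # recreate the plugin pods so they start under the new runtime
    kubectl -n kube-system rollout restart daemonset nvidia-device-plugin-daemonset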