AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

gpushare-device-plugin STATUS is Pending #161

Open Jacoobr opened 2 years ago

Jacoobr commented 2 years ago

Hi @cheyang, after I installed gpushare-scheduler-extender following the install guide, the 'gpushare-device-plugin-ds-r8k2k' pod stays in Pending status, as shown below:

#kubectl get pods -n kube-system

NAME                                                                         READY   STATUS    RESTARTS      AGE
coredns-7f6cbbb7b8-r8ptt                                                     1/1     Running   1 (34m ago)   4h56m
coredns-7f6cbbb7b8-v969w                                                     1/1     Running   1 (34m ago)   4h56m
etcd-ia-x9drg-qf-invalid-entry-length-16-fixed-up-to-11                      1/1     Running   7 (34m ago)   4h56m
gpushare-device-plugin-ds-r8k2k                                              0/1     Pending   0             52m
gpushare-schd-extender-569b9c94ff-2gkr8                                      1/1     Running   1 (34m ago)   60m
kube-apiserver-ia-x9drg-qf-invalid-entry-length-16-fixed-up-to-11            1/1     Running   6 (34m ago)   4h56m
kube-controller-manager-ia-x9drg-qf-invalid-entry-length-16-fixed-up-to-11   1/1     Running   6 (34m ago)   4h43m
kube-flannel-ds-gfvtf                                                        1/1     Running   4 (34m ago)   3h35m
kube-proxy-hslvv                                                             1/1     Running   3 (34m ago)   4h56m
kube-scheduler-ia-x9drg-qf-invalid-entry-length-16-fixed-up-to-11            1/1     Running   2 (34m ago)   54m

Here is my GPU info:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K40c          Off  | 00000000:02:00.0 Off |                    0 |
| 23%   44C    P0    64W / 235W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 00000000:03:00.0 Off |                    0 |
| 25%   50C    P0    69W / 235W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K40c          Off  | 00000000:84:00.0 Off |                    0 |
| 26%   52C    P0    70W / 235W |      0MiB / 11441MiB |     56%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

The describe output of the gpushare-device-plugin-ds-r8k2k pod:

#kubectl describe pod gpushare-device-plugin-ds-r8k2k -n kube-system

root@ia-X9DRG-QF-Invalid-entry-length-16-Fixed-up-to-11:/usr/bin# kubectl describe pod gpushare-device-plugin-ds-r8k2k
Error from server (NotFound): pods "gpushare-device-plugin-ds-r8k2k" not found
root@ia-X9DRG-QF-Invalid-entry-length-16-Fixed-up-to-11:/usr/bin# kubectl describe pod gpushare-device-plugin-ds-r8k2k -n kube-system
Name:           gpushare-device-plugin-ds-r8k2k
Namespace:      kube-system
Priority:       0
Node:           <none>
Labels:         app=gpushare
                component=gpushare-device-plugin
                controller-revision-hash=5db8d7589b
                name=gpushare-device-plugin-ds
                pod-template-generation=1
Annotations:    scheduler.alpha.kubernetes.io/critical-pod:
Status:         Pending
IP:
IPs:            <none>
Controlled By:  DaemonSet/gpushare-device-plugin-ds
Containers:
  gpushare:
    Image:      registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin:v2-1.11-aff8a23
    Port:       <none>
    Host Port:  <none>
    Command:
      gpushare-device-plugin-v2
      -logtostderr
      --v=5
      --memory-unit=GiB
    Limits:
      cpu:     4
      memory:  300Mi
    Requests:
      cpu:     4
      memory:  300Mi
    Environment:
      KUBECONFIG:  /etc/kubernetes/kubelet.conf
      NODE_NAME:    (v1:spec.nodeName)
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2j6dj (ro)
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  kube-api-access-2j6dj:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              gpushare=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:                      <none>

The contents of my daemon.json:

cat /etc/docker/daemon.json
{
    "default-runtime":"nvidia",
    "iptables": false,
    "exec-opts":["native.cgroupdriver=systemd"],
    "insecure-registries": ["registry.cn-hangzhou.aliyuncs.com"],
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "registry-mirrors": [
        "https://registry.docker-cn.com",
        "http://hub-mirror.c.163.com",
        "https://docker.mirrors.ustc.edu.cn"
    ]
}

Is anything wrong with my settings?

wsxiaozhang commented 2 years ago

It seems your device plugin pod hasn't been scheduled onto any node. Could you please check whether your GPU node has the label "gpushare=true"?
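The label check suggested above could be done roughly as follows. This is a minimal sketch, not an official procedure: the function names `gpushare_nodes` and `label_gpushare_node` are made up for illustration, and the node name must be supplied by you. It relies only on the `gpushare=true` node selector shown in the describe output.

```shell
# List the nodes that carry the label required by the DaemonSet's
# node selector (gpushare=true). Empty output means no node matches.
gpushare_nodes() {
    kubectl get nodes -l gpushare=true -o name
}

# Add the label to a node. The node name is passed as the first
# argument (hypothetical helper; take the name from `kubectl get nodes`).
label_gpushare_node() {
    kubectl label node "$1" gpushare=true
}
```

If `gpushare_nodes` prints nothing, running `label_gpushare_node` with your GPU node's name should let the DaemonSet controller place the pod on that node, provided nothing else is blocking scheduling.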

chenbodeng719 commented 2 years ago

@Jacoobr I have the same issue. Did you solve it? My node is already labeled.
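When the label is already present, the describe output earlier in this thread hints at two other things that may be worth ruling out: the pod requests 4 CPUs, and its tolerations do not cover a control-plane (master) taint. Whether either applies here is an assumption, not a confirmed diagnosis. A rough sketch of how to inspect both, with hypothetical function names:

```shell
# Print each node's taint keys and allocatable CPU, to compare against
# the pod's tolerations and its cpu request of 4.
node_scheduling_summary() {
    kubectl get nodes -o 'custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key,CPU:.status.allocatable.cpu'
}

# Show the DaemonSet's desired/current pod counts and its events,
# since the pod itself reports no events.
describe_plugin_daemonset() {
    kubectl describe daemonset gpushare-device-plugin-ds -n kube-system
}
```

A taint that is not in the pod's toleration list, or an allocatable CPU count below 4, would each be enough to keep the pod unscheduled.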