Closed Jacoobr closed 2 years ago
Hi @cheyang, when I installed gpushare-scheduler-extender following the install guide, the 'gpushare-device-plugin-ds-r8k2k' pod stays in Pending status, as shown below:
#kubectl get pods -n kube-system
NAME                                                                         READY   STATUS    RESTARTS        AGE
coredns-7f6cbbb7b8-r8ptt                                                     1/1     Running   1 (6m39s ago)   4h28m
coredns-7f6cbbb7b8-v969w                                                     1/1     Running   1 (6m39s ago)   4h28m
etcd-ia-x9drg-qf-invalid-entry-length-16-fixed-up-to-11                      0/1     Running   7 (6m44s ago)   4h28m
gpushare-device-plugin-ds-r8k2k                                              0/1     Pending   0               24m
gpushare-schd-extender-569b9c94ff-2gkr8                                      1/1     Running   1 (6m34s ago)   32m
kube-apiserver-ia-x9drg-qf-invalid-entry-length-16-fixed-up-to-11            0/1     Running   6 (6m34s ago)   4h28m
kube-controller-manager-ia-x9drg-qf-invalid-entry-length-16-fixed-up-to-11   0/1     Running   6 (6m44s ago)   4h15m
kube-flannel-ds-gfvtf                                                        1/1     Running   4 (6m44s ago)   3h7m
kube-proxy-hslvv                                                             1/1     Running   3 (6m44s ago)   4h28m
kube-scheduler-ia-x9drg-qf-invalid-entry-length-16-fixed-up-to-11            1/1     Running   2 (6m30s ago)   27m
Here is my GPU info:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla K40c          Off  | 00000000:02:00.0 Off |                    0 |
| 23%   44C    P0    64W / 235W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K40c          Off  | 00000000:03:00.0 Off |                    0 |
| 25%   50C    P0    69W / 235W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K40c          Off  | 00000000:84:00.0 Off |                    0 |
| 26%   52C    P0    70W / 235W |      0MiB / 11441MiB |     56%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
The describe output for the gpushare-device-plugin-ds-r8k2k pod:
#kubectl describe pod gpushare-device-plugin-ds-r8k2k -n kube-system
Name:                 gpushare-device-plugin-ds-r8k2k
Namespace:            kube-system
Priority:             0
Node:                 <none>
Labels:               app=gpushare
                      component=gpushare-device-plugin
                      controller-revision-hash=5db8d7589b
                      name=gpushare-device-plugin-ds
                      pod-template-generation=1
Annotations:          scheduler.alpha.kubernetes.io/critical-pod:
Status:               Pending
IP:
IPs:                  <none>
Controlled By:        DaemonSet/gpushare-device-plugin-ds
Containers:
  gpushare:
    Image:      registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin:v2-1.11-aff8a23
    Port:       <none>
    Host Port:  <none>
    Command:
      gpushare-device-plugin-v2
      -logtostderr
      --v=5
      --memory-unit=GiB
    Limits:
      cpu:     4
      memory:  300Mi
    Requests:
      cpu:     4
      memory:  300Mi
    Environment:
      KUBECONFIG:  /etc/kubernetes/kubelet.conf
      NODE_NAME:    (v1:spec.nodeName)
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2j6dj (ro)
Volumes:
  device-plugin:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:
  kube-api-access-2j6dj:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Guaranteed
Node-Selectors:              gpushare=true
Tolerations:                 node.kubernetes.io/disk-pressure:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/network-unavailable:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists
                             node.kubernetes.io/pid-pressure:NoSchedule op=Exists
                             node.kubernetes.io/unreachable:NoExecute op=Exists
                             node.kubernetes.io/unschedulable:NoSchedule op=Exists
Events:                      <none>
root@ia-X9DRG-QF-Invalid-entry-length-16-Fixed-up-to-11:/usr/bin# kubectl describe pod gpushare-device-plugin-ds-r8k2k
Error from server (NotFound): pods "gpushare-device-plugin-ds-r8k2k" not found
The daemon.json contents:
cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "iptables": false,
    "exec-opts": ["native.cgroupdriver=systemd"],
    "insecure-registries": ["registry.cn-hangzhou.aliyuncs.com"],
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "registry-mirrors": [
        "https://registry.docker-cn.com",
        "http://hub-mirror.c.163.com",
        "https://docker.mirrors.ustc.edu.cn"
    ]
}
Is anything wrong with my settings?
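One part of the settings can be checked mechanically. The sketch below is only an illustration, not part of gpushare: it parses the daemon.json posted above and checks that `default-runtime` is `nvidia` and that a matching entry with a `path` exists under `runtimes`. The function name `check_nvidia_runtime` and the problem messages are made up for this example. It looks only at the Docker side and cannot explain why a pod with `Node: <none>` and `Events: <none>` has not been scheduled.

```python
import json

# daemon.json contents as posted in this thread.
DAEMON_JSON = """
{
    "default-runtime": "nvidia",
    "iptables": false,
    "exec-opts": ["native.cgroupdriver=systemd"],
    "insecure-registries": ["registry.cn-hangzhou.aliyuncs.com"],
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "registry-mirrors": [
        "https://registry.docker-cn.com",
        "http://hub-mirror.c.163.com",
        "https://docker.mirrors.ustc.edu.cn"
    ]
}
"""


def check_nvidia_runtime(config_text):
    """Hypothetical helper: return a list of problems; empty means the checks passed."""
    config = json.loads(config_text)
    problems = []

    # The default runtime should be the NVIDIA one so GPU containers start with it.
    if config.get("default-runtime") != "nvidia":
        problems.append("default-runtime is not 'nvidia'")

    # The runtime named as default must also be declared under "runtimes".
    nvidia = config.get("runtimes", {}).get("nvidia")
    if nvidia is None:
        problems.append("no 'nvidia' entry under runtimes")
    elif not nvidia.get("path"):
        problems.append("nvidia runtime has no path")

    return problems


if __name__ == "__main__":
    found = check_nvidia_runtime(DAEMON_JSON)
    print(found if found else "nvidia runtime settings look consistent")
```

For the file above the helper returns an empty list, so on this narrow point the Docker configuration looks consistent and the Pending status probably has to be investigated on the scheduling side instead.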
Hello, how did you solve this problem?