AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

Insufficient aliyun.com/gpu-mem. #149

Closed 2811299 closed 3 years ago

2811299 commented 3 years ago

I followed the installation document but still have this issue. Does anyone have an idea how to debug it?

k get all -n kube-system
NAME                                                         READY   STATUS         RESTARTS   AGE
pod/coredns-74ff55c5b-ktttv                                  1/1     Running        2          32d
pod/coredns-74ff55c5b-m5krh                                  1/1     Running        2          32d
pod/deployment-bundler-controller-manager-646c5785f5-txsnn   2/2     Running        13         29d
pod/etcd-af-svr1                                             1/1     Running        2          32d
pod/gpushare-device-plugin-ds-g2cb5                          1/1     Running        1          25d
pod/gpushare-schd-extender-569b9c94ff-275kh                  0/1     NodeAffinity   0          32d
pod/gpushare-schd-extender-569b9c94ff-hcjlk                  1/1     Running        1          25d
pod/kube-apiserver-af-svr1                                   1/1     Running        3          32d
pod/kube-controller-manager-af-svr1                          1/1     Running        3          32d
pod/kube-flannel-ds-5kt79                                    1/1     Running        25         169d
pod/kube-flannel-ds-mkpp9                                    1/1     Running        12         124d
pod/kube-proxy-7f28g                                         1/1     Running        0          32d
pod/kube-proxy-cmgf5                                         1/1     Running        2          32d
pod/kube-scheduler-af-svr1                                   1/1     Running        3          32d
pod/kubedb-operator-55cd5c9b7b-lxwbb                         1/1     Running        41         169d
pod/kuboard-59bdf4d5fb-2fwn4                                 1/1     Running        23         141d
pod/metrics-server-56b49c5f5b-5x6vz                          1/1     Running        42         141d

NAME                             TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
service/gpushare-schd-extender   NodePort    10.103.166.89    <none>        12345:32766/TCP          32d
service/kube-dns                 ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP   169d
service/kubedb-operator          ClusterIP   10.102.189.137   <none>        443/TCP                  169d
service/kubelet                  ClusterIP   None             <none>        10250/TCP                130d
service/kuboard                  NodePort    10.105.96.25     <none>        80:32567/TCP             141d
service/metrics-server           ClusterIP   10.99.55.143     <none>        443/TCP                  141d

NAME                                      DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/gpushare-device-plugin-ds  1         1         1       1            1           gpushare=true            32d
daemonset.apps/kube-flannel-ds            2         2         2       2            2           <none>                   169d
daemonset.apps/kube-proxy                 2         2         2       2            2           kubernetes.io/os=linux   169d

NAME                                                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/coredns                                 2/2     2            2           169d
deployment.apps/deployment-bundler-controller-manager   1/1     1            1           32d
deployment.apps/gpushare-schd-extender                  1/1     1            1           32d
deployment.apps/kubedb-operator                         1/1     1            1           169d
deployment.apps/kuboard                                 1/1     1            1           141d
deployment.apps/metrics-server                          1/1     1            1           141d

NAME                                                DESIRED   CURRENT   READY   AGE
replicaset.apps/coredns-66bff467f8                  0         0         0       32d
replicaset.apps/coredns-6955765f44                  0         0         0       169d
replicaset.apps/coredns-74ff55c5b                   2         2         2       32d
replicaset.apps/coredns-f9fd979d6                   0         0         0       32d
replicaset.apps/gpushare-schd-extender-569b9c94ff   1         1         1       32d
replicaset.apps/kubedb-operator-55cd5c9b7b          1         1         1       169d
replicaset.apps/kuboard-59bdf4d5fb                  1         1         1       141d
replicaset.apps/metrics-server-56b49c5f5b           1         1         1       141d

timozerrer commented 3 years ago

Is this still open, @2811299?

cheyang commented 3 years ago

Please run the command kubectl logs -n kube-system gpushare-schd-extender-569b9c94ff-hcjlk and post the logs here.
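
Alongside the extender logs, it may help to check whether any node actually advertises the aliyun.com/gpu-mem resource, since an "Insufficient" scheduling message generally means no node reported enough of it as allocatable. The following is a minimal Python sketch, not part of the project: the helper names gpu_mem_by_node and fetch_nodes are hypothetical, and the sample node names and values are made up for illustration. It assumes kubectl is on the PATH and configured for the cluster.

```python
import json
import subprocess


def gpu_mem_by_node(node_list, resource="aliyun.com/gpu-mem"):
    """Map each node name to its allocatable amount of `resource`.

    `node_list` is the parsed JSON of `kubectl get nodes -o json`.
    A value of None means the node does not advertise the resource at all.
    """
    result = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        result[name] = allocatable.get(resource)
    return result


def fetch_nodes():
    """Run `kubectl get nodes -o json` and return the parsed result.

    Requires cluster access; not called in the offline example below.
    """
    completed = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)


# Offline example with a hand-written node list (hypothetical names and values).
sample = {
    "items": [
        {
            "metadata": {"name": "node-a"},
            "status": {"allocatable": {"cpu": "8", "aliyun.com/gpu-mem": "16"}},
        },
        {
            "metadata": {"name": "node-b"},
            "status": {"allocatable": {"cpu": "4"}},
        },
    ]
}

report = gpu_mem_by_node(sample)
for node_name, amount in report.items():
    print(node_name, amount if amount is not None else "not advertised")
```

On a live cluster, gpu_mem_by_node(fetch_nodes()) would give the same kind of report. A node that shows no value, or 0, would be consistent with the error in the issue title and would point toward the device plugin on that node rather than the extender.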