AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0
1.39k stars 309 forks

After running for a year, creating a new pod fails with: failed bind with extender at URL http://127.0.0.1:32766/gpushare-scheduler/bind, code 500 #229

Open klvchen opened 2 weeks ago

klvchen commented 2 weeks ago

This is a self-hosted Kubernetes cluster, version v1.24.6. I changed the certificates myself so that they are valid for 10 years. The gpushare-device-plugin image is registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-plugin:v2-1.11-aff8a23 and the k8s-gpushare-schd-extender image is registry.cn-hangzhou.aliyuncs.com/acs/k8s-gpushare-schd-extender:1.11-d170d8a

Today, while updating a service, I found that new pods could not be created. The official test example fails with the same error: binding rejected: failed bind with extender at URL http://127.0.0.1:32766/gpushare-scheduler/bind, code 500

# YAML of the test example used
cat test.yaml
apiVersion: apps/v1 
kind: StatefulSet

metadata:
  name: binpack-1
  labels:
    app: binpack-1

spec:
  replicas: 2
  serviceName: "binpack-1"
  podManagementPolicy: "Parallel"
  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: binpack-1

  template: # define the pods specifications
    metadata:
      labels:
        app: binpack-1

    spec:
      containers:
      - name: binpack-1
        image: cheyang/gpu-player:v2
        resources:
          limits:
            # GiB
            aliyun.com/gpu-mem: 1

# Check after the pod failed to start
kubectl describe pod binpack-1-0

I also checked the output of kubectl -n kube-system get pod.

The logs of the gpushare-schd-extender-6cf7d6cdd9-nb4ph pod contain many "Unauthorized" messages. I am not sure whether this is related.

[  warn ] 2024/08/26 09:39:13 gpushare-bind.go:25: Failed to handle pod binpack-1-0 in ns default due to error Unauthorized
[  info ] 2024/08/26 09:39:13 routes.go:137: extenderBindingResult = {"Error":"Unauthorized"}
[ debug ] 2024/08/26 09:39:13 routes.go:162: /gpushare-scheduler/bind response=&{0xc420198780 0xc420395800 0xc42089f400 0x565b70 true false false false 0xc420d72740 {0xc420e0e540 map[Content-Type:[application/json]] false false} map[Content-Type:[application/json]] true 24 -1 500 false false [] 0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0] [0 0 0] 0xc4203699d0 0}
E0826 09:39:14.488506       1 reflector.go:205] github.com/AliyunContainerService/gpushare-scheduler-extender/vendor/k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Pod: Unauthorized
E0826 09:39:14.489290       1 reflector.go:205] github.com/AliyunContainerService/gpushare-scheduler-extender/vendor/k8s.io/client-go/informers/factory.go:130: Failed to list *v1.Node: Unauthorized
[ debug ] 2024/08/26 09:39:14 routes.go:160: /gpushare-scheduler/filter request body = &{0xc420627940 <nil> <nil> false true {0 0} false false false 0x69bfd0}
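
One check that might help narrow this down (this is an assumption, not a confirmed diagnosis for this cluster): the Unauthorized errors after roughly one year of uptime could be consistent with the extender still using a service account token that has since expired. Service account tokens are JWTs, so their `exp` claim can be inspected without any extra libraries. The sketch below decodes the payload without verifying the signature; the helper names `jwt_claims` and `seconds_until_expiry` are hypothetical, and the token path is the standard in-pod mount location, which may differ in a customized setup.

```python
import base64
import json
import os
import time
from typing import Optional

# Standard in-pod mount path for the service account token (assumed here).
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT. The signature is NOT verified."""
    parts = token.strip().split(".")
    if len(parts) != 3:
        raise ValueError("expected a three-part JWT")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def seconds_until_expiry(token: str, now: Optional[float] = None) -> Optional[float]:
    """Return seconds until the token's `exp` claim, or None if it has no `exp`.

    A negative result means the token has already expired.
    """
    exp = jwt_claims(token).get("exp")
    if exp is None:
        return None
    if now is None:
        now = time.time()
    return exp - now


if __name__ == "__main__":
    if os.path.exists(TOKEN_PATH):
        with open(TOKEN_PATH) as f:
            remaining = seconds_until_expiry(f.read())
        if remaining is None:
            print("token has no exp claim")
        elif remaining < 0:
            print(f"token expired {-remaining:.0f} seconds ago")
        else:
            print(f"token valid for another {remaining:.0f} seconds")
    else:
        print(f"no token found at {TOKEN_PATH}")
```

This would be run inside the extender pod, or against a copy of the token file. Note that the token file on disk may be rotated independently of the running process, so a still-valid file alongside the Unauthorized errors above would not rule out the process holding a stale copy in memory.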

How can I resolve this problem? Thanks~