AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

gpushare scheduler extender bind code 500 #99

Closed zhaogaolong closed 3 years ago

zhaogaolong commented 4 years ago

Environment & versions

- Kubernetes: 1.17
- scheduler extender: k8s-gpushare-schd-extender:1.11-d170d8a

Error log

[ debug ] 2020/04/28 11:20:22 gpushare-predicate.go:17: check if the pod name gpu-demo-gpushare-6cfbbdfb66-szs7m can be scheduled on node g1-med-dev1-100
[ debug ] 2020/04/28 11:20:22 cache.go:155: GetNodeInfo() uses the existing nodeInfo for g1-med-dev1-100
[ debug ] 2020/04/28 11:20:22 nodeinfo.go:282: getAllGPUs: map[0:11019 1:11019] in node g1-med-dev1-100, and dev map[1:0xc422a6ab20 0:0xc422a6ab00]
[ debug ] 2020/04/28 11:20:22 deviceinfo.go:42: GetUsedGPUMemory() podMap map[], and its address is 0xc422a6ab00
[ debug ] 2020/04/28 11:20:22 deviceinfo.go:42: GetUsedGPUMemory() podMap map[], and its address is 0xc422a6ab20
[ debug ] 2020/04/28 11:20:22 nodeinfo.go:272: getUsedGPUs: map[0:0 1:0] in node g1-med-dev1-100, and devs map[1:0xc422a6ab20 0:0xc422a6ab00]
[ debug ] 2020/04/28 11:20:22 nodeinfo.go:121: AvailableGPUs: map[0:11019 1:11019] in node g1-med-dev1-100
[ debug ] 2020/04/28 11:20:22 gpushare-predicate.go:31: The pod gpu-demo-gpushare-6cfbbdfb66-szs7m in the namespace zhaogaolong can be scheduled on g1-med-dev1-100
[  info ] 2020/04/28 11:20:22 routes.go:93: gpusharingfilter extenderFilterResult = {"Nodes":null,"NodeNames":["g1-med-dev1-100"],"FailedNodes":{},"Error":""}
[ debug ] 2020/04/28 11:20:22 routes.go:162: /gpushare-scheduler/filter response=&{0xc4200bcbe0 0xc421bda600 0xc421f29d00 0x565b70 true false false false 0xc421f29e80 {0xc420354540 map[Content-Type:[application/json]] false false} map[Content-Type:[application/json]] true 74 -1 200 false false [] 0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0] [0 0 0] 0xc422f560e0 0}
[ debug ] 2020/04/28 11:20:22 routes.go:160: /gpushare-scheduler/bind request body = &{0xc421e92bc0 <nil> <nil> false true {0 0} false false false 0x69bfd0}
[ debug ] 2020/04/28 11:20:22 routes.go:121: gpusharingBind ExtenderArgs ={gpu-demo-gpushare-6cfbbdfb66-szs7m zhaogaolong 320c6174-95d6-44d1-ac48-62414d49fe13 g1-med-dev1-100}
[ debug ] 2020/04/28 11:20:22 cache.go:155: GetNodeInfo() uses the existing nodeInfo for g1-med-dev1-100
[ debug ] 2020/04/28 11:20:22 nodeinfo.go:143: Allocate() ----Begin to allocate GPU for gpu mem for pod gpu-demo-gpushare-6cfbbdfb66-szs7m in ns zhaogaolong----
[ debug ] 2020/04/28 11:20:22 nodeinfo.go:282: getAllGPUs: map[0:11019 1:11019] in node g1-med-dev1-100, and dev map[0:0xc422a6ab00 1:0xc422a6ab20]
[ debug ] 2020/04/28 11:20:22 deviceinfo.go:42: GetUsedGPUMemory() podMap map[], and its address is 0xc422a6ab00
[ debug ] 2020/04/28 11:20:22 deviceinfo.go:42: GetUsedGPUMemory() podMap map[], and its address is 0xc422a6ab20
[ debug ] 2020/04/28 11:20:22 nodeinfo.go:272: getUsedGPUs: map[0:0 1:0] in node g1-med-dev1-100, and devs map[0:0xc422a6ab00 1:0xc422a6ab20]
[ debug ] 2020/04/28 11:20:22 nodeinfo.go:220: reqGPU for pod gpu-demo-gpushare-6cfbbdfb66-szs7m in ns zhaogaolong: 256
[ debug ] 2020/04/28 11:20:22 nodeinfo.go:221: AvailableGPUs: map[0:11019 1:11019] in node g1-med-dev1-100
[ debug ] 2020/04/28 11:20:22 nodeinfo.go:239: Find candidate dev id 0 for pod gpu-demo-gpushare-6cfbbdfb66-szs7m in ns zhaogaolong successfully.
[ debug ] 2020/04/28 11:20:22 nodeinfo.go:147: Allocate() 1. Allocate GPU ID 0 to pod gpu-demo-gpushare-6cfbbdfb66-szs7m in ns zhaogaolong.----
[  warn ] 2020/04/28 11:20:22 gpushare-bind.go:36: Failed to handle pod gpu-demo-gpushare-6cfbbdfb66-szs7m in ns zhaogaolong due to error Pod "gpu-demo-gpushare-6cfbbdfb66-szs7m" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds` or `spec.tolerations` (only additions to existing tolerations)
  core.PodSpec{
        Volumes:        []core.Volume{{Name: "cpuinfo", VolumeSource: core.VolumeSource{HostPath: &core.HostPathVolumeSource{Path: "/var/lib/lxcfs/proc/cpuinfo", Type: &""}}}, {Name: "meminfo", VolumeSource: core.VolumeSource{HostPath: &core.HostPathVolumeSource{Path: "/var/lib/lxcfs/proc/meminfo", Type: &""}}}, {Name: "diskstats", VolumeSource: core.VolumeSource{HostPath: &core.HostPathVolumeSource{Path: "/var/lib/lxcfs/proc/diskstats", Type: &""}}}, {Name: "stat", VolumeSource: core.VolumeSource{HostPath: &core.HostPathVolumeSource{Path: "/var/lib/lxcfs/proc/stat", Type: &""}}}, {Name: "med-log", VolumeSource: core.VolumeSource{HostPath: &core.HostPathVolumeSource{Path: "/var/log/k8s/zhaogaolong/gpu-demo", Type: &"DirectoryOrCreate"}}}, {Name: "default-token-k4hm4", VolumeSource: core.VolumeSource{Secret: &core.SecretVolumeSource{SecretName: "default-token-k4hm4", DefaultMode: &420}}}},
        InitContainers: nil,
        Containers: []core.Container{
                {
                        ... // 7 identical fields
                        Env:       []core.EnvVar{{Name: "TZ", Value: "Asia/Shanghai"}, {Name: "LANG", Value: "en_US.UTF-8"}, {Name: "LC_ALL", Value: "en_US.UTF-8"}, {Name: "GUAZI_ENV", Value: "dev"}, {Name: "MED_CLUSTER", Value: "dev"}, {Name: "MED_RUNNING_CLUSTER_NAME", Value: "dev1"}, {Name: "MED_ENV", Value: "dev"}, {Name: "CLOUD_ENV", Value: "dev"}, {Name: "MED_REFERENCE", Value: "gpu-demo-gpushare"}, {Name: "MED_GROUP", Value: "zhaogaolong"}, {Name: "MED_APPNAME", Value: "gpu-demo"}, {Name: "MED_DEPLOY", Value: "gpushare"}, {Name: "MED_DUMP", Value: "false"}, {Name: "POD_NAME", ValueFrom: &core.EnvVarSource{FieldRef: &core.ObjectFieldSelector{APIVersion: "v1", FieldPath: "metadata.name"}}}, {Name: "POD_IP", ValueFrom: &core.EnvVarSource{FieldRef: &core.ObjectFieldSelector{APIVersion: "v1", FieldPath: "status.podIP"}}}, {Name: "NODE_NAME", ValueFrom: &core.EnvVarSource{FieldRef: &core.ObjectFieldSelector{APIVersion: "v1", FieldPath: "spec.nodeName"}}}, {Name: "MED_CPU", Value: "1.0"}, {Name: "MED_MEMORY", Value: "2.0"}, {Name: "MED_GPU_SHARE_MEMORY", Value: "0.25"}},
                        Resources: core.ResourceRequirements{Limits: core.ResourceList{"aliyun.com/gpu-mem": {i: resource.int64Amount{value: 256}, s: "256", Format: "DecimalSI"}, "cpu": {i: resource.int64Amount{value: 1}, s: "1", Format: "DecimalSI"}, "memory": {i: resource.int64Amount{value: 2, scale: 9}, s: "2G", Format: "DecimalSI"}}, Requests: core.ResourceList{"aliyun.com/gpu-mem": {i: resource.int64Amount{value: 256}, s: "256", Format: "DecimalSI"}, "cpu": {i: resource.int64Amount{value: 100, scale: -3}, s: "100m", Format: "DecimalSI"}, "memory": {i: resource.int64Amount{value: 400, scale: 6}, s: "400M", Format: "DecimalSI"}}},
                        VolumeMounts: []core.VolumeMount{
                                ... // 2 identical elements
                                {Name: "diskstats", MountPath: "/proc/diskstats"},
                                {Name: "stat", MountPath: "/proc/stat"},
                                {
                                        ... // 3 identical fields
                                        SubPath:          "",
                                        MountPropagation: nil,
-                                       SubPathExpr:      "",
+                                       SubPathExpr:      "$(POD_NAME)/gpu",
                                },
                                {Name: "default-token-k4hm4", ReadOnly: true, MountPath: "/var/run/secrets/kubernetes.io/serviceaccount"},
                        },
                        VolumeDevices: nil,
                        LivenessProbe: nil,
                        ... // 10 identical fields
                },
        },
        EphemeralContainers: nil,
        RestartPolicy:       "Always",
        ... // 24 identical fields
  }
[  info ] 2020/04/28 11:20:22 routes.go:137: extenderBindingResult = {"Error":"Pod \"gpu-demo-gpushare-6cfbbdfb66-szs7m\" is invalid: spec: Forbidden: pod updates may not change fields other than `spec.containers[*].image`, `spec.initContainers[*].image`, `spec.activeDeadlineSeconds` or `spec.tolerations` (only additions to existing tolerations)\n  core.PodSpec{\n  \tVolumes:        []core.Volume{{Name: \"cpuinfo\", VolumeSource: core.VolumeSource{HostPath: \u0026core.HostPathVolumeSource{Path: \"/var/lib/lxcfs/proc/cpuinfo\", Type: \u0026\"\"}}}, {Name: \"meminfo\", VolumeSource: core.VolumeSource{HostPath: \u0026core.HostPathVolumeSource{Path: \"/var/lib/lxcfs/proc/meminfo\", Type: \u0026\"\"}}}, {Name: \"diskstats\", VolumeSource: core.VolumeSource{HostPath: \u0026core.HostPathVolumeSource{Path: \"/var/lib/lxcfs/proc/diskstats\", Type: \u0026\"\"}}}, {Name: \"stat\", VolumeSource: core.VolumeSource{HostPath: \u0026core.HostPathVolumeSource{Path: \"/var/lib/lxcfs/proc/stat\", Type: \u0026\"\"}}}, {Name: \"med-log\", VolumeSource: core.VolumeSource{HostPath: \u0026core.HostPathVolumeSource{Path: \"/var/log/k8s/zhaogaolong/gpu-demo\", Type: \u0026\"DirectoryOrCreate\"}}}, {Name: \"default-token-k4hm4\", VolumeSource: core.VolumeSource{Secret: \u0026core.SecretVolumeSource{SecretName: \"default-token-k4hm4\", DefaultMode: \u0026420}}}},\n  \tInitContainers: nil,\n  \tContainers: []core.Container{\n  \t\t{\n  \t\t\t... 
// 7 identical fields\n  \t\t\tEnv:       []core.EnvVar{{Name: \"TZ\", Value: \"Asia/Shanghai\"}, {Name: \"LANG\", Value: \"en_US.UTF-8\"}, {Name: \"LC_ALL\", Value: \"en_US.UTF-8\"}, {Name: \"GUAZI_ENV\", Value: \"dev\"}, {Name: \"MED_CLUSTER\", Value: \"dev\"}, {Name: \"MED_RUNNING_CLUSTER_NAME\", Value: \"dev1\"}, {Name: \"MED_ENV\", Value: \"dev\"}, {Name: \"CLOUD_ENV\", Value: \"dev\"}, {Name: \"MED_REFERENCE\", Value: \"gpu-demo-gpushare\"}, {Name: \"MED_GROUP\", Value: \"zhaogaolong\"}, {Name: \"MED_APPNAME\", Value: \"gpu-demo\"}, {Name: \"MED_DEPLOY\", Value: \"gpushare\"}, {Name: \"MED_DUMP\", Value: \"false\"}, {Name: \"POD_NAME\", ValueFrom: \u0026core.EnvVarSource{FieldRef: \u0026core.ObjectFieldSelector{APIVersion: \"v1\", FieldPath: \"metadata.name\"}}}, {Name: \"POD_IP\", ValueFrom: \u0026core.EnvVarSource{FieldRef: \u0026core.ObjectFieldSelector{APIVersion: \"v1\", FieldPath: \"status.podIP\"}}}, {Name: \"NODE_NAME\", ValueFrom: \u0026core.EnvVarSource{FieldRef: \u0026core.ObjectFieldSelector{APIVersion: \"v1\", FieldPath: \"spec.nodeName\"}}}, {Name: \"MED_CPU\", Value: \"1.0\"}, {Name: \"MED_MEMORY\", Value: \"2.0\"}, {Name: \"MED_GPU_SHARE_MEMORY\", Value: \"0.25\"}},\n  \t\t\tResources: core.ResourceRequirements{Limits: core.ResourceList{\"aliyun.com/gpu-mem\": {i: resource.int64Amount{value: 256}, s: \"256\", Format: \"DecimalSI\"}, \"cpu\": {i: resource.int64Amount{value: 1}, s: \"1\", Format: \"DecimalSI\"}, \"memory\": {i: resource.int64Amount{value: 2, scale: 9}, s: \"2G\", Format: \"DecimalSI\"}}, Requests: core.ResourceList{\"aliyun.com/gpu-mem\": {i: resource.int64Amount{value: 256}, s: \"256\", Format: \"DecimalSI\"}, \"cpu\": {i: resource.int64Amount{value: 100, scale: -3}, s: \"100m\", Format: \"DecimalSI\"}, \"memory\": {i: resource.int64Amount{value: 400, scale: 6}, s: \"400M\", Format: \"DecimalSI\"}}},\n  \t\t\tVolumeMounts: []core.VolumeMount{\n  \t\t\t\t... 
// 2 identical elements\n  \t\t\t\t{Name: \"diskstats\", MountPath: \"/proc/diskstats\"},\n  \t\t\t\t{Name: \"stat\", MountPath: \"/proc/stat\"},\n  \t\t\t\t{\n  \t\t\t\t\t... // 3 identical fields\n  \t\t\t\t\tSubPath:          \"\",\n  \t\t\t\t\tMountPropagation: nil,\n- \t\t\t\t\tSubPathExpr:      \"\",\n+ \t\t\t\t\tSubPathExpr:      \"$(POD_NAME)/gpu\",\n  \t\t\t\t},\n  \t\t\t\t{Name: \"default-token-k4hm4\", ReadOnly: true, MountPath: \"/var/run/secrets/kubernetes.io/serviceaccount\"},\n  \t\t\t},\n  \t\t\tVolumeDevices: nil,\n  \t\t\tLivenessProbe: nil,\n  \t\t\t... // 10 identical fields\n  \t\t},\n  \t},\n  \tEphemeralContainers: nil,\n  \tRestartPolicy:       \"Always\",\n  \t... // 24 identical fields\n  }\n"}
[ debug ] 2020/04/28 11:20:22 routes.go:162: /gpushare-scheduler/bind response=&{0xc4200bcbe0 0xc4222c2c00 0xc421ea7280 0x565b70 true false false false 0xc421ea7300 {0xc421f46380 map[Content-Type:[application/json]] true true} map[Content-Type:[application/json]] true 4029 -1 500 false false [] 0 [84 117 101 44 32 50 56 32 65 112 114 32 50 48 50 48 32 49 49 58 50 48 58 50 50 32 71 77 84] [0 0 0 0 0 0 0 0 0 0] [53 48 48] 0xc422bb82a0 0}

I suspect the dependency on client-go is too old. In Gopkg.toml:

[[constraint]]
  name = "k8s.io/client-go"
  version = "~v8.0.0"

but this version does not support Kubernetes 1.17.
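If that diagnosis is right, bumping the constraint to a client-go release that matches the cluster should help. A sketch of what the Gopkg.toml entry might look like — the exact tag is an assumption (client-go publishes both kubernetes-1.x.y and, from 1.17 on, v0.x.y tags), and the other k8s.io dependencies would need matching bumps:

[[constraint]]
  name = "k8s.io/client-go"
  # v0.17.0 corresponds to Kubernetes 1.17; pin whichever tag
  # matches your cluster version.
  version = "v0.17.0"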

cheyang commented 4 years ago

Thank you for reporting it. gpushare-scheduler only changes pod metadata, so it shouldn't be affected. https://discuss.kubernetes.io/t/updating-annotations-on-pod/11025
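The metadata-only approach can be sketched as a JSON merge patch that touches nothing but annotations. This avoids reading the pod and calling Update(), which round-trips the whole PodSpec through the client's types — with an old client-go, newer spec fields (such as SubPathExpr in the diff above) can be dropped on that round trip, which would explain the Forbidden error. A minimal illustration; the annotation key and the Patch call shape are assumptions, not the extender's exact code:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// buildAnnotationPatch returns a JSON merge patch that only touches
// metadata.annotations, leaving spec entirely out of the request body.
func buildAnnotationPatch(annotations map[string]string) ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"annotations": annotations,
		},
	}
	return json.Marshal(patch)
}

func main() {
	// Example annotation key; the extender uses its own key names.
	p, err := buildAnnotationPatch(map[string]string{
		"ALIYUN_COM_GPU_MEM_IDX": "0",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(p))
	// With a live clientset one would then send the patch, roughly:
	//   clientset.CoreV1().Pods(ns).Patch(name, types.MergePatchType, p)
}
```

Because the request body never contains spec, the apiserver has no spec diff to reject, regardless of which fields the client's types know about.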

jaroslawk commented 3 years ago

The source of this issue is an old client-go dependency, as you mention. I had the same problem.

This fork solves it reasonably well: https://github.com/k8s-gpushare/gpushare-scheduler-extender

zhaogaolong commented 3 years ago

Yes, I fixed it. Please close this issue.

hellobiek commented 2 years ago

@zhaogaolong How did you fix it?

yangweige commented 1 year ago

@zhaogaolong How did you solve this problem? Can you tell us a little about it? Thank you.

yangweige commented 1 year ago

This is how we use it: we configure the gpushare parameters in the Kubernetes YAML file, but it fails. Please tell us how to fix this way of requesting GPU resources. Thank you.

This error: Binding rejected: failed bind with extender at URL http://192.168.1.1:32766/gpushare-scheduler/bind, code 500

The resources section of the Kubernetes YAML file:

resources:
  requests:
    cpu: 1000m
    memory: 1000Mi
  limits:
    cpu: 2000m
    memory: 4096Mi
    aliyun.com/gpu-count: 1
    aliyun.com/gpu-mem: 2
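For context, a resources block like the one above sits inside a full pod spec roughly as follows. Names and image are placeholders, and whether aliyun.com/gpu-count and aliyun.com/gpu-mem may be combined in one container depends on the extender's configuration — this is a sketch, not a verified working manifest:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-demo                                   # placeholder name
spec:
  containers:
  - name: gpu-demo
    image: registry.example.com/gpu-demo:latest    # placeholder image
    resources:
      requests:
        cpu: 1000m
        memory: 1000Mi
      limits:
        cpu: 2000m
        memory: 4096Mi
        aliyun.com/gpu-count: 1                    # physical GPUs requested
        aliyun.com/gpu-mem: 2                      # shared GPU memory requested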