AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

Two GPUs are detected, but after the /gpushare-scheduler/filter request some containers are always scheduled onto only one of them #186

Closed 1003111014 closed 1 year ago

1003111014 commented 1 year ago

Scheduler log: begin to sync gpushare pod p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx in ns ai-model [ debug ] 2022/09/05 13:42:11 cache.go:90: Add or update pod info: &Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx,GenerateName:p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-,Namespace:ai-model,SelfLink:,UID:4c24f578-1a0c-417b-adb0-76b0b2604c06,ResourceVersion:60857261,Generation:0,CreationTimestamp:2022-09-05 13:42:11 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 00c54b7a-d015-11ec-8db7-12cc16fb82ca,pod-template-hash: 6f8f5c45dc,},Annotations:map[string]string{cattle.io/timestamp: 2022-09-02T09:28:07Z,field.cattle.io/ports: [[{"containerPort":7070,"dnsName":"p-00c54b7a-d015-11ec-8db7-12cc16fb82ca","hostPort":0,"kind":"ClusterIP","name":"port-7070","protocol":"TCP","sourcePort":0}]],},OwnerReferences:[{apps/v1 ReplicaSet p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc 56dd7312-1c8e-4531-8961-331905556825 0xc420719ada 0xc420719adb}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-00c54b7a-d015-11ec-8db7-12cc16fb82ca registry.kk.com/ai/offline_function_tensorrt/air_switch:amd-2.4 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil 
nil /dev/termination-log File IfNotPresent SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,} false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc42090e510} {node.kubernetes.io/unreachable Exists NoExecute 0xc42090e530}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[],Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} [ debug ] 2022/09/05 13:42:11 cache.go:91: Node map[worker1:0xc420fd43c0 worker2:0xc4206b8cc0] [ debug ] 2022/09/05 13:42:11 cache.go:93: pod p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx in ns ai-model is not assigned to any node, skip [ debug ] 2022/09/05 13:42:11 controller.go:234: end processNextWorkItem() [ debug ] 2022/09/05 13:42:11 routes.go:160: /gpushare-scheduler/filter request body = &{0xc420575160 <nil> <nil> false true {0 0} false false false 0x69c120} [ debug ] 2022/09/05 13:42:11 routes.go:81: gpusharingfilter ExtenderArgs 
={&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx,GenerateName:p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-,Namespace:ai-model,SelfLink:,UID:4c24f578-1a0c-417b-adb0-76b0b2604c06,ResourceVersion:60857261,Generation:0,CreationTimestamp:2022-09-05 13:42:11 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 00c54b7a-d015-11ec-8db7-12cc16fb82ca,pod-template-hash: 6f8f5c45dc,},Annotations:map[string]string{cattle.io/timestamp: 2022-09-02T09:28:07Z,field.cattle.io/ports: [[{"containerPort":7070,"dnsName":"p-00c54b7a-d015-11ec-8db7-12cc16fb82ca","hostPort":0,"kind":"ClusterIP","name":"port-7070","protocol":"TCP","sourcePort":0}]],},OwnerReferences:[{apps/v1 ReplicaSet p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc 56dd7312-1c8e-4531-8961-331905556825 0xc420e88737 0xc420e88738}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-00c54b7a-d015-11ec-8db7-12cc16fb82ca registry.kk.com/ai/offline_function_tensorrt/air_switch:amd-2.4 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent Security 
Context{Capabilities:&Capabilities{Add:[],Drop:[],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,} false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc420e88830} {node.kubernetes.io/unreachable Exists NoExecute 0xc420e88850}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[],Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} nil 0xc4205754a0} [ info ] 2022/09/05 13:42:11 gpushare-predicate.go:17: check if the pod name p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx can be scheduled on node worker1 [ info ] 2022/09/05 13:42:11 cache.go:160: GetNodeInfo() uses the existing nodeInfo for worker1 [ debug ] 2022/09/05 13:42:11 cache.go:162: node worker1 with devices map[0:0xc4203babc0] [ info ] 2022/09/05 13:42:11 nodeinfo.go:423: getAllGPUs: map[0:12288] in node worker1, and dev map[0:0xc4203babc0] [ debug ] 2022/09/05 13:42:11 deviceinfo.go:42: GetUsedGPUMemory() podMap 
map[e0aa6077-113c-4a86-b4f4-6a93d2754747:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189-784bbcf688-mmb56,GenerateName:p-3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189-784bbcf688-,Namespace:ai-model,SelfLink:,UID:e0aa6077-113c-4a86-b4f4-6a93d2754747,ResourceVersion:60854784,Generation:0,CreationTimestamp:2022-09-05 13:34:53 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189,pod-template-hash: 784bbcf688,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384893715201857,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,cattle.io/timestamp: 2022-09-02T09:09:47Z,field.cattle.io/ports: [[{"containerPort":7070,"dnsName":"p-3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189","hostPort":0,"kind":"ClusterIP","name":"port-7070","protocol":"TCP","sourcePort":0}]],workload.cattle.io/state: {"d29ya2VyMQ==":"local:machine-95hdb"},},OwnerReferences:[{apps/v1 ReplicaSet p-3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189-784bbcf688 310dad46-6da8-4414-8d92-be6190f43f5c 0xc4208e8258 0xc4208e8259}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189 registry.kk.com/ai/offline_function_tensorrt/sign:amd-2.3 [] [] [{port-7070 0 7070 TCP }] [] [{DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_SECRET_KEY nil nil} {MINIO_SECURE False nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k 
DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,} false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc4208e8288} {node.kubernetes.io/unreachable Exists NoExecute 0xc4208e82b0}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:53 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:53 +0000 UTC ContainersNotReady containers with unready status: [c-3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:53 +0000 UTC ContainersNotReady containers with unready status: [c-3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:53 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:,StartTime:2022-09-05 13:34:53 +0000 UTC,ContainerStatuses:[{c-3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189 {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} fal se 0 registry.kk.com/ai/offline_function_tensorrt/sign:amd-2.3 
}],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 9a5d93f9-9c35-4633-b04d-1110035ebffd:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-56c0cd7c-d026-11ec-9141-12cc16fb82ca-85f94fd96c-qb99p,GenerateName:p-56c0cd7c-d026-11ec-9141-12cc16fb82ca-85f94fd96c-,Namespace:ai-model,SelfLink:,UID:9a5d93f9-9c35-4633-b04d-1110035ebffd,ResourceVersion:60853181,Generation:0,CreationTimestamp:2022-09-05 13:31:19 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 56c0cd7c-d026-11ec-9141-12cc16fb82ca,pod-template-hash: 85f94fd96c,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384679444463638,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,},OwnerReferences:[{apps/v1 ReplicaSet p-56c0cd7c-d026-11ec-9141-12cc16fb82ca-85f94fd96c ced90aa2-65c4-4170-882a-67f1fff26ff4 0xc420eab908 0xc420eab909}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-56c0cd7c-d026-11ec-9141-12cc16fb82ca registry.kk.com/ai/offline_function_tensorrt/oil_leak:amd-2.2 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false fals 
e}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc420eab918} {node.kubernetes.io/unreachable Exists NoExecute 0xc420eab920}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:19 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:19 +0000 UTC ContainersNotReady containers with unready status: [c-56c0cd7c-d026-11ec-9141-12cc16fb82ca]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:19 +0000 UTC ContainersNotReady containers with unready status: [c-56c0cd7c-d026-11ec-9141-12cc16fb82ca]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:19 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:,StartTime:2022-09-05 13:31:19 +0000 UTC,ContainerStatuses:[{c-56c0cd7c-d026-11ec-9141-12cc16fb82ca {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 registry.kk.com/ai/offline_function_tensorrt/oil_leak:amd-2.2 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 
7036996a-83c0-450d-bffb-54a18866e28c:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-026d8cd8-e0b6-11ec-b0ec-4a6aa9346189-c64b6cbb6-gwztm,GenerateName:p-026d8cd8-e0b6-11ec-b0ec-4a6aa9346189-c64b6cbb6-,Namespace:ai-model,SelfLink:,UID:7036996a-83c0-450d-bffb-54a18866e28c,ResourceVersion:59250704,Generation:0,Cr 09:32:43 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 026d8cd8-e0b6-11ec-b0ec-4a6aa9346189,pod-template-hash: c64b6cbb6,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662111163298589414,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,},OwnerReferences:[{apps/v1 ReplicaSet p-026d8cd8-e0b6-11ec-b0ec-4a6aa9346189-c64b6cbb6 e5d757bf-9098-43f1-99fd-e49a3146b064 0xc42075d248 0xc42075d249}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-026d8cd8-e0b6-11ec-b0ec-4a6aa9346189 registry.kk.com/ai/offline_function_tensorrt/break:amd-2.2 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false 
false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerNam e:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc42075d268} {node.kubernetes.io/unreachable Exists NoExecute 0xc42075d300}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-02 09:32:43 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-09-02 09:32:43 +0000 UTC ContainersNotReady containers with unready status: [c-026d8cd8-e0b6-11ec-b0ec-4a6aa9346189]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-02 09:32:43 +0000 UTC ContainersNotReady containers with unready status: [c-026d8cd8-e0b6-11ec-b0ec-4a6aa9346189]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-02 09:32:43 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:,StartTime:2022-09-02 09:32:43 +0000 UTC,ContainerStatuses:[{c-026d8cd8-e0b6-11ec-b0ec-4a6aa9346189 {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 registry.kk.com/ai/offline_function_tensorrt/break:amd-2.2 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 
1c170222-116a-47a8-aa2f-a3d2abf3f39e:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-ca735e4a-00fc-11ed-9340-667f53977afd-5fc78ccb9d-tnbt5,GenerateName:p-ca735e4a-00fc-11ed-9340-667f53977afd-5fc78ccb9d-,Namespace:ai-model,SelfLink:,UID:1c170222-116a-47a8-aa2f-a3d2abf3f39e,ResourceVersion:60854436,Generation:0,CreationTimestamp:2022-09-05 13:34:05 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: ca735e4a-00fc-11ed-9340-667f53977afd,pod-template-hash: 5fc78ccb9d,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384845685953755,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,},OwnerReferences:[{apps/v1 ReplicaSet p-ca735e4a-00fc-11ed-9340-667f53977afd- 5fc78ccb9d e5afde70-d544-40a8-9172-1551da18a995 0xc420f664c8 0xc420f664c9}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-ca735e4a-00fc-11ed-9340-667f53977afd registry.kk.com/ai/offline_function_tensorrt/meter_sf6:amd-2.2 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false 
false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc420f664d8} {node.kubernetes.io/unreachable Exists NoExecute 0xc420f664e0}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:05 +0000 UTC } {Ready False 0001-01-01 00:00:00 UTC 2022-09-05 13:34:05 +0000 UTC ContainersNotReady containers with unready status: [c-ca735e4a-00fc-11ed-9340-667f53977afd]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:05 +0000 UTC ContainersNotReady containers with unready status: [c-ca735e4a-00fc-11ed-9340-667f53977afd]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:05 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:,StartTime:2022-09-05 13:34:05 +0000 UTC,ContainerStatuses:[{c-ca735e4a-00fc-11ed-9340-667f53977afd {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 registry.kk.com/ai/offline_function_tensorrt/meter_sf6:amd-2.2 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 
3f0a633e-3aa4-44b8-95ff-708dd07104a8:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-a2ad1890-04b4-11ed-a69a-da2f2c420d44-675bd85754-8pl5z,GenerateName:p-a2ad1890-04b4-11ed-a69a-da2f2c420d44-675bd85754-,Namespace:ai-model,SelfLink:,UID:3f0a633e-3aa4-44b8-95ff-708dd07104a8,ResourceVersion:60854674,Generation:0,CreationTimestamp:2022-09-05 13:34:37 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: a2ad1890-04b4-11ed-a69a-da2f2c420d44,pod-template-hash: 675bd85754,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384877769111575,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,},OwnerReferences:[{apps/v1 ReplicaSet p-a2ad1890-04b4-11ed-a69a-da2f2c420d44-675bd85754 5df24a79-2516-4960-9123-75ff0da93c1f 0xc42047e8c8 0xc42047e8c9}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-a2ad1890-04b4-11ed-a69a-da2f2c420d44 registry.kk.com/ai/offline_function_tensorrt/arrest [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false 
false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc42047e8d8} {node.kubernetes.io/unreachable Exists NoExecute 0xc42047e8e0}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:37 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:37 +0000 UTC ContainersNotReady containers with unready status: [c-a2ad1890-04b4-11ed-a69a-da2f2c420d44]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:37 +0000 UTC ContainersNotReady containers with unready status: [c-a2ad1890-04b4-11ed-a69a-da2f2c420d44]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:37 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:,StartTime:2022-09-05 13:34:37 +0 UTC,ContainerStatuses:[{c-a2ad1890-04b4-11ed-a69a-da2f2c420d44 {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 registry.kk.com/ai/offline_function_tensorrt/arrester:amd-2.3 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 
9344fe44-a270-4dc9-b7fb-73e5faa04297:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-ad4a9c10-d5c4-11ec-a679-664243c08fda-64bc47bf7f-9h4sd,GenerateName:p-ad4a9c10-d5c4-11ec-a679-664243c08fda-64bc47bf7f-,Namespace:ai-model,SelfLink:,UID:9344fe44-a270-4dc9-b7fb-73e5faa04297,ResourceVersion:60854558,Generation:0,CreationTimestamp:2022-09-05 13:34:21 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: ad4a9c10-d5c4-11ec-a679-664243c08fda,pod-template-hash: 64bc47bf7f,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384861859045536,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,},OwnerReferences:[{apps/v1 ReplicaSet p-ad4a9c10-d5c4-11ec-a679-664243c08fda-64bc47bf7f bc501ac1-4d2f-4981-ba98-1c2a2c2f10bc 0xc42083b7a8 0xc42083b7a9}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-ad4a9c10-d5c4-11ec-a679-664243c08fda registry.kk.com/ai/offline_function_tensorrt/opening_closing:amd-2.2 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k Decima lSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false 
false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc42083b7b8} {node.kubernetes.io/unreachable Exists NoExecute 0xc42083b7d0}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:21 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:21 +0000 UTC ContainersNotReady containers with unready status: [c-ad4a9c10-d5c4-11ec-a679-664243c08fda]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:21 +0000 UTC ContainersNotReady containers with unready status: [c-ad4a9c10-d5c4-11ec-a679-664243c08fda]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:34:21 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:,StartTime:2022-09-05 13:34:21 +0000 UTC,ContainerStatuses:[{c-ad4a9c10-d5c4-11ec-a679-664243c08fda {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 registry.kk.com/ai/offline_function_tensorrt/opening_closing:amd-2.2 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} aaf48e5b-8d2b-47d2-ac82-0e189e58b82c:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-b521d0ac-d4c0-11ec-8214-4a2f8bc05280-8767b6d44-jmdkr,Gen 
erateName:p-b521d0ac-d4c0-11ec-8214-4a2f8bc05280-8767b6d44-,Namespace:ai-model,SelfLink:,UID:aaf48e5b-8d2b-47d2-ac82-0e189e58b82c,ResourceVersion:60853402,Generation:0,CreationTimestamp:2022-09-05 13:31:35 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: b521d0ac-d4c0-11ec-8214-4a2f8bc05280,pod-template-hash: 8767b6d44,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384695714386603,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,cni.projectcalico.org/podIP: 10.42.3.193/32,cni.projectcalico.org/podIPs: 10.42.3.193/32,},OwnerReferences:[{apps/v1 ReplicaSet p-b521d0ac-d4c0-11ec-8214-4a2f8bc05280-8767b6d44 f85c8ab5-e6c2-4db2-886c-66c705d9a75d 0xc4211c4f80 0xc4211c4f81}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-b521d0ac-d4c0-11ec-8214-4a2f8bc05280 registry.kk.com/ai/offline_function_tensorrt/oil_temperature:amd-2.2 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false 
false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName :worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc4211c4f90} {node.kubernetes.io/unreachable Exists NoExecute 0xc4211c4f98}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Running,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:35 +0000 UTC } {Ready True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:37 +0000 UTC } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:37 +0000 UTC } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:35 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:10.42.3.193,StartTime:2022-09-05 13:31:35 +0000 UTC,ContainerStatuses:[{c-b521d0ac-d4c0-11ec-8214-4a2f8bc05280 {nil ContainerStateRunning{StartedAt:2022-09-05 13:31:37 +0000 UTC,} nil} {nil nil nil} true 0 registry.kk.com/ai/offline_function_tensorrt/oil_temperature:amd-2.2 docker-pullable://registry.kk.com/ai/offline_function_tensorrt/oil_temperature@sha256:9a3c5f598e91895cd8d64f75ec937e0e97678de97f29795d186576c937415d27 docker://822e435aa8b7bad3b7bc2117bc084d28a597b84b1ca94301a223f66ce0d14276}],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 
59e08cc0-f4a5-46f7-b464-8a93a8949689:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-d049b5a6-d025-11ec-9bf1-12cc16fb82ca-85557f5c56-knqln,GenerateName:p-d049b5a6-d025-11ec-9bf1-12cc16fb82ca-85557f5c56-,Namespace:ai-model,SelfLink:,UID:59e08cc0-f4a5-46f7-b464-8a93a8949689,ResourceVersion:60853539,Generation:0,CreationTimestamp:2022-09-05 13:31:48 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: d049b5a6-d 025-11ec-9bf1-12cc16fb82ca,pod-template-hash: 85557f5c56,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384708195011096,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,cni.projectcalico.org/podIP: 10.42.3.194/32,cni.projectcalico.org/podIPs: 10.42.3.194/32,},OwnerReferences:[{apps/v1 ReplicaSet p-d049b5a6-d025-11ec-9bf1-12cc16fb82ca-85557f5c56 78613520-9fe5-4eed-aedb-477bb311fc9f 0xc420edb1f0 0xc420edb1f1}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-d049b5a6-d025-11ec-9bf1-12cc16fb82ca registry.kk.com/ai/offline_function_tensorrt/clamp:amd-2.4 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false 
false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountSer viceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc420edb200} {node.kubernetes.io/unreachable Exists NoExecute 0xc420edb208}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Running,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:48 +0000 UTC } {Ready True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:50 +0000 UTC } {ContainersReady True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:50 +0000 UTC } {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:48 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:10.42.3.194,StartTime:2022-09-05 13:31:48 +0000 UTC,ContainerStatuses:[{c-d049b5a6-d025-11ec-9bf1-12cc16fb82ca {nil ContainerStateRunning{StartedAt:2022-09-05 13:31:49 +0000 UTC,} nil} {nil nil nil} true 0 registry.kk.com/ai/offline_function_tensorrt/clamp:amd-2.4 docker-pullable://registry.kk.com/ai/offline_function_tensorrt/clamp@sha256:f6c4bec5083a5f06d5f3caad1829a4b60edd9aa1497e740dd875f223f305dc8e docker://6bc4196f459adc286a8d3b7f501b2fda1cc398d8a524212905a5abb02ef0030a}],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 
545fe3a8-1ce9-43c8-8bd3-63fa246daceb:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-27fef2d4-d026-11ec-9e71-12cc16fb82ca-6cb96fbb47-n2njd,GenerateName:p-27fef2d4-d026-11ec-9e71-12cc16fb82ca-6cb96fbb47-,Namespace:ai-model,SelfLink:,UID:545fe3a8-1ce9-43c8-8bd3-63fa246daceb,ResourceVersion:60854186,Generation:0,CreationTimestamp:2022-09-05 13:33:32 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 27fef2d4-d026-11ec-9e71-12cc16fb82ca,pod-template-hash: 6cb96fbb47,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384812800850122,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,cattle.io/timestamp: 2022-09-02T09:09:18Z,field.cattle.io/ports: [[{"containerP ort":7070,"dnsName":"p-27fef2d4-d026-11ec-9e71-12cc16fb82ca","hostPort":0,"kind":"ClusterIP","name":"port-7070","protocol":"TCP","sourcePort":0}]],workload.cattle.io/state: {"d29ya2VyMQ==":"local:machine-95hdb"},},OwnerReferences:[{apps/v1 ReplicaSet p-27fef2d4-d026-11ec-9e71-12cc16fb82ca-6cb96fbb47 f57acef1-56aa-4a7c-81d6-3120f705d370 0xc42207af18 0xc42207af19}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-27fef2d4-d026-11ec-9e71-12cc16fb82ca registry.kk.com/ai/offline_function_tensorrt/respirator:amd-2.3 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k 
DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,} false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil, SchedulerName:default-scheduler ,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc42207af28} {node.kubernetes.io/unreachable Exists NoExecute 0xc42207af30}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:33:32 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:33:32 +0000 UTC ContainersNotReady containers with unready status: [c-27fef2d4-d026-11ec-9e71-12cc16fb82ca]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:33:32 +0000 UTC ContainersNotReady containers with unready status: [c-27fef2d4-d026-11ec-9e71-12cc16fb82ca]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:33:32 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:,StartTime:2022-09-05 13:33:32 +0000 UTC,ContainerStatuses:[{c-27fef2d4-d026-11ec-9e71-12cc16fb82ca {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 
registry.kk.com/ai/offline_function_tensorrt/respirator:amd-2.3 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 4f68c68e-df2a-4846-8ce7-2dd1b06fd60a:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-fa4f5964-d025-11ec-b57d-12cc16fb82ca-6fbff6899f-kvdft,GenerateName:p-fa4f5964-d025-11ec-b57d-12cc16fb82ca-6fbff6899f-,Namespace:ai-model,SelfLink:,UID:4f68c68e-df2a-4846-8ce7-2dd1b06fd60a,ResourceVersion:60853654,Generation:0,CreationTimestamp:2022-09-05 13:32:05 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: fa4f5964-d025-11ec-b57d-12cc16fb82ca,pod-template-hash: 6fbff6899f,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384725280351691,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,},OwnerReferences:[{apps/v1 ReplicaSet p-fa4f5964-d025-11ec-b 57d-12cc16fb82ca-6fbff6899f 610e2e85-1ea0-462d-8c59-fb6ad3497653 0xc420e89b98 0xc420e89b99}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-fa4f5964-d025-11ec-b57d-12cc16fb82ca registry.kk.com/ai/offline_function_tensorrt/toggle_switch:amd-2.5 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil 
false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker1,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc420e89bb8} {node.kubernetes.io/unreachable Exists NoExecute 0xc420e89bc0}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:32:05 +0000 UTC } {Ready Fals e 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:32:05 +0000 UTC ContainersNotReady containers with unready status: [c-fa4f5964-d025-11ec-b57d-12cc16fb82ca]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:32:05 +0000 UTC ContainersNotReady containers with unready status: [c-fa4f5964-d025-11ec-b57d-12cc16fb82ca]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:32:05 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.14,PodIP:,StartTime:2022-09-05 13:32:05 +0000 UTC,ContainerStatuses:[{c-fa4f5964-d025-11ec-b57d-12cc16fb82ca {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 registry.kk.com/ai/offline_function_tensorrt/toggle_switch:amd-2.5 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},}], and its address is 0xc4203babc0 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-56c0cd7c-d026-11ec-9141-12cc16fb82ca-85f94fd96c-qb99p in ns ai-model with status Pending has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod 
p-026d8cd8-e0b6-11ec-b0ec-4a6aa9346189-c64b6cbb6-gwztm in ns ai-model with status Pending has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-ca735e4a-00fc-11ed-9340-667f53977afd-5fc78ccb9d-tnbt5 in ns ai-model with status Pending has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-a2ad1890-04b4-11ed-a69a-da2f2c420d44-675bd85754-8pl5z in ns ai-model with status Pending has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-3f3b5a0a-e0b6-11ec-bd5a-4a6aa9346189-784bbcf688-mmb56 in ns ai-model with status Pending has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-b521d0ac-d4c0-11ec-8214-4a2f8bc05280-8767b6d44-jmdkr in ns ai-model with status Running has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-d049b5a6-d025-11ec-9bf1-12cc16fb82ca-85557f5c56-knqln in ns ai-model with status Running has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-27fef2d4-d026-11ec-9e71-12cc16fb82ca-6cb96fbb47-n2njd in ns ai-model with status Pending has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-fa4f5964-d025-11ec-b57d-12cc16fb82ca-6fbff6899f-kvdft in ns ai-model with status Pending has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-ad4a9c10-d5c4-11ec-a679-664243c08fda-64bc47bf7f-9h4sd in ns ai-model with status Pending has GPU Mem 1000 [ info ] 2022/09/05 13:42:11 nodeinfo.go:413: getUsedGPUs: map[0:10000] in node worker1, and devs map[0:0xc4203babc0] [ info ] 2022/09/05 13:42:11 nodeinfo.go:431: try to find unhealthy node unhealthy-gpu-worker1 [ info ] 2022/09/05 13:42:11 nodeinfo.go:397: available GPU list map[0:2288] before removing unhealty GPUs [ info ] 2022/09/05 13:42:11 nodeinfo.go:402: available GPU list map[0:2288] after removing unhealty GPUs [ debug ] 2022/09/05 13:42:11 nodeinfo.go:162: AvailableGPUs: map[0:2288] in node worker1 [ info ] 2022/09/05 13:42:11 gpushare-predicate.go:31: The pod p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx in the namespace ai-model 
can be scheduled on worker1 [ info ] 2022/09/05 13:42:11 gpushare-predicate.go:17: check if the pod name p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx can be scheduled on node worker2 [ info ] 2022/09/05 13:42:11 cache.go:160: GetNodeInfo() uses the existing nodeInfo for worker2 [ debug ] 2022/09/05 13:42:11 cache.go:162: node worker2 with devices map[0:0xc42215caa0] [ info ] 2022/09/05 13:42:11 nodeinfo.go:423: getAllGPUs: map[0:12288] in node worker2, and dev map[0:0xc42215caa0] [ debug ] 2022/09/05 13:42:11 deviceinfo.go:42: GetUsedGPUMemory() podMap map[1cbc9c36-73f5-4e2e-8ecf-9206a409d69b:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-c62f1b76-0408-11ed-9340-667f53977afd-6d45556b4d-bddqz,GenerateName:p-c62f1b76-0408-11ed-9340-667f53977afd-6d45556b4d-,Namespace:ai-model,SelfLink:,UID:1cbc9c36-73f5-4e2e-8ecf-9206a409d69b,ResourceVersion:60853218,Generation:0,CreationTimestamp:2022-09-05 13:31:21 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: c62f1b76-0408-11ed-9340-667f53977afd,pod-template-hash: 6d45556b4d,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384681501299370,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,},OwnerReferences:[{apps/v1 ReplicaSet p-c62f1b76-0408-11ed-9340-667f53977afd-6d45556b4d 40ab5179-2c75-44f4-bd6c-0a84fdd29c93 0xc4208a6c18 0xc4208a6c19}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-c62f1b76-0408-11ed-9340-667f53977afd registry.kk.com/ai/offline_function_tensorrt/volt_meter:amd-2.8 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} 
{MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}], RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker2,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc4208a6c28} {node.kubernetes.io/unreachable Exists NoExecute 0xc4208a6c30}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:21 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:21 +0000 UTC ContainersNotReady containers with unready status: [c-c62f1b76-0408-11ed-9340-667f53977afd]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:21 +0000 UTC ContainersNotReady containers with unready status: [c-c62f1b76-0408-11ed-9340-667f53977afd]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:31:21 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.15,PodIP:,StartTime:2022-09-05 13:31:21 +0000 UTC,ContainerStatuses:[{c-c62f1b76-0408-11ed-9340-667f53977afd {ContainerStateWaiting{Reason:ContainerCreating,Message:,} 
nil nil} {nil nil nil} false 0 registry.kk.com/ai/offline_function_tensorrt/volt_meter:amd-2.8 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 72301f05-54d4-454b-96ea-96d1180620b2:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-775af1d0-d025-11ec-b79a-12cc16fb82ca-655b867d44-zzdbr,GenerateName:p-775af1d0-d025-11ec-b79a-12cc16fb82ca-655b867d44-,Namespace:ai-model,SelfLink:,UID:72301f05-54d4-454b-96ea-96d1180620b2,ResourceVersion:60854311,Generation:0,Cr 13:33:47 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 775af1d0-d025-11ec-b79a-12cc16fb82ca,pod-template-hash: 655b867d44,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1662384828135383512,ALIYUN_COM_GPU_MEM_DEV: 12288,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,cattle.io/timestamp: 2022-06-15T08:43:32Z,field.cattle.io/ports: [[{"containerPort":7070,"dnsName":"p-775af1d0-d025-11ec-b79a-12cc16fb82ca","hostPort":0,"kind":"ClusterIP","name":"port-7070","protocol":"TCP","sourcePort":0}]],},OwnerReferences:[{apps/v1 ReplicaSet p-775af1d0-d025-11ec-b79a-12cc16fb82ca-655b867d44 71570ef2-b354-4ff5-b41a-a4f70dbf8b1a 0xc42207a808 0xc42207a809}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-zt2rw {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-zt2rw,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-775af1d0-d025-11ec-b79a-12cc16fb82ca registry.kk.com/ai/offline_function_tensorrt/oil_level:amd-2.4 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio-k9jzg.bps:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_SECRET_KEY nil nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.bps:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] 
{map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-zt2rw true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,} false false false}],RestartPolicy:Always,TerminationGracePeriodSecon ds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker2,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc42207a818} {node.kubernetes.io/unreachable Exists NoExecute 0xc42207a820}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:33:48 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:33:48 +0000 UTC ContainersNotReady containers with unready status: [c-775af1d0-d025-11ec-b79a-12cc16fb82ca]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:33:48 +0000 UTC ContainersNotReady containers with unready status: [c-775af1d0-d025-11ec-b79a-12cc16fb82ca]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-09-05 13:33:48 +0000 UTC }],Message:,Reason:,HostIP:192.168.30.15,PodIP:,StartTime:2022-09-05 13:33:48 +0000 UTC,ContainerStatuses:[{c-775af1d0-d025-11ec-b79a-12cc16fb82ca 
{ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 registry.kk.com/ai/offline_function_tensorrt/oil_level:amd-2.4 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},}], and its address is 0xc42215caa0 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-775af1d0-d025-11ec-b79a-12cc16fb82ca-655b867d44-zzdbr in ns ai-model with status Pending has GPU Mem 1000 [ debug ] 2022/09/05 13:42:11 pod.go:107: pod p-c62f1b76-0408-11ed-9340-667f53977afd-6d45556b4d-bddqz in ns ai-model with status Pending has GPU Mem 1000 [ info ] 2022/09/05 13:42:11 nodeinfo.go:413: getUsedGPUs: map[0:2000] in node worker2, and devs map[0:0xc42215caa0] [ info ] 2022/09/05 13:42:11 nodeinfo.go:431: try to find unhealthy node unhealthy-gpu-worker2 [ info ] 2022/09/05 13:42:11 nodeinfo.go:397: available GPU list map[0:10288] before removing unhealty GPUs [ info ] 2022/09/05 13:42:11 nodeinfo.go:402: available GPU list map[0:10288] after removing unhealty GPUs [ debug ] 2022/09/05 13:42:11 nodeinfo.go:162: AvailableGPUs: map[0:10288] in node worker2 [ info ] 2022/09/05 13:42:11 gpushare-predicate.go:31: The pod p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx in the namespace ai-model can be scheduled on worker2 [ info ] 2022/09/05 13:42:11 routes.go:93: gpusharingfilter extenderFilterResult = {"Nodes":null,"NodeNames":["worker1","worker2"],"FailedNodes":{},"Error":""} [ debug ] 2022/09/05 13:42:11 routes.go:162: /gpushare-scheduler/filter response=&{0xc420fcc140 0xc420fd2000 0xc420fc2480 0x565cc0 true false false false 0xc420fc2580 {0xc4203d2a80 map[Content-Type:[application/json]] false false} map[Content-Type:[application/json]] true 76 -1 200 false false [] 0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0] [0 0 0] 0xc421faed90 0} [ debug ] 2022/09/05 13:42:11 routes.go:160: /gpushare-scheduler/bind request body = &{0xc421206c80 <nil> <nil> false true {0 0} false false false 0x69c120} [ debug ] 
2022/09/05 13:42:11 routes.go:121: gpusharingBind ExtenderArgs ={p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx ai-model 4c24f578-1a0c-417b-adb0-76b0b2604c06 worker1} [ info ] 2022/09/05 13:42:11 cache.go:160: GetNodeInfo() uses the existing nodeInfo for worker1 [ debug ] 2022/09/05 13:42:11 cache.go:162: node worker1 with devices map[0:0xc4203babc0] [ info ] 2022/09/05 13:42:11 nodeinfo.go:184: Allocate() ----Begin to allocate GPU for gpu mem for pod p-00c54b7a-d015-11ec-8db7-12cc16fb82ca-6f8f5c45dc-j4zxx in ns ai-model----

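After /gpushare-scheduler/filter finishes, the log output above shows the pod being scheduled straight to worker1, even though worker2 actually has far more free resources than worker1. Is there any way to solve this? Is it a deployment problem or a bug in the code?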
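To make the numbers in the log easier to follow, here is a minimal Go sketch of the arithmetic the filter step appears to perform. It is not the extender's actual code: the `nodeGPU` type and the `available`, `fits`, and `filter` functions are hypothetical names made up for illustration. Only the values come from the log: each node has one device of 12288, worker1 has 10000 in use, worker2 has 2000 in use, and the pod requests 1000.

```go
package main

import "fmt"

// nodeGPU is a hypothetical, simplified stand-in for the per-node state
// that appears in the log (GPU memory per device index).
type nodeGPU struct {
	name  string
	total map[int]int // device index -> total GPU memory
	used  map[int]int // device index -> GPU memory already assigned to pods
}

// available returns the free GPU memory per device (total minus used).
func (n nodeGPU) available() map[int]int {
	free := make(map[int]int, len(n.total))
	for idx, t := range n.total {
		free[idx] = t - n.used[idx]
	}
	return free
}

// fits reports whether at least one device on the node can hold the request.
func (n nodeGPU) fits(request int) bool {
	for _, free := range n.available() {
		if free >= request {
			return true
		}
	}
	return false
}

// filter keeps every node that can hold the request. It only answers
// "yes" or "no" per node and does not rank the nodes that pass.
func filter(nodes []nodeGPU, request int) []string {
	var passed []string
	for _, n := range nodes {
		if n.fits(request) {
			passed = append(passed, n.name)
		}
	}
	return passed
}

func main() {
	// Values taken from the log: getUsedGPUs map[0:10000] on worker1,
	// map[0:2000] on worker2, and 12288 per device.
	nodes := []nodeGPU{
		{name: "worker1", total: map[int]int{0: 12288}, used: map[int]int{0: 10000}},
		{name: "worker2", total: map[int]int{0: 12288}, used: map[int]int{0: 2000}},
	}
	fmt.Println(nodes[0].available()) // map[0:2288]
	fmt.Println(nodes[1].available()) // map[0:10288]
	fmt.Println(filter(nodes, 1000))  // [worker1 worker2]
}
```

Under these assumptions both nodes pass, which matches the `extenderFilterResult` in the log (`"NodeNames":["worker1","worker2"]`). The choice of worker1 only shows up in the later `/gpushare-scheduler/bind` request, so it is made after the filter step. In a standard Kubernetes scheduler-extender setup, ranking the filtered nodes is the job of the scheduler's scoring phase (or of an extender prioritize verb, if one is configured), not of the filter call.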

whybeyoung commented 1 year ago

I'm seeing the same error here.