AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0
1.36k stars 303 forks

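On a single machine with two GPUs, the scheduler reports binding pods to different GPUs, but in practice they are all scheduled onto one GPU #184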

Open 1003111014 opened 1 year ago

1003111014 commented 1 year ago

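Screenshots of the scheduler log show that allocations were made on both GPUs, but checking GPU memory with nvidia-smi shows that everything was actually scheduled onto the first GPU. (Screenshots: screenshot-20220825-100950, screenshot-20220825-101009)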
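One detail that can be read off the pod dumps in the log further down: every container carries the environment variable `NVIDIA_VISIBLE_DEVICES` set to `all`, while the scheduler records a specific GPU in the `ALIYUN_COM_GPU_MEM_IDX` annotation. Whether this explains the symptom is not established in this thread. The sketch below is only a hypothetical helper (the function name and the dict layout are assumptions, not part of gpushare-scheduler-extender) that lists pods where the two values disagree, so they can be inspected first.

```python
# Hypothetical diagnostic helper: flag pods whose container env exposes a
# different set of GPUs than the index the scheduler wrote into the annotation.
# The dict layout is an assumption made for this sketch.

def visibility_mismatches(pods):
    """Return names of pods where NVIDIA_VISIBLE_DEVICES differs from the assigned index."""
    flagged = []
    for pod in pods:
        idx = pod["annotations"].get("ALIYUN_COM_GPU_MEM_IDX")
        visible = pod["env"].get("NVIDIA_VISIBLE_DEVICES")
        # Only compare when both values are present.
        if idx is not None and visible is not None and visible != idx:
            flagged.append(pod["name"])
    return flagged


# The first two entries use values copied from the log; the third is a
# made-up pod included only to show a non-flagged case.
PODS = [
    {
        "name": "p-f407d20e-239d-11ed-8a9a-b2a44a4d350f-8556669ddf-w9crj",
        "annotations": {"ALIYUN_COM_GPU_MEM_IDX": "0"},
        "env": {"NVIDIA_VISIBLE_DEVICES": "all"},
    },
    {
        "name": "p-3bdd5ad6-1d54-11ed-bf40-dabc71e5c302-669f6bd48-bhvkr",
        "annotations": {"ALIYUN_COM_GPU_MEM_IDX": "1"},
        "env": {"NVIDIA_VISIBLE_DEVICES": "all"},
    },
    {
        "name": "hypothetical-pod-pinned-to-gpu-1",
        "annotations": {"ALIYUN_COM_GPU_MEM_IDX": "1"},
        "env": {"NVIDIA_VISIBLE_DEVICES": "1"},
    },
]

if __name__ == "__main__":
    for name in visibility_mismatches(PODS):
        print("check GPU visibility for", name)
```

Both pods taken from the log are flagged, which is consistent with the report but does not by itself prove the cause.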

screenshot-20220825-100303
getUsedGPUs: map[0:1000 1:4512] in node worker2, and devs map[1:0xc420629b00 0:0xc420629ae0]
[ info ] 2022/08/25 01:57:36 nodeinfo.go:431: try to find unhealthy node unhealthy-gpu-worker2
[ info ] 2022/08/25 01:57:36 nodeinfo.go:397: available GPU list map[0:6979 1:3467] before removing unhealty GPUs
[ info ] 2022/08/25 01:57:36 nodeinfo.go:402: available GPU list map[0:6979 1:3467] after removing unhealty GPUs
[ debug ] 2022/08/25 01:57:36 nodeinfo.go:162: AvailableGPUs: map[0:6979 1:3467] in node worker2
[ info ] 2022/08/25 01:57:36 gpushare-predicate.go:31: The pod p-0011eac6-1f84-11ed-b00d-ee11b204c376-78dfc9cd48-npk9b in the namespace ai-model can be scheduled on worker2
[ info ] 2022/08/25 01:57:36 routes.go:93: gpusharingfilter extenderFilterResult = {"Nodes":null,"NodeNames":["worker2"],"FailedNodes":{},"Error":""}
[ debug ] 2022/08/25 01:57:36 routes.go:162: /gpushare-scheduler/filter response=&{0xc4203820a0 0xc420402300 0xc420e45b80 0x565cc0 true false false false 0xc420e45d00 {0xc4202fe000 map[Content-Type:[application/json]] false false} map[Content-Type:[application/json]] true 66 -1 200 false false [] 0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0] [0 0 0] 0xc42037ad90 0}
[ debug ] 2022/08/25 01:57:36 routes.go:160: /gpushare-scheduler/bind request body = &{0xc420e78b20 <nil> <nil> false true {0 0} false false false 0x69c120}
[ debug ] 2022/08/25 01:57:36 routes.go:121: gpusharingBind ExtenderArgs ={p-0011eac6-1f84-11ed-b00d-ee11b204c376-78dfc9cd48-npk9b ai-model cb2866b3-bb73-47ca-898e-1652d41d79c0 worker2}
[ info ] 2022/08/25 01:57:36 cache.go:160: GetNodeInfo() uses the existing nodeInfo for worker2
[ debug ] 2022/08/25 01:57:36 cache.go:162: node worker2 with devices map[0:0xc420629ae0 1:0xc420629b00]
[ info ] 2022/08/25 01:57:36 nodeinfo.go:184: Allocate() ----Begin to allocate GPU for gpu mem for pod p-0011eac6-1f84-11ed-b00d-ee11b204c376-78dfc9cd48-npk9b in ns ai-model----
[ info ]
2022/08/25 01:57:36 nodeinfo.go:423: getAllGPUs: map[0:7979 1:7979] in node worker2, and dev map[0:0xc420629ae0 1:0xc420629b00] [ debug ] 2022/08/25 01:57:36 deviceinfo.go:42: GetUsedGPUMemory() podMap map[7c8baa1c-38d3-426c-b613-6673ef6cdd03:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-f407d20e-239d-11ed-8a9a-b2a44a4d350f-8556669ddf-w9crj,GenerateName:p-f407d20e-239d-11ed-8a9a-b2a44a4d350f-8556669ddf-,Namespace:ai-model,SelfLink:,UID:7c8baa1c-38d3-426c-b613-6673ef6cdd03,ResourceVersion:87201592,Generation:0,CreationTimestamp:2022-08-24 11:43:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: f407d20e-239d-11ed-8a9a-b2a44a4d350f,pod-template-hash: 8556669ddf,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1661341380164154462,ALIYUN_COM_GPU_MEM_DEV: 7979,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,cni.projectcalico.org/podIP: 10.42.1.242/32,cni.projectcalico.org/podIPs: 10.42.1.242/32,},OwnerReferences:[{apps/v1 ReplicaSet p-f407d20e-239d-11ed-8a9a-b2a44a4d350f-8556669ddf 2c580ca3-1b79-46f3-a869-15dd1fedc74c 0xc4206ba7a0 0xc4206ba7a1}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-qg7xz {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-qg7xz,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-f407d20e-239d-11ed-8a9a-b2a44a4d350f reg.tx.com/ai/offline_function_tensorrt/clamp:amd-2.6 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio.minio:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTFXXX nil} {MINIO_SECRET_KEY wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAXXXX nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.lis-test:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] 
map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-qg7xz true /var/run/secrets/kubernetes.io/serviceaccount <n il>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker2,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc4206ba7b0} {node.kubernetes.io/unreachable Exists NoExecute 0xc4206ba7b8}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-08-24 11:43:00 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-08-24 11:43:00 +0000 UTC ContainersNotReady containers with unready status: [c-f407d20e-239d-11ed-8a9a-b2a44a4d350f]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-08-24 11:43:00 +0000 UTC ContainersNotReady containers with unready status: [c-f407d20e-239d-11ed-8a9a-b2a44a4d350f]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-08-24 11:43:00 +0000 UTC }],Message:,Reason:,HostIP:192.168.3.15,PodIP:,StartTime:2022-08-24 11:43:00 +0000 UTC,ContainerStatuses:[{c-f407d20e-239d-11ed-8a9a-b2a44a4d350f {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 reg.tx.com/ai/offline_function_tensorrt/clamp:amd-2.6 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},}], and its address is 0xc420629ae0 [ debug ] 2022/08/25 01:57:36 pod.go:107: pod 
p-f407d20e-239d-11ed-8a9a-b2a44a4d350f-8556669ddf-w9crj in ns ai-model with status Pending has GPU Mem 1000 [ debug ] 2022/08/25 01:57:36 deviceinfo.go:42: GetUsedGPUMemory() podMap map[2c38f29c-ecb0-4991-99e5-f6daf12e8e43:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-3bdd5ad6-1d54-11ed-bf40-dabc71e5c302-669f6bd48-bhvkr,GenerateName:p-3bdd5ad6-1d54-11ed-bf40-dabc71e5c302-669f6bd48-,Namespace:ai-model,SelfLink:,UID:2c38f29c-ecb0-4991-99e5-f6daf12e8e43,ResourceVersion:87185439,Generation:0,CreationTimestamp:2022-08-24 11:08:30 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 3bdd5ad6-1d54-11ed-bf40-dabc71e5c302,pod-template-hash: 669f6bd48,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1661339310868533391,ALIYUN_COM_GPU_MEM_DEV: 7979,ALIYUN_COM_GPU_MEM_IDX: 1,ALIYUN_COM_GPU_MEM_POD: 1000,cni.projectcalico.org/podIP: 10.42.1.231/32,cni.projectcalico.org/podIPs: 10.42.1.231/32,},OwnerReferences:[{apps/v1 ReplicaSet p-3bdd5ad6-1d54-11ed-bf40-dabc71e5c302-669f6bd48 7d75cebd-fd97-4f53-a892-01655fba71f2 0xc42069ab30 0xc42069ab31}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-qg7xz {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-qg7xz,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-3bdd5ad6-1d54-11ed-bf40-dabc71e5c302 reg.tx.com/ai/offline_function_tensorrt/bird_nest:amd-2.2 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio.minio:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTFXXX nil} {MINIO_SECRET_KEY wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAXXXX nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.lis-test:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} 
{<nil>} 1k DecimalSI}]} [{default-token-qg7xz true /var/run/secrets/kubernetes.io/serviceaccount <n il>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker2,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc42069ab40} {node.kubernetes.io/unreachable Exists NoExecute 0xc42069ab48}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-08-24 11:08:30 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-08-24 11:08:30 +0000 UTC ContainersNotReady containers with unready status: [c-3bdd5ad6-1d54-11ed-bf40-dabc71e5c302]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-08-24 11:08:30 +0000 UTC ContainersNotReady containers with unready status: [c-3bdd5ad6-1d54-11ed-bf40-dabc71e5c302]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-08-24 11:08:30 +0000 UTC }],Message:,Reason:,HostIP:192.168.3.15,PodIP:,StartTime:2022-08-24 11:08:30 +0000 UTC,ContainerStatuses:[{c-3bdd5ad6-1d54-11ed-bf40-dabc71e5c302 {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 reg.tx.com/ai/offline_function_tensorrt/bird_nest:amd-2.2 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 
030a837d-cdc8-49ff-9b4b-1267bd6540a2:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-427f8ad8-1d56-11ed-bf40-dabc71e5c302-7777bdd858-rg894,GenerateName:p-427f8ad8-1d56-11ed-bf40-dabc71e5c302-7777bdd858-,Namespace:ai-model,SelfLink: ,UID:030a837d-cdc8-49ff-9b4b-1267bd6540a2,ResourceVersion:87185550,Generation:0,CreationTimestamp:2022-08-24 11:08:42 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 427f8ad8-1d56-11ed-bf40-dabc71e5c302,pod-template-hash: 7777bdd858,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1661339322298410169,ALIYUN_COM_GPU_MEM_DEV: 7979,ALIYUN_COM_GPU_MEM_IDX: 1,ALIYUN_COM_GPU_MEM_POD: 1000,cni.projectcalico.org/podIP: 10.42.1.232/32,cni.projectcalico.org/podIPs: 10.42.1.232/32,},OwnerReferences:[{apps/v1 ReplicaSet p-427f8ad8-1d56-11ed-bf40-dabc71e5c302-7777bdd858 1ff29496-f9aa-4f20-b10a-07cbc2c52924 0xc422290c00 0xc422290c01}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-qg7xz {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-qg7xz,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-427f8ad8-1d56-11ed-bf40-dabc71e5c302 reg.tx.com/ai/offline_function_tensorrt/rust:amd-2.2 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio.minio:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTFXXX nil} {MINIO_SECRET_KEY wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAXXXX nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.lis-test:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-qg7xz true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false 
false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker2,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinu xOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc422290c10} {node.kubernetes.io/unreachable Exists NoExecute 0xc422290c18}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-08-24 11:08:42 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-08-24 11:08:42 +0000 UTC ContainersNotReady containers with unready status: [c-427f8ad8-1d56-11ed-bf40-dabc71e5c302]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-08-24 11:08:42 +0000 UTC ContainersNotReady containers with unready status: [c-427f8ad8-1d56-11ed-bf40-dabc71e5c302]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-08-24 11:08:42 +0000 UTC }],Message:,Reason:,HostIP:192.168.3.15,PodIP:,StartTime:2022-08-24 11:08:42 +0000 UTC,ContainerStatuses:[{c-427f8ad8-1d56-11ed-bf40-dabc71e5c302 {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 reg.tx.com/ai/offline_function_tensorrt/rust:amd-2.2 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 
79ee0235-6df0-4d9c-9099-bf5512da3171:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:binpack-1-6f4d6d4ff5-bw2b6,GenerateName:binpack-1-6f4d6d4ff5-,Namespace:default,SelfLink:,UID:79ee0235-6df0-4d9c-9099-bf5512da3171,ResourceVersion:86588075,Generation:0,CreationTimestamp:2022-08-23 11:13:24 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: binpack-1,pod-template-hash: 6f4d6d4ff5,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1661255100544398707,ALIYUN_COM_GPU_MEM_DEV: 7979,ALIYUN_COM_GPU_MEM_IDX: 1,ALIYUN_COM_GPU_MEM_POD: 512,cattle.io/timestamp: 2022-04-26T07:39:18Z,field.cattle.io/ports: [[{"containerPort":7070,"dnsName":"binpack-1","hostPort":0,"kind":"ClusterIP","name":"7070tcp02","protocol":"TCP","sourcePort":0}]],},OwnerReferences:[{apps/v1 ReplicaSet binpack-1-6f4d6d4ff5 e8b1ff38-189a-4147-8627-102a7dc80272 0xc4210bd580 0xc4210bd581}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-mjrc5 {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-mjrc5,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{binpack-1 reg.tx.com/ai/offline_function_tensorrt/light:amd-2.8 [] [] [{7070tcp02 0 7070 TCP }] [] [{CGPU_DISABLE true nil} {DATA_CALLBACK_URL http://ai-service.lis-dev:8000/ai/model/process/data nil} {FLASK_PORT 7070 nil} {KAFKA_IP 172.16.2.88:9092 nil} {KAFKA_TOPIC_RECIEVE ai.train.report nil} {KAFKA_TOPIC_SEND ai.train nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTF nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_ENDPOINT minio-dev.bot-patrol.com nil} {MINIO_SECRET_KEY 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 nil} {MINIO_SECURE False nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{512 0} {<nil>} 512 DecimalSI}] map[aliyun.com/gpu-mem:{{512 0} {<nil>} 512 DecimalSI}]} [{default-token-mjrc5 true 
/var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent SecurityContext{Capabilities:&Capabilities{Add:[],Drop:[],},Privileged:nil,SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,ReadOnlyRootFilesystem:nil,AllowPrivilegeEscalation:nil,RunAsGroup:nil,} false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker2,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FS Group:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc4210bd590} {node.kubernetes.io/unreachable Exists NoExecute 0xc4210bd598}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:&PodDNSConfig{Nameservers:[],Searches:[],Options:[],},ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-08-23 11:45:00 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-08-23 11:45:00 +0000 UTC ContainersNotReady containers with unready status: [binpack-1]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-08-23 11:45:00 +0000 UTC ContainersNotReady containers with unready status: [binpack-1]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-08-23 11:45:00 +0000 UTC }],Message:,Reason:,HostIP:192.168.3.15,PodIP:,StartTime:2022-08-23 11:45:00 +0000 UTC,ContainerStatuses:[{binpack-1 {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 reg.tx.com/ai/offline_function_tensorrt/light:amd-2.8 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 
cb9d1880-5022-4a30-9dd7-e6e047675497:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-73862152-1d4a-11ed-bf40-dabc71e5c302-954b9cbd5-zfwsx,GenerateName:p-73862152-1d4a-11ed-bf40-dabc71e5c302-954b9cbd5-,Namespace:ai-model,SelfLink:,UID:cb9d1880-5022-4a30-9dd7-e6e047675497,ResourceVersion:87122660,Generation:0,CreationTimestamp:2022-08-24 08:38:20 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 73862152-1d4a-11ed-bf40-dabc71e5c302,pod-template-hash: 954b9cbd5,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1661330300863721341,ALIYUN_COM_GPU_MEM_DEV: 7979,ALIYUN_COM_GPU_MEM_IDX: 1,ALIYUN_COM_GPU_MEM_POD: 1000,},OwnerReference s:[{apps/v1 ReplicaSet p-73862152-1d4a-11ed-bf40-dabc71e5c302-954b9cbd5 ca34bd09-1d44-4d88-a484-6d22cf477940 0xc420738ea0 0xc420738ea1}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-qg7xz {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-qg7xz,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-73862152-1d4a-11ed-bf40-dabc71e5c302 reg.tx.com/ai/offline_function_tensorrt/meter_sf6:amd-2.2 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio.minio:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTFXXX nil} {MINIO_SECRET_KEY wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAXXXX nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.lis-test:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-qg7xz true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false 
false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker2,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc420738eb0} {node.kubernetes.io/unreachable Exists NoExecute 0xc420738eb8}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 202 08:38:20 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-08-24 08:38:20 +0000 UTC ContainersNotReady containers with unready status: [c-73862152-1d4a-11ed-bf40-dabc71e5c302]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-08-24 08:38:20 +0000 UTC ContainersNotReady containers with unready status: [c-73862152-1d4a-11ed-bf40-dabc71e5c302]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-08-24 08:38:20 +0000 UTC }],Message:,Reason:,HostIP:192.168.3.15,PodIP:,StartTime:2022-08-24 08:38:20 +0000 UTC,ContainerStatuses:[{c-73862152-1d4a-11ed-bf40-dabc71e5c302 {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 reg.tx.com/ai/offline_function_tensorrt/meter_sf6:amd-2.2 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 
a2b53218-05cf-4d93-a8e3-d3b1945e1058:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-7425d432-1d49-11ed-bf40-dabc71e5c302-7c8b4b458f-4qnx7,GenerateName:p-7425d432-1d49-11ed-bf40-dabc71e5c302-7c8b4b458f-,Namespace:ai-model,SelfLink:,UID:a2b53218-05cf-4d93-a8e3-d3b1945e1058,ResourceVersion:87122965,Generation:0,CreationTimestamp:2022-08-24 08:38:55 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 7425d432-1d49-11ed-bf40-dabc71e5c302,pod-template-hash: 7c8b4b458f,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1661330335577141303,ALIYUN_COM_GPU_MEM_DEV: 7979,ALIYUN_COM_GPU_MEM_IDX: 1,ALIYUN_COM_GPU_MEM_POD: 1000,cni.projectcalico.org/podIP: 10.42.1.230/32,cni.projectcalico.org/podIPs: 10.42.1.230/32,},OwnerReferences:[{apps/v1 ReplicaSet p-7425d432-1d49-11ed-bf40-dabc71e5c302-7c8b4b458f eaf357e4-6727-46ab-8222-eb8cbfc8c3ac 0xc420adfc20 0xc420adfc21}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-qg7xz {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-qg7xz,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil ni l nil nil}}],Containers:[{c-7425d432-1d49-11ed-bf40-dabc71e5c302 reg.tx.com/ai/offline_function_tensorrt/respirator:amd-2.2 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio.minio:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTFXXX nil} {MINIO_SECRET_KEY wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAXXXX nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.lis-test:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-qg7xz true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false 
false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker2,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc420adfc30} {node.kubernetes.io/unreachable Exists NoExecute 0xc420adfc38}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-08-24 08:38:55 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-08-24 08:38:55 +0000 UTC ContainersNotReady containers with unready status: [c-7425d432-1d49-11ed-bf40-dabc71e5c302]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-08-24 08:38:55 +0000 UTC ContainersNotReady containers with unready status: [c-7425d432-1d49-11ed-bf40-dabc71e5c302]} {PodSche duled True 0001-01-01 00:00:00 +0000 UTC 2022-08-24 08:38:55 +0000 UTC }],Message:,Reason:,HostIP:192.168.3.15,PodIP:,StartTime:2022-08-24 08:38:55 +0000 UTC,ContainerStatuses:[{c-7425d432-1d49-11ed-bf40-dabc71e5c302 {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 reg.tx.com/ai/offline_function_tensorrt/respirator:amd-2.2 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},}], and its address is 0xc420629b00 [ debug ] 2022/08/25 01:57:36 pod.go:107: pod binpack-1-6f4d6d4ff5-bw2b6 in ns default with status Pending has GPU Mem 512 [ debug ] 2022/08/25 01:57:36 pod.go:107: pod p-73862152-1d4a-11ed-bf40-dabc71e5c302-954b9cbd5-zfwsx in ns ai-model with status 
Pending has GPU Mem 1000
[ debug ] 2022/08/25 01:57:36 pod.go:107: pod p-7425d432-1d49-11ed-bf40-dabc71e5c302-7c8b4b458f-4qnx7 in ns ai-model with status Pending has GPU Mem 1000
[ debug ] 2022/08/25 01:57:36 pod.go:107: pod p-3bdd5ad6-1d54-11ed-bf40-dabc71e5c302-669f6bd48-bhvkr in ns ai-model with status Pending has GPU Mem 1000
[ debug ] 2022/08/25 01:57:36 pod.go:107: pod p-427f8ad8-1d56-11ed-bf40-dabc71e5c302-7777bdd858-rg894 in ns ai-model with status Pending has GPU Mem 1000
[ info ] 2022/08/25 01:57:36 nodeinfo.go:413: getUsedGPUs: map[0:1000 1:4512] in node worker2, and devs map[0:0xc420629ae0 1:0xc420629b00]
[ info ] 2022/08/25 01:57:36 nodeinfo.go:431: try to find unhealthy node unhealthy-gpu-worker2
[ info ] 2022/08/25 01:57:36 nodeinfo.go:397: available GPU list map[0:6979 1:3467] before removing unhealty GPUs
[ info ] 2022/08/25 01:57:36 nodeinfo.go:402: available GPU list map[0:6979 1:3467] after removing unhealty GPUs
[ info ] 2022/08/25 01:57:36 nodeinfo.go:321: reqGPU for pod p-0011eac6-1f84-11ed-b00d-ee11b204c376-78dfc9cd48-npk9b in ns ai-model: 1000
[ info ] 2022/08/25 01:57:36 nodeinfo.go:322: AvailableGPUs: map[0:6979 1:3467] in node worker2
[ info ] 2022/08/25 01:57:36 nodeinfo.go:372: Find candidate dev id 0 for pod p-0011eac6-1f84-11ed-b00d-ee11b204c376-78dfc9cd48-npk9b in ns ai-model successfully.
[ info ] 2022/08/25 01:57:36 nodeinfo.go:188: Allocate() 1. Allocate GPU ID 0 to pod p-0011eac6-1f84-11ed-b00d-ee11b204c376-78dfc9cd48-npk9b in ns ai-model.----
[ info ] 2022/08/25 01:57:36 nodeinfo.go:227: Allocate() 2.
Try to bind pod p-0011eac6-1f84-11ed-b00d-ee11b204c376-78dfc9cd48-npk9b in ai-model namespace to node with &Binding{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-0011eac6-1f84-11ed-b00d-ee11b204c376-78dfc9cd48-npk9b,GenerateName:,Namespace:,SelfLink:,UID:cb2866b3-bb73-47ca-898e-1652d41d79c0,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,},Target:ObjectReference{Kind:Node,Namespace:,Name:worker2,UID:,APIVersion:,ResourceVersion:,FieldPath:,},} [ info ] 2022/08/25 01:57:36 controller.go:297: Need to update pod name p-0011eac6-1f84-11ed-b00d-ee11b204c376-78dfc9cd48-npk9b in ns ai-model and old status is Pending, new status is Pending; its old annotation map[] and new annotation map[ALIYUN_COM_GPU_MEM_IDX:0 ALIYUN_COM_GPU_MEM_POD:1000 ALIYUN_COM_GPU_MEM_ASSIGNED:false ALIYUN_COM_GPU_MEM_ASSUME_TIME:1661392656240237376 ALIYUN_COM_GPU_MEM_DEV:7979] [ info ] 2022/08/25 01:57:36 nodeinfo.go:241: Allocate() 3. 
Try to add pod p-0011eac6-1f84-11ed-b00d-ee11b204c376-78dfc9cd48-npk9b in ns ai-model to dev 0 [ debug ] 2022/08/25 01:57:36 deviceinfo.go:57: dev.addPod() Pod p-0011eac6-1f84-11ed-b00d-ee11b204c376-78dfc9cd48-npk9b in ns ai-model with the GPU ID 0 will be added to device map [ debug ] 2022/08/25 01:57:36 deviceinfo.go:64: dev.addPod() after updated is map[cb2866b3-bb73-47ca-898e-1652d41d79c0:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-0011eac6-1f84-11ed-b00d-ee11b204c376-78dfc9cd48-npk9b,GenerateName:p-0011eac6-1f84-11ed-b00d-ee11b204c376-78dfc9cd48-,Namespace:ai-model,SelfLink:,UID:cb2866b3-bb73-47ca-898e-1652d41d79c0,ResourceVersion:87560153,Generation:0,CreationTimestamp:2022-08-25 01:57:36 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: 0011eac6-1f84-11ed-b00d-ee11b204c376,pod-template-hash: 78dfc9cd48,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: false,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1661392656240237376,ALIYUN_COM_GPU_MEM_DEV: 7979,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 1000,},OwnerReferences:[{apps/v1 ReplicaSet p-0011eac6-1f84-11ed-b00d-ee11b204c376-78dfc9cd48 577b8c42-2fa6-4bfc-9eb2-4658deff6bea 0xc420eda3f7 0xc420eda3f8}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-qg7xz {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-qg7xz,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-0011eac6-1f84-11ed-b00d-ee11b204c376 reg.tx.com/ai/offline_function_tensorrt/toggle_switch:amd-2.3 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio.minio:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTFXXX nil} {MINIO_SECRET_KEY wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAXXXX nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.lis-test:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all 
nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-qg7xz true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc420edab40} {node.kubernetes.io/unreachable Exists NoExecute 0xc420edab60}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[],Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} 7c8baa1c-38d3-426c-b613-6673ef6cdd03:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:p-f407d20e-239d-11ed-8a9a-b2a44a4d350f-8556669ddf-w9crj,GenerateName:p-f407d20e-239d-11ed-8a9a-b2a44a4d350f-8556669ddf-,Namespace:ai-model,SelfLink:,UID:7c8baa1c-38d3-426c-b613-6673ef6cdd03,ResourceVersion:87201592,Generation:0,CreationTimestamp:2022-08-24 11:43:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: f407d20e-239d-11ed-8a9a-b2a44a4d350f,pod-template-hash: 8556669ddf,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: true,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1661341380164154462,ALIYUN_COM_GPU_MEM_DEV: 7979,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 
1000,cni.projectcalico.org/podIP: 10.42.1.242/32,cni.projectcalico.org/podIPs: 10.42.1.242/32,},OwnerReferences:[{apps/v1 ReplicaSet p-f407d20e-239d-11ed-8a9a-b2a44a4d350f-8556669ddf 2c580ca3-1b79-46f3-a869-15dd1fedc74c 0xc4206ba7a0 0xc4206ba7a1}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-qg7xz {nil n il nil nil nil SecretVolumeSource{SecretName:default-token-qg7xz,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{c-f407d20e-239d-11ed-8a9a-b2a44a4d350f reg.tx.com/ai/offline_function_tensorrt/clamp:amd-2.6 [] [] [{port-7070 0 7070 TCP }] [] [{MINIO_ENDPOINT minio.minio:9000 nil} {MINIO_ACCESS_KEY AKIAIOSFODNN7EXAMGTFXXX nil} {MINIO_SECRET_KEY wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAXXXX nil} {MINIO_BUCKET_NAME aibucket nil} {MINIO_SECURE False nil} {DATA_CALLBACK_URL http://ai-service.lis-test:8000/ai/model/process/data nil} {NVIDIA_VISIBLE_DEVICES all nil}] {map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}] map[aliyun.com/gpu-mem:{{1 3} {<nil>} 1k DecimalSI}]} [{default-token-qg7xz true /var/run/secrets/kubernetes.io/serviceaccount <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:worker2,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists NoExecute 0xc4206ba7b0} {node.kubernetes.io/unreachable Exists NoExecute 
0xc4206ba7b8}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-08-24 11:43:00 +0000 UTC } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-08-24 11:43:00 +0000 UTC ContainersNotReady containers with unready status: [c-f407d20e-239d-11ed-8a9a-b2a44a4d350f]} {ContainersReady False 0001-01-01 00:00: +0000 UTC 2022-08-24 11:43:00 +0000 UTC ContainersNotReady containers with unready status: [c-f407d20e-239d-11ed-8a9a-b2a44a4d350f]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-08-24 11:43:00 +0000 UTC }],Message:,Reason:,HostIP:192.168.3.15,PodIP:,StartTime:2022-08-24 11:43:00 +0000 UTC,ContainerStatuses:[{c-f407d20e-239d-11ed-8a9a-b2a44a4d350f {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 reg.tx.com/ai/offline_function_tensorrt/clamp:amd-2.6 }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},}], and its address is 0xc420629ae0 [ info ] 2022/08/25 01:57:36 nodeinfo.go:252: Allocate() ----End to allocate GPU for gpu mem for pod p-0011eac6-1f84-11ed-b00d-ee11b204c376-78dfc9cd48-npk9b in ns ai-model----
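
The per-device numbers in the log above can be cross-checked by hand from the pod annotations. The following is a minimal sketch, not the extender's actual code: it sums `ALIYUN_COM_GPU_MEM_POD` per `ALIYUN_COM_GPU_MEM_IDX` for the six pods listed in the two `GetUsedGPUMemory()` dumps and subtracts the result from the 7979 reported by `getAllGPUs`. The function names are made up for illustration.

```python
from collections import defaultdict

# Capacity per device, as reported by "getAllGPUs: map[0:7979 1:7979]".
TOTAL_PER_DEVICE = {0: 7979, 1: 7979}

# (pod name, ALIYUN_COM_GPU_MEM_IDX, ALIYUN_COM_GPU_MEM_POD), copied from the
# annotations of the pods already assigned before the new pod is bound.
PODS = [
    ("p-f407d20e-239d-11ed-8a9a-b2a44a4d350f-8556669ddf-w9crj", 0, 1000),
    ("p-3bdd5ad6-1d54-11ed-bf40-dabc71e5c302-669f6bd48-bhvkr", 1, 1000),
    ("p-427f8ad8-1d56-11ed-bf40-dabc71e5c302-7777bdd858-rg894", 1, 1000),
    ("binpack-1-6f4d6d4ff5-bw2b6", 1, 512),
    ("p-73862152-1d4a-11ed-bf40-dabc71e5c302-954b9cbd5-zfwsx", 1, 1000),
    ("p-7425d432-1d49-11ed-bf40-dabc71e5c302-7c8b4b458f-4qnx7", 1, 1000),
]


def used_per_device(pods):
    """Sum the requested GPU memory of all pods per assigned device index."""
    used = defaultdict(int)
    for _name, idx, mem in pods:
        used[idx] += mem
    return dict(used)


def available_per_device(total, used):
    """Subtract the used amount from each device's capacity."""
    return {idx: cap - used.get(idx, 0) for idx, cap in total.items()}


if __name__ == "__main__":
    used = used_per_device(PODS)
    print(used)                                          # {0: 1000, 1: 4512}
    print(available_per_device(TOTAL_PER_DEVICE, used))  # {0: 6979, 1: 3467}
```

The results match the `getUsedGPUs` and `available GPU list` lines, so the scheduler's own accounting is internally consistent: it believes five pods sit on GPU 1. The discrepancy reported in this issue is therefore between that accounting and what nvidia-smi shows on the node.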

1003111014 commented 1 year ago

Code used: https://github.com/chenrulongmaster/gpushare-scheduler-extender