AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

Error when a pod contains multiple containers: "unknown device id: no-gpu-has-5MiB-to-run" #185

Open serend1p1ty opened 2 years ago

serend1p1ty commented 2 years ago

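Great plugin, and thanks to the maintainers for the work!

A pod with a single container works fine, but a pod with multiple containers fails with the error below. What is the cause?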

Error: failed to start container "damo1": Error response from daemon: OCI runtime create failed: container_linux.go:348:starting container process caused "process_linux.go:402: container init caused \"process_linux.go:385: running prestart hook 0 caused \\\"error running hook: exit status 1, stdout: , stderr: nvidia-container-cli: device error: unknown device id: no-gpu-has-5MiB-to-run\\\\n\\\"\"": unknown

The configuration file is as follows.

apiVersion: apps/v1
kind: Deployment

metadata:
  name: damo
  labels:
    app: damo
  namespace: default

spec:
  replicas: 1

  selector: # define how the deployment finds the pods it manages
    matchLabels:
      app: damo

  template: # define the pod specification
    metadata:
      labels:
        app: damo

    spec:
      containers:
      - name: damo1
        image: nvidia/cuda:10.0-base
        resources:
          limits:
            aliyun.com/gpu-mem: 5
      - name: damo2
        image: nvidia/cuda:10.0-base
        resources:
          limits:
            aliyun.com/gpu-mem: 5
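
For reference, the pod-level total that the scheduler extender logs further down (`reqGPU for pod ... : 10`) is consistent with summing the per-container `aliyun.com/gpu-mem` limits from this manifest. A minimal sketch of that arithmetic, assuming the pod template is held as a plain Python dict; `pod_gpu_mem` is a hypothetical helper for illustration, not something the project provides:

```python
# Hypothetical helper: sum aliyun.com/gpu-mem limits across a pod's containers.
RESOURCE = "aliyun.com/gpu-mem"


def pod_gpu_mem(pod_spec: dict) -> int:
    """Return the total GPU memory requested by all containers in the pod spec."""
    total = 0
    for container in pod_spec.get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        total += int(limits.get(RESOURCE, 0))
    return total


# The pod template from the Deployment, reduced to the relevant fields.
pod_spec = {
    "containers": [
        {"name": "damo1", "resources": {"limits": {RESOURCE: 5}}},
        {"name": "damo2", "resources": {"limits": {RESOURCE: 5}}},
    ]
}

print(pod_gpu_mem(pod_spec))  # 5 + 5
```

So each container asks for 5, while the pod as a whole asks for 10. Both numbers show up again in the logs.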

gpushare-schd-extender logs

[ debug ] 2022/08/29 15:15:06 controller.go:176: begin to sync gpushare pod damo-7547d9b66d-mt2hf in ns default
[ debug ] 2022/08/29 15:15:06 cache.go:90: Add or update pod info: &Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:damo-7547d9b66d-mt2hf,GenerateName:damo-7547d9b66d-,Namespace:default,SelfLink:/api/v1/namespaces/default/pods/damo-7547d9b66d-mt2hf,UID:efebe6cf-06cf-4805-abc0-8b74d7714bd9,ResourceVersion:2222600,Generation:0,CreationTimestamp:2022-08-2915:15:06 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: damo,pod-template-hash: 7547d9b66d,},Annotations:map[string]string{},OwnerReferences:[{apps/v1 ReplicaSet damo-7547d9b66d ab9e6b18-d71f-4974-83f6-984892bfcf0f 0xc420976c4a 0xc420976c4b}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-r2dqp {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-r2dqp,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nilnil nil nil}}],Containers:[{damo1 192.168.0.181:80/library/alibaba_damo_capture:4.0.0 [] []  [] [] [{APES_ADDR 192.168.0.199 nil} {ALGO_TYPE video nil} {NVIDIA_DRIVER_CAPABILITIES compute,utility,video nil}] {map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}] map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}]} [{default-token-r2dqp true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false} {damo2 nvidia/cuda:10.0-base [] []  [] [] [{APES_ADDR 192.168.0.200 nil} {ALGO_TYPE video nil} {NVIDIA_DRIVER_CAPABILITIES compute,utility,video nil}] {map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}] map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}]} [{default-token-r2dqptrue /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false 
false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists  NoExecute 0xc420976e30} {node.kubernetes.io/unreachable Exists  NoExecute 0xc420976e50}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[],Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},}
[ debug ] 2022/08/29 15:15:06 cache.go:91: Node map[centos7-vm:0xc4206d9580]
[ debug ] 2022/08/29 15:15:06 cache.go:93: pod damo-7547d9b66d-mt2hf in ns default is not assigned to any node, skip
[  info ] 2022/08/29 15:15:06 controller.go:223: end processNextWorkItem()
[ debug ] 2022/08/29 15:15:06 routes.go:160: /gpushare-scheduler/filter request body = &{0xc42002aee0 <nil> <nil> false true {0 0} false false false 0x69bfd0}
[ debug ] 2022/08/29 15:15:06 routes.go:81: gpusharingfilter ExtenderArgs ={&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:damo-7547d9b66d-mt2hf,GenerateName:damo-7547d9b66d-,Namespace:default,SelfLink:/api/v1/namespaces/default/pods/damo-7547d9b66d-mt2hf,UID:efebe6cf-06cf-4805-abc0-8b74d7714bd9,ResourceVersion:2222600,Generation:0,CreationTimestamp:2022-08-29 15:15:06 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: damo,pod-template-hash: 7547d9b66d,},Annotations:map[string]string{},OwnerReferences:[{apps/v1 ReplicaSet damo-7547d9b66d ab9e6b18-d71f-4974-83f6-984892bfcf0f 0xc42096680a 0xc42096680b}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-r2dqp {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-r2dqp,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{damo1 192.168.0.181:80/library/alibaba_damo_capture:4.0.0 [] []  [] [] [{APES_ADDR 192.168.0.199 nil} {ALGO_TYPE video nil} {NVIDIA_DRIVER_CAPABILITIES compute,utility,video nil}] {map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}] map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}]} [{default-token-r2dqp true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false} {damo2 nvidia/cuda:10.0-base [] []  [] [] [{APES_ADDR 192.168.0.200 nil} {ALGO_TYPE video nil} {NVIDIA_DRIVER_CAPABILITIES compute,utility,video nil}] {map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}] map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}]} [{default-token-r2dqp true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false 
false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists  NoExecute 0xc4209668f0} {node.kubernetes.io/unreachable Exists  NoExecute 0xc420966910}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[],Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},} nil 0xc42002b1e0}
[ debug ] 2022/08/29 15:15:06 gpushare-predicate.go:17: check if the pod name damo-7547d9b66d-mt2hf can be scheduled on node centos7-vm
[ debug ] 2022/08/29 15:15:06 cache.go:155: GetNodeInfo() uses the existing nodeInfo for centos7-vm
[ debug ] 2022/08/29 15:15:06 nodeinfo.go:282: getAllGPUs: map[0:14 1:14] in node centos7-vm, and dev map[0:0xc4204385e0 1:0xc420438660]
[ debug ] 2022/08/29 15:15:06 deviceinfo.go:42: GetUsedGPUMemory() podMap map[], and its address is 0xc4204385e0
[ debug ] 2022/08/29 15:15:06 deviceinfo.go:42: GetUsedGPUMemory() podMap map[], and its address is 0xc420438660
[ debug ] 2022/08/29 15:15:06 nodeinfo.go:272: getUsedGPUs: map[0:0 1:0] in node centos7-vm, and devs map[0:0xc4204385e0 1:0xc420438660]
[ debug ] 2022/08/29 15:15:06 nodeinfo.go:121: AvailableGPUs: map[0:14 1:14] in node centos7-vm
[ debug ] 2022/08/29 15:15:06 gpushare-predicate.go:31: The pod damo-7547d9b66d-mt2hf in the namespace default can be scheduled on centos7-vm
[  info ] 2022/08/29 15:15:06 routes.go:93: gpusharingfilter extenderFilterResult = {"Nodes":null,"NodeNames":["centos7-vm"],"FailedNodes":{},"Error":""}
[ debug ] 2022/08/29 15:15:06 routes.go:162: /gpushare-scheduler/filter response=&{0xc4200eabe0 0xc42063ea00 0xc42076b2c0 0x565b70 true false false false 0xc42076b340 {0xc420806a80 map[Content-Type:[application/json]] false false} map[Content-Type:[application/json]] true 69 -1 200 false false [] 0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0] [0 0 0] 0xc4204207e0 0}
[ debug ] 2022/08/29 15:15:06 routes.go:160: /gpushare-scheduler/bind request body = &{0xc420600b80 <nil> <nil> false true {0 0} false false false 0x69bfd0}
[ debug ] 2022/08/29 15:15:06 routes.go:121: gpusharingBind ExtenderArgs ={damo-7547d9b66d-mt2hf default efebe6cf-06cf-4805-abc0-8b74d7714bd9 centos7-vm}
[ debug ] 2022/08/29 15:15:06 cache.go:155: GetNodeInfo() uses the existing nodeInfo for centos7-vm
[ debug ] 2022/08/29 15:15:06 nodeinfo.go:143: Allocate() ----Begin to allocate GPU for gpu mem for pod damo-7547d9b66d-mt2hf in ns default----
[ debug ] 2022/08/29 15:15:06 nodeinfo.go:282: getAllGPUs: map[0:14 1:14] in node centos7-vm, and dev map[0:0xc4204385e0 1:0xc420438660]
[ debug ] 2022/08/29 15:15:06 deviceinfo.go:42: GetUsedGPUMemory() podMap map[], and its address is 0xc4204385e0
[ debug ] 2022/08/29 15:15:06 deviceinfo.go:42: GetUsedGPUMemory() podMap map[], and its address is 0xc420438660
[ debug ] 2022/08/29 15:15:06 nodeinfo.go:272: getUsedGPUs: map[0:0 1:0] in node centos7-vm, and devs map[0:0xc4204385e0 1:0xc420438660]
[ debug ] 2022/08/29 15:15:06 nodeinfo.go:220: reqGPU for pod damo-7547d9b66d-mt2hf in ns default: 10
[ debug ] 2022/08/29 15:15:06 nodeinfo.go:221: AvailableGPUs: map[1:14 0:14] in node centos7-vm
[ debug ] 2022/08/29 15:15:06 nodeinfo.go:239: Find candidate dev id 0 for pod damo-7547d9b66d-mt2hf in ns default successfully.
[ debug ] 2022/08/29 15:15:06 nodeinfo.go:147: Allocate() 1. Allocate GPU ID 0 to pod damo-7547d9b66d-mt2hf in ns default.----
[ debug ] 2022/08/29 15:15:06 nodeinfo.go:179: Allocate() 2. Try to bind pod damo-7547d9b66d-mt2hf in default namespace to node  with &Binding{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:damo-7547d9b66d-mt2hf,GenerateName:,Namespace:,SelfLink:,UID:efebe6cf-06cf-4805-abc0-8b74d7714bd9,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,Initializers:nil,},Target:ObjectReference{Kind:Node,Namespace:,Name:centos7-vm,UID:,APIVersion:,ResourceVersion:,FieldPath:,},}
[  info ] 2022/08/29 15:15:06 controller.go:286: Need to update pod name damo-7547d9b66d-mt2hf in ns default and old status is Pending, new status is Pending; its old annotation map[] andnew annotation map[ALIYUN_COM_GPU_MEM_POD:10 ALIYUN_COM_GPU_MEM_ASSIGNED:false ALIYUN_COM_GPU_MEM_ASSUME_TIME:1661786106645608559 ALIYUN_COM_GPU_MEM_DEV:14 ALIYUN_COM_GPU_MEM_IDX:0]
[  info ] 2022/08/29 15:15:06 controller.go:286: Need to update pod name damo-7547d9b66d-mt2hf in ns default and old status is Pending, new status is Pending; its old annotation map[ALIYUN_COM_GPU_MEM_ASSIGNED:false ALIYUN_COM_GPU_MEM_ASSUME_TIME:1661786106645608559 ALIYUN_COM_GPU_MEM_DEV:14 ALIYUN_COM_GPU_MEM_IDX:0 ALIYUN_COM_GPU_MEM_POD:10] and new annotation map[ALIYUN_COM_GPU_MEM_ASSUME_TIME:1661786106645608559 ALIYUN_COM_GPU_MEM_DEV:14 ALIYUN_COM_GPU_MEM_IDX:0 ALIYUN_COM_GPU_MEM_POD:10 ALIYUN_COM_GPU_MEM_ASSIGNED:false]
[ debug ] 2022/08/29 15:15:06 nodeinfo.go:193: Allocate() 3. Try to add pod damo-7547d9b66d-mt2hf in ns default to dev 0
[ debug ] 2022/08/29 15:15:06 deviceinfo.go:57: dev.addPod() Pod damo-7547d9b66d-mt2hf in ns default with the GPU ID 0 will be added to device map
[ debug ] 2022/08/29 15:15:06 deviceinfo.go:64: dev.addPod() after updated is map[efebe6cf-06cf-4805-abc0-8b74d7714bd9:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:damo-7547d9b66d-mt2hf,GenerateName:damo-7547d9b66d-,Namespace:default,SelfLink:/api/v1/namespaces/default/pods/damo-7547d9b66d-mt2hf,UID:efebe6cf-06cf-4805-abc0-8b74d7714bd9,ResourceVersion:2222600,Generation:0,CreationTimestamp:2022-08-29 15:15:06 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: damo,pod-template-hash: 7547d9b66d,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: false,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1661786106645608559,ALIYUN_COM_GPU_MEM_DEV: 14,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 10,},OwnerReferences:[{apps/v1 ReplicaSet damo-7547d9b66d ab9e6b18-d71f-4974-83f6-984892bfcf0f 0xc420b48b98 0xc420b48b99}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-r2dqp {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-r2dqp,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{damo1 192.168.0.181:80/library/alibaba_damo_capture:4.0.0 [] []  [] [] [{APES_ADDR 192.168.0.199 nil} {ALGO_TYPE video nil} {NVIDIA_DRIVER_CAPABILITIES compute,utility,video nil}] {map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}] map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}]} [{default-token-r2dqp true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false} {damo2 nvidia/cuda:10.0-base [] []  [] [] [{APES_ADDR 192.168.0.200 nil} {ALGO_TYPE video nil} {NVIDIA_DRIVER_CAPABILITIES compute,utility,video nil}] {map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}] map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}]} [{default-token-r2dqp true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil nil 
/dev/termination-log File IfNotPresent nil false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists  NoExecute 0xc420b48ba8} {node.kubernetes.io/unreachable Exists  NoExecute 0xc420b48bb0}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[],Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},}], and its address is 0xc4204385e0
[ debug ] 2022/08/29 15:15:06 nodeinfo.go:204: Allocate() ----End to allocate GPU for gpu mem for pod damo-7547d9b66d-mt2hf in ns default----
[  info ] 2022/08/29 15:15:06 routes.go:137: extenderBindingResult = {"Error":""}
[ debug ] 2022/08/29 15:15:06 routes.go:162: /gpushare-scheduler/bind response=&{0xc4200eabe0 0xc42028a900 0xc420876680 0x565b70 true false false false 0xc42071c980 {0xc4206eaee0 map[Content-Type:[application/json]] false false} map[Content-Type:[application/json]] true 12 -1 200 false false [] 0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0] [0 0 0] 0xc420362e00 0}
[  info ] 2022/08/29 15:15:06 controller.go:286: Need to update pod name damo-7547d9b66d-mt2hf in ns default and old status is Pending, new status is Pending; its old annotation map[ALIYUN_COM_GPU_MEM_ASSIGNED:false ALIYUN_COM_GPU_MEM_ASSUME_TIME:1661786106645608559 ALIYUN_COM_GPU_MEM_DEV:14 ALIYUN_COM_GPU_MEM_IDX:0 ALIYUN_COM_GPU_MEM_POD:10] and new annotation map[ALIYUN_COM_GPU_MEM_POD:10 ALIYUN_COM_GPU_MEM_ASSIGNED:false ALIYUN_COM_GPU_MEM_ASSUME_TIME:1661786106645608559 ALIYUN_COM_GPU_MEM_DEV:14 ALIYUN_COM_GPU_MEM_IDX:0]
[  info ] 2022/08/29 15:15:07 controller.go:210: begin processNextWorkItem()
[ debug ] 2022/08/29 15:15:07 controller.go:176: begin to sync gpushare pod damo-7547d9b66d-mt2hf in ns default
[ debug ] 2022/08/29 15:15:07 cache.go:90: Add or update pod info: &Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:damo-7547d9b66d-mt2hf,GenerateName:damo-7547d9b66d-,Namespace:default,SelfLink:/api/v1/namespaces/default/pods/damo-7547d9b66d-mt2hf,UID:efebe6cf-06cf-4805-abc0-8b74d7714bd9,ResourceVersion:2222612,Generation:0,CreationTimestamp:2022-08-2915:15:06 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: damo,pod-template-hash: 7547d9b66d,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: false,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1661786106645608559,ALIYUN_COM_GPU_MEM_DEV: 14,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 10,},OwnerReferences:[{apps/v1 ReplicaSet damo-7547d9b66d ab9e6b18-d71f-4974-83f6-984892bfcf0f 0xc420b4922a 0xc420b4922b}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-r2dqp {nil nil nil nilnil SecretVolumeSource{SecretName:default-token-r2dqp,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{damo1 192.168.0.181:80/library/alibaba_damo_capture:4.0.0 [] []  [] [] [{APES_ADDR 192.168.0.199 nil} {ALGO_TYPE video nil} {NVIDIA_DRIVER_CAPABILITIES compute,utility,video nil}] {map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}] map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}]} [{default-token-r2dqp true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false} {damo2 nvidia/cuda:10.0-base [] []  [] [] [{APES_ADDR 192.168.0.200 nil} {ALGO_TYPE video nil} {NVIDIA_DRIVER_CAPABILITIES compute,utility,video nil}] {map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}] map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}]} [{default-token-r2dqp true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false 
false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:centos7-vm,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists  NoExecute 0xc420b49420} {node.kubernetes.io/unreachable Exists  NoExecute 0xc420b49440}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-08-29 15:15:06 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-08-29 15:15:06 +0000 UTC ContainersNotReady containers with unready status: [damo1 damo2]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-08-29 15:15:06 +0000 UTC ContainersNotReady containers with unready status: [damo1 damo2]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-08-29 15:15:06 +0000 UTC  }],Message:,Reason:,HostIP:10.16.0.130,PodIP:,StartTime:2022-08-29 15:15:06 +0000 UTC,ContainerStatuses:[{damo1 {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 192.168.0.181:80/library/alibaba_damo_capture:4.0.0  } {damo2 {&ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 nvidia/cuda:10.0-base  }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},}
[ debug ] 2022/08/29 15:15:07 cache.go:91: Node map[centos7-vm:0xc4206d9580]
[ debug ] 2022/08/29 15:15:07 cache.go:155: GetNodeInfo() uses the existing nodeInfo for centos7-vm
[ debug ] 2022/08/29 15:15:07 nodeinfo.go:94: addOrUpdatePod() Pod damo-7547d9b66d-mt2hf in ns default with the GPU ID 0 should be added to device map
[ debug ] 2022/08/29 15:15:07 deviceinfo.go:57: dev.addPod() Pod damo-7547d9b66d-mt2hf in ns default with the GPU ID 0 will be added to device map
[ debug ] 2022/08/29 15:15:07 deviceinfo.go:64: dev.addPod() after updated is map[efebe6cf-06cf-4805-abc0-8b74d7714bd9:&Pod{ObjectMeta:k8s_io_apimachinery_pkg_apis_meta_v1.ObjectMeta{Name:damo-7547d9b66d-mt2hf,GenerateName:damo-7547d9b66d-,Namespace:default,SelfLink:/api/v1/namespaces/default/pods/damo-7547d9b66d-mt2hf,UID:efebe6cf-06cf-4805-abc0-8b74d7714bd9,ResourceVersion:2222612,Generation:0,CreationTimestamp:2022-08-29 15:15:06 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{app: damo,pod-template-hash: 7547d9b66d,},Annotations:map[string]string{ALIYUN_COM_GPU_MEM_ASSIGNED: false,ALIYUN_COM_GPU_MEM_ASSUME_TIME: 1661786106645608559,ALIYUN_COM_GPU_MEM_DEV: 14,ALIYUN_COM_GPU_MEM_IDX: 0,ALIYUN_COM_GPU_MEM_POD: 10,},OwnerReferences:[{apps/v1 ReplicaSet damo-7547d9b66d ab9e6b18-d71f-4974-83f6-984892bfcf0f 0xc420b49810 0xc420b49811}],Finalizers:[],ClusterName:,Initializers:nil,},Spec:PodSpec{Volumes:[{default-token-r2dqp {nil nil nil nil nil SecretVolumeSource{SecretName:default-token-r2dqp,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil nil}}],Containers:[{damo1 192.168.0.181:80/library/alibaba_damo_capture:4.0.0 [] []  [] [] [{APES_ADDR 192.168.0.199 nil} {ALGO_TYPE video nil} {NVIDIA_DRIVER_CAPABILITIES compute,utility,video nil}] {map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}] map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}]} [{default-token-r2dqp true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false} {damo2 nvidia/cuda:10.0-base [] []  [] [] [{APES_ADDR 192.168.0.200 nil} {ALGO_TYPE video nil} {NVIDIA_DRIVER_CAPABILITIES compute,utility,video nil}] {map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}] map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}]} [{default-token-r2dqp true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil nil 
/dev/termination-log File IfNotPresent nil false false false}],RestartPolicy:Always,TerminationGracePeriodSeconds:*30,ActiveDeadlineSeconds:nil,DNSPolicy:ClusterFirst,NodeSelector:map[string]string{},ServiceAccountName:default,DeprecatedServiceAccount:default,NodeName:centos7-vm,HostNetwork:false,HostPID:false,HostIPC:false,SecurityContext:&PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],},ImagePullSecrets:[],Hostname:,Subdomain:,Affinity:nil,SchedulerName:default-scheduler,InitContainers:[],AutomountServiceAccountToken:nil,Tolerations:[{node.kubernetes.io/not-ready Exists  NoExecute 0xc420b49820} {node.kubernetes.io/unreachable Exists  NoExecute 0xc420b49828}],HostAliases:[],PriorityClassName:,Priority:*0,DNSConfig:nil,ShareProcessNamespace:nil,ReadinessGates:[],},Status:PodStatus{Phase:Pending,Conditions:[{Initialized True 0001-01-01 00:00:00 +0000 UTC 2022-08-29 15:15:06 +0000 UTC  } {Ready False 0001-01-01 00:00:00 +0000 UTC 2022-08-29 15:15:06 +0000 UTC ContainersNotReady containers with unready status: [damo1 damo2]} {ContainersReady False 0001-01-01 00:00:00 +0000 UTC 2022-08-29 15:15:06 +0000 UTC ContainersNotReady containers with unready status: [damo1 damo2]} {PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-08-29 15:15:06 +0000 UTC  }],Message:,Reason:,HostIP:10.16.0.130,PodIP:,StartTime:2022-08-29 15:15:06 +0000 UTC,ContainerStatuses:[{damo1 {ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 192.168.0.181:80/library/alibaba_damo_capture:4.0.0  } {damo2 {&ContainerStateWaiting{Reason:ContainerCreating,Message:,} nil nil} {nil nil nil} false 0 nvidia/cuda:10.0-base  }],QOSClass:BestEffort,InitContainerStatuses:[],NominatedNodeName:,},}], and its address is 0xc4204385e0
[  info ] 2022/08/29 15:15:07 controller.go:223: end processNextWorkItem()
[  info ] 2022/08/29 15:15:08 controller.go:210: begin processNextWorkItem()
[ debug ] 2022/08/29 15:15:10 controller.go:295: No need to update pod name damo-7547d9b66d-mt2hf in ns default and old status is Pending, new status is Running; its old annotation map[ALIYUN_COM_GPU_MEM_DEV:14 ALIYUN_COM_GPU_MEM_IDX:0 ALIYUN_COM_GPU_MEM_POD:10 ALIYUN_COM_GPU_MEM_ASSIGNED:false ALIYUN_COM_GPU_MEM_ASSUME_TIME:1661786106645608559] and new annotation map[ALIYUN_COM_GPU_MEM_ASSIGNED:false ALIYUN_COM_GPU_MEM_ASSUME_TIME:1661786106645608559 ALIYUN_COM_GPU_MEM_DEV:14 ALIYUN_COM_GPU_MEM_IDX:0 ALIYUN_COM_GPU_MEM_POD:10]
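
The annotation maps are easy to lose in these long lines. As a reading aid, here is a small sketch that pulls a Go-style `map[k:v k:v]` dump out of a log line into a dict; `parse_go_map` is a hypothetical helper, and it assumes keys and values contain no spaces, which holds for the annotation maps above but not for the full pod dumps:

```python
import re


def parse_go_map(text: str) -> dict:
    """Parse the first Go-style 'map[k:v k:v]' dump found in text into a dict of strings."""
    match = re.search(r"map\[([^\]]*)\]", text)
    if match is None:
        return {}
    result = {}
    for item in match.group(1).split():
        if ":" in item:
            key, value = item.split(":", 1)
            result[key] = value
    return result


# Annotation map copied from the last extender log line.
line = (
    "new annotation map[ALIYUN_COM_GPU_MEM_ASSIGNED:false "
    "ALIYUN_COM_GPU_MEM_ASSUME_TIME:1661786106645608559 "
    "ALIYUN_COM_GPU_MEM_DEV:14 ALIYUN_COM_GPU_MEM_IDX:0 "
    "ALIYUN_COM_GPU_MEM_POD:10]"
)

annotations = parse_go_map(line)
for key in sorted(annotations):
    print(key, "=", annotations[key])
```

Judging by the surrounding log lines, `ALIYUN_COM_GPU_MEM_POD:10` matches the logged `reqGPU` of 10, `ALIYUN_COM_GPU_MEM_IDX:0` matches the allocated GPU ID 0, and `ALIYUN_COM_GPU_MEM_DEV:14` matches the per-device total reported by `getAllGPUs`. `ALIYUN_COM_GPU_MEM_ASSIGNED` is still `false` in the last line shown, even though the pod status has moved to Running.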

gpushare-device-plugin logs

I0829 15:11:05.498287       1 allocate.go:46] ----Allocating GPU for gpu mem is started----
I0829 15:11:05.498341       1 allocate.go:57] RequestPodGPUs: 5
I0829 15:11:05.498359       1 allocate.go:61] checking...
I0829 15:11:05.511035       1 podmanager.go:112] all pod list [{{ } {damo-7547d9b66d-chqz4 damo-7547d9b66d- default /api/v1/namespaces/default/pods/damo-7547d9b66d-chqz4 7d85b004-213f-4677-8d14-43b7052d00e6 2222196 0 2022-08-29 15:11:05 +0000 UTC <nil> <nil> map[app:damo pod-template-hash:7547d9b66d] map[ALIYUN_COM_GPU_MEM_ASSIGNED:false ALIYUN_COM_GPU_MEM_ASSUME_TIME:1661785865463628785 ALIYUN_COM_GPU_MEM_DEV:14 ALIYUN_COM_GPU_MEM_IDX:0 ALIYUN_COM_GPU_MEM_POD:10] [{apps/v1 ReplicaSet damo-7547d9b66d 204d367a-9625-4350-82c1-32bfa29efb82 0xc42039493a 0xc42039493b}] nil [] } {[{default-token-r2dqp {nil nil nil nil nil &SecretVolumeSource{SecretName:default-token-r2dqp,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nilnil nil nil nil nil nil nil nil nil nil nil nil}}] [] [{damo1 192.168.0.181:80/library/alibaba_damo_capture:4.0.0 [] []  [] [] [{APES_ADDR 192.168.0.199 nil} {ALGO_TYPE video nil} {NVIDIA_DRIVER_CAPABILITIES compute,utility,video nil}] {map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}] map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}]} [{default-token-r2dqp true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false} {damo2 nvidia/cuda:10.0-base [] []  [] [] [{APES_ADDR 192.168.0.200 nil} {ALGO_TYPE video nil} {NVIDIA_DRIVER_CAPABILITIES compute,utility,video nil}] {map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}] map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}]} [{default-token-r2dqp true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}] Always 0xc4203951f8 <nil> ClusterFirst map[] default default <nil> centos7-vm false false false <nil> &PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],} []   nil default-scheduler [{node.kubernetes.io/not-ready Exists  NoExecute 0xc4203956e0} 
{node.kubernetes.io/unreachable Exists  NoExecute 0xc420395730}] []  0xc420395760 nil []} {Pending [{PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-08-29 15:11:05 +0000 UTC  }]      <nil> [] [] BestEffort}}]
I0829 15:11:05.511370       1 podmanager.go:123] list pod damo-7547d9b66d-chqz4 in ns default in node centos7-vm and status is Pending
I0829 15:11:05.511399       1 podutils.go:91] Found GPUSharedAssumed assumed pod damo-7547d9b66d-chqz4 in namespace default.
I0829 15:11:05.511413       1 podmanager.go:157] candidate pod damo-7547d9b66d-chqz4 in ns default with timestamp 1661785865463628785 is found.
I0829 15:11:05.511429       1 allocate.go:70] Pod damo-7547d9b66d-chqz4 in ns default request GPU Memory 10 with timestamp 1661785865463628785
W0829 15:11:05.511447       1 allocate.go:152] invalid allocation requst: request GPU memory 5 can't be satisfied.
I0829 15:11:05.526875       1 allocate.go:46] ----Allocating GPU for gpu mem is started----
I0829 15:11:05.526914       1 allocate.go:57] RequestPodGPUs: 5
I0829 15:11:05.526928       1 allocate.go:61] checking...
I0829 15:11:05.532196       1 podmanager.go:112] all pod list [{{ } {damo-7547d9b66d-chqz4 damo-7547d9b66d- default /api/v1/namespaces/default/pods/damo-7547d9b66d-chqz4 7d85b004-213f-4677-8d14-43b7052d00e6 2222196 0 2022-08-29 15:11:05 +0000 UTC <nil> <nil> map[app:damo pod-template-hash:7547d9b66d] map[ALIYUN_COM_GPU_MEM_ASSUME_TIME:1661785865463628785 ALIYUN_COM_GPU_MEM_DEV:14 ALIYUN_COM_GPU_MEM_IDX:0 ALIYUN_COM_GPU_MEM_POD:10 ALIYUN_COM_GPU_MEM_ASSIGNED:false] [{apps/v1 ReplicaSet damo-7547d9b66d 204d367a-9625-4350-82c1-32bfa29efb82 0xc42004671a 0xc42004671b}] nil [] } {[{default-token-r2dqp {nil nil nil nil nil &SecretVolumeSource{SecretName:default-token-r2dqp,Items:[],DefaultMode:*420,Optional:nil,} nil nil nil nil nil nil nil nil nilnil nil nil nil nil nil nil nil nil nil nil nil}}] [] [{damo1 192.168.0.181:80/library/alibaba_damo_capture:4.0.0 [] []  [] [] [{APES_ADDR 192.168.0.199 nil} {ALGO_TYPE video nil} {NVIDIA_DRIVER_CAPABILITIES compute,utility,video nil}] {map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}] map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}]} [{default-token-r2dqp true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false} {damo2 nvidia/cuda:10.0-base [] []  [] [] [{APES_ADDR 192.168.0.200 nil} {ALGO_TYPE video nil} {NVIDIA_DRIVER_CAPABILITIES compute,utility,video nil}] {map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}] map[aliyun.com/gpu-mem:{{5 0} {<nil>} 5 DecimalSI}]} [{default-token-r2dqp true /var/run/secrets/kubernetes.io/serviceaccount  <nil>}] [] nil nil nil /dev/termination-log File IfNotPresent nil false false false}] Always 0xc420046af8 <nil> ClusterFirst map[] default default <nil> centos7-vm false false false <nil> &PodSecurityContext{SELinuxOptions:nil,RunAsUser:nil,RunAsNonRoot:nil,SupplementalGroups:[],FSGroup:nil,RunAsGroup:nil,Sysctls:[],} []   nil default-scheduler [{node.kubernetes.io/not-ready Exists  NoExecute 0xc420046c50} 
{node.kubernetes.io/unreachable Exists  NoExecute 0xc420046cc0}] []  0xc420046cd0 nil []} {Pending [{PodScheduled True 0001-01-01 00:00:00 +0000 UTC 2022-08-29 15:11:05 +0000 UTC  }]      <nil> [] [] BestEffort}}]
I0829 15:11:05.532587       1 podmanager.go:123] list pod damo-7547d9b66d-chqz4 in ns default in node centos7-vm and status is Pending
I0829 15:11:05.532612       1 podutils.go:91] Found GPUSharedAssumed assumed pod damo-7547d9b66d-chqz4 in namespace default.
I0829 15:11:05.532628       1 podmanager.go:157] candidate pod damo-7547d9b66d-chqz4 in ns default with timestamp 1661785865463628785 is found.
I0829 15:11:05.532644       1 allocate.go:70] Pod damo-7547d9b66d-chqz4 in ns default request GPU Memory 10 with timestamp 1661785865463628785
W0829 15:11:05.532664       1 allocate.go:152] invalid allocation requst: request GPU memory 5 can't be satisfied.
freelizhun commented 10 months ago

The gpushare-device-plugin logic in this project simply does not support the case of two containers in one pod. See the code for details.
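
To illustrate what the device-plugin log above seems to show: each allocation call reports `RequestPodGPUs: 5`, the candidate pod is annotated with a total of 10, and the plugin then warns that the request can't be satisfied. The sketch below models that comparison under two assumptions: that a call is matched to a pending pod only when the two numbers are equal, and that on a mismatch the container is handed a placeholder device id of the form seen in the error message. `find_candidate` and `visible_device` are hypothetical names for illustration, not the plugin's actual functions, so the real logic should be checked against the plugin source.

```python
from typing import Optional


def find_candidate(pods: list, requested: int) -> Optional[dict]:
    """Return the first unassigned pod whose annotated total equals the request."""
    for pod in pods:
        if not pod["assigned"] and pod["gpu_mem_pod"] == requested:
            return pod
    return None


def visible_device(pods: list, requested: int, unit: str = "MiB") -> str:
    """Return the GPU index for a matching pod, or a placeholder id on mismatch."""
    pod = find_candidate(pods, requested)
    if pod is None:
        # Assumed format, copied from the error message in the issue.
        return f"no-gpu-has-{requested}{unit}-to-run"
    return str(pod["gpu_idx"])


# State taken from the logs: one pending pod annotated with a total of 10 on GPU 0.
pending = [{"name": "damo", "gpu_mem_pod": 10, "gpu_idx": 0, "assigned": False}]

print(visible_device(pending, 5))   # per-container request seen in the plugin log
print(visible_device(pending, 10))  # one container asking for the whole pod total
```

Under this model a single-container pod matches, because its one request equals the pod total, which is consistent with the single-container case working. With two containers asking for 5 each, neither call equals 10, so both end up with the placeholder id that `nvidia-container-cli` then rejects.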