AliyunContainerService / gpushare-scheduler-extender

GPU Sharing Scheduler for Kubernetes Cluster
Apache License 2.0

After repeatedly deleting and recreating a Pod, newly created Pods get stuck in Pending #201

Open liufangpeng opened 1 year ago

liufangpeng commented 1 year ago

The Pod is created normally on the first deployment, but the problem appears after deleting and creating the same Pod several times.
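For anyone trying to reproduce this, the delete/create cycle described above could be scripted roughly as follows. This is only a sketch under assumptions: the manifest file name `binpack-3.yaml` is hypothetical (the report does not say how the Pod was created), the namespace is taken from the commands in this report, and the number of cycles needed to trigger the Pending state is unknown.

```python
import subprocess
from typing import Callable, List

NAMESPACE = "test-testgpu"   # namespace used in the commands in this report
MANIFEST = "binpack-3.yaml"  # hypothetical manifest file name (assumption)


def build_cycle_commands(manifest: str, namespace: str) -> List[List[str]]:
    """Return the kubectl commands for one delete/create cycle."""
    return [
        # Delete the workload and wait until the deletion has finished.
        ["kubectl", "-n", namespace, "delete", "-f", manifest,
         "--ignore-not-found", "--wait=true"],
        # Create it again from the same manifest.
        ["kubectl", "-n", namespace, "create", "-f", manifest],
    ]


def run_kubectl(cmd: List[str]) -> None:
    """Run one command and raise if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


def run_cycles(count: int,
               runner: Callable[[List[str]], None] = run_kubectl,
               manifest: str = MANIFEST,
               namespace: str = NAMESPACE) -> int:
    """Run `count` delete/create cycles; return the number of commands issued."""
    issued = 0
    for _ in range(count):
        for cmd in build_cycle_commands(manifest, namespace):
            runner(cmd)
            issued += 1
    return issued


if __name__ == "__main__":
    # The cycle count is arbitrary; adjust it until the Pod stays Pending.
    run_cycles(10)
```

After the loop, the event and extender log commands shown in this report can be used to check whether the new Pod is stuck in Pending.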

Command used: kubectl -n test-testgpu get event

LAST SEEN   TYPE      REASON             OBJECT                           MESSAGE
3m4s        Warning   FailedScheduling   pod/binpack-3-7b8684575d-cqntk   0/1 nodes are available: 1 Insufficient GPU Memory in one device.

Command used: nvidia-smi

Wed Feb 15 14:57:18 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02    Driver Version: 510.85.02    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40-4Q       On   | 00000000:02:00.0 Off |                    0 |
| N/A   N/A    P0    N/A /  N/A |      0MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Command used: kubectl -n kube-system logs -f gpushare-schd-extender-594b9bc6d6-lh8w9

[ debug ] 2023/02/15 06:58:39 routes.go:162: /gpushare-scheduler/filter response=&{0xc42047e1e0 0xc420548300 0xc420355b80 0x565b70 true false false false 0xc4200aa580 {0xc42037a1c0 map[Content-Type:[application/json]] false false} map[Content-Type:[application/json]] true 111 -1 200 false false [] 0 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] [0 0 0 0 0 0 0 0 0 0] [0 0 0] 0xc420348070 0}
[ debug ] 2023/02/15 06:58:58 controller.go:295: No need to update pod name binpack-3-7b8684575d-9ksk4 in ns test-testgpu and old status is Pending, new status is Pending; its old annotation map[ovn.kubernetes.io/logical_switch:ovn-default ovn.kubernetes.io/mac_address:00:00:00:B9:31:56 ovn.kubernetes.io/network_type:geneve ovn.kubernetes.io/pod_nic_type:veth-pair ovn.kubernetes.io/gateway:10.183.0.1 ovn.kubernetes.io/logical_router:ovn-cluster kubernetes.io/psp:20-user-restricted ovn.kubernetes.io/allocated:true ovn.kubernetes.io/cidr:10.183.0.0/16 ovn.kubernetes.io/ip_address:10.183.0.113 cpaas.io/creator:admin kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container binpack-3; cpu, memory limit for container binpack-3] and new annotation map[ovn.kubernetes.io/logical_router:ovn-cluster ovn.kubernetes.io/logical_switch:ovn-default ovn.kubernetes.io/mac_address:00:00:00:B9:31:56 ovn.kubernetes.io/network_type:geneve ovn.kubernetes.io/pod_nic_type:veth-pair ovn.kubernetes.io/gateway:10.183.0.1 kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container binpack-3; cpu, memory limit for container binpack-3 kubernetes.io/psp:20-user-restricted ovn.kubernetes.io/allocated:true ovn.kubernetes.io/cidr:10.183.0.0/16 ovn.kubernetes.io/ip_address:10.183.0.113 cpaas.io/creator:admin]
[ debug ] 2023/02/15 06:59:28 controller.go:295: No need to update pod name binpack-3-7b8684575d-9ksk4 in ns test-testgpu and old status is Pending, new status is Pending; its old annotation map[ovn.kubernetes.io/logical_router:ovn-cluster ovn.kubernetes.io/logical_switch:ovn-default ovn.kubernetes.io/mac_address:00:00:00:B9:31:56 ovn.kubernetes.io/network_type:geneve ovn.kubernetes.io/pod_nic_type:veth-pair ovn.kubernetes.io/gateway:10.183.0.1 kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container binpack-3; cpu, memory limit for container binpack-3 kubernetes.io/psp:20-user-restricted ovn.kubernetes.io/allocated:true ovn.kubernetes.io/cidr:10.183.0.0/16 ovn.kubernetes.io/ip_address:10.183.0.113 cpaas.io/creator:admin] and new annotation map[ovn.kubernetes.io/gateway:10.183.0.1 ovn.kubernetes.io/logical_router:ovn-cluster ovn.kubernetes.io/logical_switch:ovn-default ovn.kubernetes.io/mac_address:00:00:00:B9:31:56 ovn.kubernetes.io/network_type:geneve ovn.kubernetes.io/pod_nic_type:veth-pair cpaas.io/creator:admin kubernetes.io/limit-ranger:LimitRanger plugin set: cpu, memory request for container binpack-3; cpu, memory limit for container binpack-3 kubernetes.io/psp:20-user-restricted ovn.kubernetes.io/allocated:true ovn.kubernetes.io/cidr:10.183.0.0/16 ovn.kubernetes.io/ip_address:10.183.0.113]

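After I recreated gpushare-scheduler-extender, Pods could be created normally again, but after creating the Pod a few more times it went back to Pending. I have not yet found the specific cause.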