Closed PhilixHe closed 2 weeks ago
Resizing a Deployment's GPU resources may cause scheduling failures due to insufficient resources.

What does "may" mean here: occasional scheduling failures, or Pending for a while and then Running?
- Please provide the Deployment declaration from before and after the resource change
- What is the scheduling-failure Event (kubectl describe pod)?
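For completeness, the requested information can be collected with standard kubectl commands; the namespace and Deployment name below are taken from the manifest shared later in this thread, and the pod name is a placeholder:

```bash
# Deployment declaration before/after the GPU resource change
kubectl -n default-2000174887 get deployment hefei-test1 -o yaml

# Events of the affected Pod (look for FailedScheduling in the Events section)
kubectl -n default-2000174887 describe pod <pod-name>
```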
The scheduling failures are caused by the hami-scheduler deployed here holding inaccurate GPU resource statistics. For example: after the Deployment below is deployed and scheduled successfully, curl {scheduler node ip}:31993/metrics reports GPUDeviceSharedNum=1. When the Deployment's GPU request is edited (or the Pod is deleted in place), the old Pod is removed (Terminating completes and it no longer shows up in kubectl get) and a new Pod is created, scheduled, and run from the updated template. Running curl {scheduler node ip}:31993/metrics again then shows GPUDeviceSharedNum=2, and the vGPUCorePercentage, vGPUMemoryPercentage, and vGPUPodsDeviceAllocated gauges still contain entries for the old Pod in addition to the new one. In other words, the old Pod's resources were never released by hami-scheduler.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "2"
    kenc.ksyun.com/username: philix_he
    ovn.kubernetes.io/limit_rate: "8000"
    ovn.kubernetes.io/limit_rate_kind: pod
  creationTimestamp: "2024-11-07T02:34:34Z"
  generation: 3
  labels:
    app: hefei-test1
    kenc.ksyun.com/app: hefei-test1
    kenc.ksyun.com/calculation: container
  name: hefei-test1
  namespace: default-2000174887
  resourceVersion: "10478859"
  uid: de1d4b6b-0361-4ef3-a047-eebbae8c4574
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: hefei-test1
  strategy:
    rollingUpdate:
      maxSurge: 20%
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      annotations:
        kenc.ksyun.com/controller-kind: deployment
      creationTimestamp: null
      labels:
        app: hefei-test1
        kenc.ksyun.com/app: hefei-test1
        kenc.ksyun.com/calculation: container
    spec:
      containers:
      - env:
        - name: REGION
          value: jxmp37
        image: registry.kenc.com/kenc/nginx:1.16.0
        imagePullPolicy: IfNotPresent
        name: hefei-test1
        resources:
          limits:
            cpu: "2"
            memory: 2Gi
            nvidia.com/gpu: "1"
            nvidia.com/gpumem: "2048"
          requests:
            cpu: "1"
            memory: 2Gi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: image-secret
      priorityClassName: client-normal-priority
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  conditions:
  - lastTransitionTime: "2024-11-07T02:34:34Z"
    lastUpdateTime: "2024-11-07T02:34:34Z"
    message: Deployment has minimum availability.
    reason: MinimumReplicasAvailable
    status: "True"
    type: Available
  - lastTransitionTime: "2024-11-07T02:34:34Z"
    lastUpdateTime: "2024-11-07T02:37:15Z"
    message: ReplicaSet "hefei-test1-f97ddd94b" has successfully progressed.
    reason: NewReplicaSetAvailable
    status: "True"
    type: Progressing
  observedGeneration: 3
  readyReplicas: 1
  replicas: 1
  updatedReplicas: 1
```
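For reference, the reproduction described above roughly corresponds to the command sequence below. This is only a sketch: the node IP and port match the metrics calls that follow, and the value 4096 is just an example of changing the nvidia.com/gpumem request.

```bash
# Scheduler's view before the change: one container sharing the GPU
curl -s 10.35.211.17:31993/metrics | grep ^GPUDeviceSharedNum

# Trigger a rolling update by changing the GPU memory limit
# ("/" in the resource name is escaped as "~1" in the JSON-patch path);
# deleting the Pod in place (kubectl delete pod <old-pod>) reproduces it too.
kubectl -n default-2000174887 patch deployment hefei-test1 --type=json -p '[
  {"op": "replace",
   "path": "/spec/template/spec/containers/0/resources/limits/nvidia.com~1gpumem",
   "value": "4096"}
]'

# Wait until the old Pod finishes Terminating and the new one is Running
kubectl -n default-2000174887 get pods -w

# Scheduler's view after the change: GPUDeviceSharedNum now reports 2
# even though only one Pod is actually using the GPU
curl -s 10.35.211.17:31993/metrics | grep ^GPUDeviceSharedNum
```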
GPU node metrics (the node metrics were collected twice, covering the statistics before and after the Pod deletion):
```
[root@cdnkencjxmp35-edge11 ~]# curl 10.35.211.17:31993/metrics
# HELP GPUDeviceCoreAllocated Device core allocated for a certain GPU
# TYPE GPUDeviceCoreAllocated gauge
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 0
# HELP GPUDeviceCoreLimit Device memory core limit for a certain GPU
# TYPE GPUDeviceCoreLimit gauge
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 100
# HELP GPUDeviceMemoryAllocated Device memory allocated for a certain GPU
# TYPE GPUDeviceMemoryAllocated gauge
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 2.147483648e+09
# HELP GPUDeviceMemoryLimit Device memory limit for a certain GPU
# TYPE GPUDeviceMemoryLimit gauge
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 2.5769803776e+10
# HELP GPUDeviceSharedNum Number of containers sharing this GPU
# TYPE GPUDeviceSharedNum gauge
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 1
# HELP nodeGPUMemoryPercentage GPU Memory Allocated Percentage on a certain GPU
# TYPE nodeGPUMemoryPercentage gauge
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 0.08333333333333333
# HELP nodeGPUOverview GPU overview on a certain node
# TYPE nodeGPUOverview gauge
nodeGPUOverview{devicecores="0",deviceidx="0",devicememorylimit="24576",devicetype="NVIDIA-NVIDIA GeForce RTX 3090",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",sharedcontainers="1",zone="vGPU"} 2.147483648e+09
# HELP vGPUCorePercentage vGPU core allocated from a container
# TYPE vGPUCorePercentage gauge
vGPUCorePercentage{containeridx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-rhh9t",podnamespace="default-2000174887",zone="vGPU"} 0
# HELP vGPUMemoryPercentage vGPU memory percentage allocated from a container
# TYPE vGPUMemoryPercentage gauge
vGPUMemoryPercentage{containeridx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-rhh9t",podnamespace="default-2000174887",zone="vGPU"} 0.08333333333333333
# HELP vGPUPodsDeviceAllocated vGPU Allocated from pods
# TYPE vGPUPodsDeviceAllocated gauge
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-rhh9t",podnamespace="default-2000174887",zone="vGPU"} 2.147483648e+09
[root@cdnkencjxmp35-edge11 ~]#
[root@cdnkencjxmp35-edge11 ~]#
[root@cdnkencjxmp35-edge11 ~]#
[root@cdnkencjxmp35-edge11 ~]# curl 10.35.211.17:31993/metrics
# HELP GPUDeviceCoreAllocated Device core allocated for a certain GPU
# TYPE GPUDeviceCoreAllocated gauge
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 0
# HELP GPUDeviceCoreLimit Device memory core limit for a certain GPU
# TYPE GPUDeviceCoreLimit gauge
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 100
# HELP GPUDeviceMemoryAllocated Device memory allocated for a certain GPU
# TYPE GPUDeviceMemoryAllocated gauge
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 4.294967296e+09
# HELP GPUDeviceMemoryLimit Device memory limit for a certain GPU
# TYPE GPUDeviceMemoryLimit gauge
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 2.5769803776e+10
# HELP GPUDeviceSharedNum Number of containers sharing this GPU
# TYPE GPUDeviceSharedNum gauge
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 2
# HELP nodeGPUMemoryPercentage GPU Memory Allocated Percentage on a certain GPU
# TYPE nodeGPUMemoryPercentage gauge
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 0.16666666666666666
# HELP nodeGPUOverview GPU overview on a certain node
# TYPE nodeGPUOverview gauge
nodeGPUOverview{devicecores="0",deviceidx="0",devicememorylimit="24576",devicetype="NVIDIA-NVIDIA GeForce RTX 3090",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",sharedcontainers="2",zone="vGPU"} 4.294967296e+09
# HELP vGPUCorePercentage vGPU core allocated from a container
# TYPE vGPUCorePercentage gauge
vGPUCorePercentage{containeridx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-rhh9t",podnamespace="default-2000174887",zone="vGPU"} 0
vGPUCorePercentage{containeridx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-vmzkx",podnamespace="default-2000174887",zone="vGPU"} 0
# HELP vGPUMemoryPercentage vGPU memory percentage allocated from a container
# TYPE vGPUMemoryPercentage gauge
vGPUMemoryPercentage{containeridx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-rhh9t",podnamespace="default-2000174887",zone="vGPU"} 0.08333333333333333
vGPUMemoryPercentage{containeridx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-vmzkx",podnamespace="default-2000174887",zone="vGPU"} 0.08333333333333333
# HELP vGPUPodsDeviceAllocated vGPU Allocated from pods
# TYPE vGPUPodsDeviceAllocated gauge
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-rhh9t",podnamespace="default-2000174887",zone="vGPU"} 2.147483648e+09
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-vmzkx",podnamespace="default-2000174887",zone="vGPU"} 2.147483648e+09
```
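Each of these entries accounts for 2048 MiB (2048 × 1024² = 2147483648 bytes), which is why GPUDeviceMemoryAllocated doubles from 2.147483648e+09 to 4.294967296e+09 even though only one Pod is running. As a rough way to spot stale entries like hefei-test1-f97ddd94b-rhh9t, one can compare the pod names reported by the metrics with the Pods that still exist in the namespace; the temporary file paths below are only illustrative:

```bash
# Pod names that hami-scheduler still accounts for
curl -s 10.35.211.17:31993/metrics \
  | grep '^vGPUPodsDeviceAllocated' \
  | sed -n 's/.*podname="\([^"]*\)".*/\1/p' \
  | sort -u > /tmp/metrics_pods.txt

# Pod names that actually exist in the namespace
kubectl -n default-2000174887 get pods -o name \
  | sed 's|^pod/||' | sort -u > /tmp/live_pods.txt

# Lines prefixed with "<" exist only in the metrics, i.e. the scheduler
# is still holding resources for Pods that have already been deleted.
diff /tmp/metrics_pods.txt /tmp/live_pods.txt
```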
[Progress notes]
What we found by troubleshooting over direct chat:
- The scheduler filtered and scored nodes based on the requested GPU resources, successfully allocated a node and GPU, and patched the allocation result into the Pod's annotations
- The device plugin (dp) also decoded the annotations successfully, performed the Allocate-related operations, and released the node lock
- Running nvidia-smi inside the container showed the expected GPU memory
- The entire flow above was confirmed from the scheduler and dp logs; everything went smoothly

However, once the Pod was Running there were no HAMi-related annotations left on it, so the current suspicion is a conflict with some other component in the cluster.
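One quick way to confirm this is to dump the annotations of the Running Pod and check whether the allocation annotations written by hami-scheduler are still present. The pod name below is taken from the metrics above; the exact HAMi annotation keys depend on the HAMi version:

```bash
# Print all annotations of the new Pod; if nothing HAMi-related shows up,
# something rewrote metadata.annotations after scheduling.
kubectl -n default-2000174887 get pod hefei-test1-f97ddd94b-vmzkx \
  -o jsonpath='{.metadata.annotations}'
```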
The issue above has been resolved. 🎉
The root cause matches what @Nimbus318 suspected: another component in the cluster also modifies annotations while the GPU Pod is being created, which overwrote the HAMi annotations; as a result, when the GPU Pod was deleted, its GPU resource quota was never reclaimed by hami-scheduler.
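For illustration, the difference comes down to how the other component patches the Pod: a merge patch only touches its own keys, while replacing the whole annotations map wipes everything the scheduler wrote. A minimal sketch with kubectl, using a made-up example.com/owner annotation:

```bash
# Merge patch: adds or updates only the given key; all other annotations
# (including the scheduler's allocation record) are preserved.
kubectl -n default-2000174887 patch pod hefei-test1-f97ddd94b-vmzkx --type=merge \
  -p '{"metadata":{"annotations":{"example.com/owner":"some-controller"}}}'

# JSON-patch replace of the whole annotations map: every existing annotation
# is dropped and replaced, which is the kind of overwrite that caused this issue.
kubectl -n default-2000174887 patch pod hefei-test1-f97ddd94b-vmzkx --type=json \
  -p '[{"op":"replace","path":"/metadata/annotations","value":{"example.com/owner":"some-controller"}}]'
```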
Many thanks to community member @Nimbus318 for the patient help in locating and resolving this issue!