Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

Scheduler resource accounting is inaccurate, which can cause scheduling failures due to insufficient resources. #594

Closed · PhilixHe closed this issue 2 weeks ago

PhilixHe commented 2 weeks ago

Please provide an in-depth description of the question you have:

  1. After changing the GPU resource size of a Deployment, the old Pod is deleted but hami-scheduler does not release the vGPU resources it occupied, which can lead to scheduling failures due to insufficient resources.
  2. Comparing the metrics fetched from the scheduler and from the device-plugin: the pod counted by the scheduler (hefei-ollama-6bdd9654bf-ch5cc) was already deleted after the Deployment update but still shows up in the scheduler's data, while the device-plugin's data is correct. (screenshot attached)

What do you think about this question?:

Environment:

Nimbus318 commented 2 weeks ago

Changing the GPU resource size of a Deployment

can cause scheduling failures due to insufficient resources

  1. Please share the Deployment manifests from before and after the resource change.
  2. What is the scheduling-failure Event (kubectl describe pod)?
  3. What does "can" mean here: does scheduling fail only occasionally, or does the Pod stay Pending for a while and then go Running?
PhilixHe commented 2 weeks ago

Changing the GPU resource size of a Deployment

can cause scheduling failures due to insufficient resources

  1. Please share the Deployment manifests from before and after the resource change.
  2. What is the scheduling-failure Event (kubectl describe pod)?
  3. What does "can" mean here: does scheduling fail only occasionally, or does the Pod stay Pending for a while and then go Running?
  1. The scheduling failure happens because the GPU resource accounting held by the deployed hami-scheduler is inaccurate. For example: after the Deployment below is deployed and scheduled successfully, curl {scheduler node ip}:31993/metrics reports GPUDeviceSharedNum=1. When the Deployment's GPU request is edited (or the Pod is deleted in place), the old Pod is removed (Terminating finishes and kubectl get no longer shows it) and a new Pod is created, scheduled, and runs from the updated template. Running curl {scheduler node ip}:31993/metrics again now shows GPUDeviceSharedNum=2, and the vGPUCorePercentage, vGPUMemoryPercentage, and vGPUPodsDeviceAllocated gauges contain the old Pod's entries in addition to the new Pod's; in other words, the old Pod's resources were never released by hami-scheduler.
  2. The deployed Deployment:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      annotations:
        deployment.kubernetes.io/revision: "2"
        kenc.ksyun.com/username: philix_he
        ovn.kubernetes.io/limit_rate: "8000"
        ovn.kubernetes.io/limit_rate_kind: pod
      creationTimestamp: "2024-11-07T02:34:34Z"
      generation: 3
      labels:
        app: hefei-test1
        kenc.ksyun.com/app: hefei-test1
        kenc.ksyun.com/calculation: container
      name: hefei-test1
      namespace: default-2000174887
      resourceVersion: "10478859"
      uid: de1d4b6b-0361-4ef3-a047-eebbae8c4574
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: hefei-test1
      strategy:
        rollingUpdate:
          maxSurge: 20%
          maxUnavailable: 1
        type: RollingUpdate
      template:
        metadata:
          annotations:
            kenc.ksyun.com/controller-kind: deployment
          creationTimestamp: null
          labels:
            app: hefei-test1
            kenc.ksyun.com/app: hefei-test1
            kenc.ksyun.com/calculation: container
        spec:
          containers:
          - env:
            - name: REGION
              value: jxmp37
            image: registry.kenc.com/kenc/nginx:1.16.0
            imagePullPolicy: IfNotPresent
            name: hefei-test1
            resources:
              limits:
                cpu: "2"
                memory: 2Gi
                nvidia.com/gpu: "1"
                nvidia.com/gpumem: "2048"
              requests:
                cpu: "1"
                memory: 2Gi
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          imagePullSecrets:
          - name: image-secret
          priorityClassName: client-normal-priority
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
    status:
      availableReplicas: 1
      conditions:
      - lastTransitionTime: "2024-11-07T02:34:34Z"
        lastUpdateTime: "2024-11-07T02:34:34Z"
        message: Deployment has minimum availability.
        reason: MinimumReplicasAvailable
        status: "True"
        type: Available
      - lastTransitionTime: "2024-11-07T02:34:34Z"
        lastUpdateTime: "2024-11-07T02:37:15Z"
        message: ReplicaSet "hefei-test1-f97ddd94b" has successfully progressed.
        reason: NewReplicaSetAvailable
        status: "True"
        type: Progressing
      observedGeneration: 3
      readyReplicas: 1
      replicas: 1
      updatedReplicas: 1

    GPU node metrics (scraped twice, covering the state before and after the old Pod was deleted; a parsing sketch follows the second dump):

[root@cdnkencjxmp35-edge11 ~]# curl 10.35.211.17:31993/metrics
# HELP GPUDeviceCoreAllocated Device core allocated for a certain GPU
# TYPE GPUDeviceCoreAllocated gauge
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 0
# HELP GPUDeviceCoreLimit Device memory core limit for a certain GPU
# TYPE GPUDeviceCoreLimit gauge
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 100
# HELP GPUDeviceMemoryAllocated Device memory allocated for a certain GPU
# TYPE GPUDeviceMemoryAllocated gauge
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 2.147483648e+09
# HELP GPUDeviceMemoryLimit Device memory limit for a certain GPU
# TYPE GPUDeviceMemoryLimit gauge
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 2.5769803776e+10
# HELP GPUDeviceSharedNum Number of containers sharing this GPU
# TYPE GPUDeviceSharedNum gauge
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 1
# HELP nodeGPUMemoryPercentage GPU Memory Allocated Percentage on a certain GPU
# TYPE nodeGPUMemoryPercentage gauge
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 0.08333333333333333
# HELP nodeGPUOverview GPU overview on a certain node
# TYPE nodeGPUOverview gauge
nodeGPUOverview{devicecores="0",deviceidx="0",devicememorylimit="24576",devicetype="NVIDIA-NVIDIA GeForce RTX 3090",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",sharedcontainers="1",zone="vGPU"} 2.147483648e+09
# HELP vGPUCorePercentage vGPU core allocated from a container
# TYPE vGPUCorePercentage gauge
vGPUCorePercentage{containeridx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-rhh9t",podnamespace="default-2000174887",zone="vGPU"} 0
# HELP vGPUMemoryPercentage vGPU memory percentage allocated from a container
# TYPE vGPUMemoryPercentage gauge
vGPUMemoryPercentage{containeridx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-rhh9t",podnamespace="default-2000174887",zone="vGPU"} 0.08333333333333333
# HELP vGPUPodsDeviceAllocated vGPU Allocated from pods
# TYPE vGPUPodsDeviceAllocated gauge
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-rhh9t",podnamespace="default-2000174887",zone="vGPU"} 2.147483648e+09
[root@cdnkencjxmp35-edge11 ~]#
[root@cdnkencjxmp35-edge11 ~]# curl 10.35.211.17:31993/metrics
# HELP GPUDeviceCoreAllocated Device core allocated for a certain GPU
# TYPE GPUDeviceCoreAllocated gauge
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 0
# HELP GPUDeviceCoreLimit Device memory core limit for a certain GPU
# TYPE GPUDeviceCoreLimit gauge
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 100
# HELP GPUDeviceMemoryAllocated Device memory allocated for a certain GPU
# TYPE GPUDeviceMemoryAllocated gauge
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 4.294967296e+09
# HELP GPUDeviceMemoryLimit Device memory limit for a certain GPU
# TYPE GPUDeviceMemoryLimit gauge
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 2.5769803776e+10
# HELP GPUDeviceSharedNum Number of containers sharing this GPU
# TYPE GPUDeviceSharedNum gauge
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 2
# HELP nodeGPUMemoryPercentage GPU Memory Allocated Percentage on a certain GPU
# TYPE nodeGPUMemoryPercentage gauge
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 0.16666666666666666
# HELP nodeGPUOverview GPU overview on a certain node
# TYPE nodeGPUOverview gauge
nodeGPUOverview{devicecores="0",deviceidx="0",devicememorylimit="24576",devicetype="NVIDIA-NVIDIA GeForce RTX 3090",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",sharedcontainers="2",zone="vGPU"} 4.294967296e+09
# HELP vGPUCorePercentage vGPU core allocated from a container
# TYPE vGPUCorePercentage gauge
vGPUCorePercentage{containeridx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-rhh9t",podnamespace="default-2000174887",zone="vGPU"} 0
vGPUCorePercentage{containeridx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-vmzkx",podnamespace="default-2000174887",zone="vGPU"} 0
# HELP vGPUMemoryPercentage vGPU memory percentage allocated from a container
# TYPE vGPUMemoryPercentage gauge
vGPUMemoryPercentage{containeridx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-rhh9t",podnamespace="default-2000174887",zone="vGPU"} 0.08333333333333333
vGPUMemoryPercentage{containeridx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-vmzkx",podnamespace="default-2000174887",zone="vGPU"} 0.08333333333333333
# HELP vGPUPodsDeviceAllocated vGPU Allocated from pods
# TYPE vGPUPodsDeviceAllocated gauge
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-rhh9t",podnamespace="default-2000174887",zone="vGPU"} 2.147483648e+09
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-vmzkx",podnamespace="default-2000174887",zone="vGPU"} 2.147483648e+09
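
To make the stale entry easier to spot, below is a minimal sketch (my own illustration, not part of HAMi) that scrapes the scheduler metrics endpoint shown above and prints every pod still accounted for in the vGPUPodsDeviceAllocated gauge; any pod listed here that no longer appears in kubectl get pods is one whose allocation the scheduler has not released. The endpoint address and metric/label names are taken from the dumps above, and the standard Prometheus expfmt text parser is assumed.

    package main

    import (
        "fmt"
        "log"
        "net/http"

        "github.com/prometheus/common/expfmt"
    )

    func main() {
        // Scheduler metrics endpoint from the dumps above.
        resp, err := http.Get("http://10.35.211.17:31993/metrics")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        // Parse the Prometheus text exposition format.
        var parser expfmt.TextParser
        families, err := parser.TextToMetricFamilies(resp.Body)
        if err != nil {
            log.Fatal(err)
        }

        // Print every pod the scheduler still accounts for on each device.
        mf, ok := families["vGPUPodsDeviceAllocated"]
        if !ok {
            log.Fatal("metric vGPUPodsDeviceAllocated not found")
        }
        for _, m := range mf.GetMetric() {
            labels := map[string]string{}
            for _, lp := range m.GetLabel() {
                labels[lp.GetName()] = lp.GetValue()
            }
            fmt.Printf("pod=%s/%s device=%s allocated=%.0f bytes\n",
                labels["podnamespace"], labels["podname"],
                labels["deviceuuid"], m.GetGauge().GetValue())
        }
    }

Against the second dump above, this would list both hefei-test1-f97ddd94b-rhh9t and hefei-test1-f97ddd94b-vmzkx even though only one of them still exists in the cluster.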
Nimbus318 commented 2 weeks ago

[Progress update]

Debugged together over a direct chat:

  1. The scheduler filtered and scored nodes based on the requested GPU resources, successfully assigned a node and a GPU, and patched the allocation result into the Pod's annotations.
  2. The device plugin also decoded the annotations successfully, performed the Allocate-related operations, and released the node lock.
  3. Running nvidia-smi inside the container shows the expected GPU memory.
  4. The whole flow above was confirmed from the scheduler and device-plugin logs, and everything went through cleanly.

However, once the Pod is Running it no longer carries any hami-related annotations; the current suspicion is a conflict with another component in the cluster.
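
For reference, a quick way to confirm whether a Running Pod has lost its HAMi annotations is to dump its annotations and filter by key prefix, for example with kubectl get pod -o jsonpath='{.metadata.annotations}'. The client-go sketch below does the same check; it is only an illustration (not HAMi code), assumes the scheduler writes its allocation under keys starting with hami.io/, and reuses the pod/namespace names from the metrics above, so adjust both for your environment and HAMi version.

    package main

    import (
        "context"
        "fmt"
        "log"
        "strings"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Build a client from the local kubeconfig (~/.kube/config).
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            log.Fatal(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }

        // Pod and namespace names taken from the metrics dumps above.
        pod, err := client.CoreV1().Pods("default-2000174887").
            Get(context.TODO(), "hefei-test1-f97ddd94b-vmzkx", metav1.GetOptions{})
        if err != nil {
            log.Fatal(err)
        }

        // Assumed prefix for HAMi scheduler annotations; adjust for your version.
        found := false
        for k, v := range pod.Annotations {
            if strings.HasPrefix(k, "hami.io/") {
                fmt.Printf("%s = %s\n", k, v)
                found = true
            }
        }
        if !found {
            fmt.Println("no hami.io/* annotations left on the pod")
        }
    }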

PhilixHe commented 2 weeks ago

[Progress update]

Debugged together over a direct chat:

  1. The scheduler filtered and scored nodes based on the requested GPU resources, successfully assigned a node and a GPU, and patched the allocation result into the Pod's annotations.
  2. The device plugin also decoded the annotations successfully, performed the Allocate-related operations, and released the node lock.
  3. Running nvidia-smi inside the container shows the expected GPU memory.
  4. The whole flow above was confirmed from the scheduler and device-plugin logs, and everything went through cleanly.

However, once the Pod is Running it no longer carries any hami-related annotations; the current suspicion is a conflict with another component in the cluster.

The problem above has been resolved. 🎉

The root cause matches @Nimbus318's suspicion: another component in the cluster also modifies annotations while the GPU Pod is being created, which overwrote the HAMi annotations; as a result, when the GPU Pod was deleted, hami-scheduler never reclaimed its GPU resource quota.
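
As a purely hypothetical sketch of how such a conflict can arise (the controller and the example.com annotation key below are made up, not the actual offending component): a controller that rebuilds a Pod's annotation map from its own template and calls Update silently drops the keys the HAMi scheduler patched in after scheduling, while a merge Patch that carries only the controller's own key leaves them intact.

    // Hypothetical sketch of the failure mode, written against client-go.
    package demo

    import (
        "context"
        "encoding/json"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
    )

    // clobber replaces the whole annotation map before calling Update, so any
    // hami.io/* annotations added by the scheduler after Pod creation are lost.
    func clobber(ctx context.Context, c kubernetes.Interface, pod *corev1.Pod) error {
        pod.Annotations = map[string]string{ // rebuilt from the component's own template
            "example.com/managed-by": "some-controller",
        }
        _, err := c.CoreV1().Pods(pod.Namespace).Update(ctx, pod, metav1.UpdateOptions{})
        return err
    }

    // merge adds/updates only the controller's own key via a merge patch,
    // leaving every other annotation (including HAMi's) untouched.
    func merge(ctx context.Context, c kubernetes.Interface, pod *corev1.Pod) error {
        patch, err := json.Marshal(map[string]interface{}{
            "metadata": map[string]interface{}{
                "annotations": map[string]string{
                    "example.com/managed-by": "some-controller",
                },
            },
        })
        if err != nil {
            return err
        }
        _, err = c.CoreV1().Pods(pod.Namespace).Patch(ctx, pod.Name,
            types.MergePatchType, patch, metav1.PatchOptions{})
        return err
    }

Components that own only a few annotation keys are generally safer patching those keys than replacing the whole metadata.annotations map.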

Many thanks to community member @Nimbus318 for the patient help in locating and resolving this issue. Thanks again!