Project-HAMi / HAMi

Heterogeneous AI Computing Virtualization Middleware
http://project-hami.io/
Apache License 2.0

Scheduler resource accounting is inaccurate, which can cause scheduling failures due to insufficient resources. #594

Closed · PhilixHe closed this issue 2 weeks ago

PhilixHe commented 2 weeks ago

Please provide an in-depth description of the question you have:

  1. After changing the GPU resource size of a Deployment, the old Pod is deleted but hami-scheduler does not release the vGPU resources it occupied, which can lead to scheduling failures due to insufficient resources.
  2. Comparing the metrics fetched from the scheduler and from the device-plugin: the pod counted by the scheduler (hefei-ollama-6bdd9654bf-ch5cc) was already deleted after the Deployment update but still shows up in the scheduler's data, while the device-plugin's data is correct. (screenshot attached)

What do you think about this question?:

Environment:

Nimbus318 commented 2 weeks ago

Changing the GPU resource size of a Deployment

can cause scheduling failures due to insufficient resources

  1. Please share the Deployment manifests from before and after the resource change.
  2. What is the scheduling-failure Event (kubectl describe pod)?
  3. What does "can" mean here: does scheduling fail only occasionally, or does the Pod stay Pending for a while and then go Running?
PhilixHe commented 2 weeks ago

Changing the GPU resource size of a Deployment

can cause scheduling failures due to insufficient resources

  1. Please share the Deployment manifests from before and after the resource change.
  2. What is the scheduling-failure Event (kubectl describe pod)?
  3. What does "can" mean here: does scheduling fail only occasionally, or does the Pod stay Pending for a while and then go Running?
  1. The scheduling failure happens because the GPU resource accounting held by the deployed hami-scheduler is inaccurate. For example: after the Deployment below is deployed and scheduled successfully, curl {scheduler node ip}:31993/metrics reports GPUDeviceSharedNum=1. When the Deployment's GPU request is edited (or the Pod is deleted in place), the old Pod is removed (Terminating finishes and kubectl get no longer shows it) and a new Pod is created, scheduled, and runs from the updated template. Running curl {scheduler node ip}:31993/metrics again now shows GPUDeviceSharedNum=2, and the vGPUCorePercentage, vGPUMemoryPercentage, and vGPUPodsDeviceAllocated gauges contain the old Pod's entries in addition to the new Pod's; in other words, the old Pod's resources were never released by hami-scheduler.
  2. The deployed Deployment:
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      annotations:
        deployment.kubernetes.io/revision: "2"
        kenc.ksyun.com/username: philix_he
        ovn.kubernetes.io/limit_rate: "8000"
        ovn.kubernetes.io/limit_rate_kind: pod
      creationTimestamp: "2024-11-07T02:34:34Z"
      generation: 3
      labels:
        app: hefei-test1
        kenc.ksyun.com/app: hefei-test1
        kenc.ksyun.com/calculation: container
      name: hefei-test1
      namespace: default-2000174887
      resourceVersion: "10478859"
      uid: de1d4b6b-0361-4ef3-a047-eebbae8c4574
    spec:
      progressDeadlineSeconds: 600
      replicas: 1
      revisionHistoryLimit: 10
      selector:
        matchLabels:
          app: hefei-test1
      strategy:
        rollingUpdate:
          maxSurge: 20%
          maxUnavailable: 1
        type: RollingUpdate
      template:
        metadata:
          annotations:
            kenc.ksyun.com/controller-kind: deployment
          creationTimestamp: null
          labels:
            app: hefei-test1
            kenc.ksyun.com/app: hefei-test1
            kenc.ksyun.com/calculation: container
        spec:
          containers:
          - env:
            - name: REGION
              value: jxmp37
            image: registry.kenc.com/kenc/nginx:1.16.0
            imagePullPolicy: IfNotPresent
            name: hefei-test1
            resources:
              limits:
                cpu: "2"
                memory: 2Gi
                nvidia.com/gpu: "1"
                nvidia.com/gpumem: "2048"
              requests:
                cpu: "1"
                memory: 2Gi
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          imagePullSecrets:
          - name: image-secret
          priorityClassName: client-normal-priority
          restartPolicy: Always
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
    status:
      availableReplicas: 1
      conditions:
      - lastTransitionTime: "2024-11-07T02:34:34Z"
        lastUpdateTime: "2024-11-07T02:34:34Z"
        message: Deployment has minimum availability.
        reason: MinimumReplicasAvailable
        status: "True"
        type: Available
      - lastTransitionTime: "2024-11-07T02:34:34Z"
        lastUpdateTime: "2024-11-07T02:37:15Z"
        message: ReplicaSet "hefei-test1-f97ddd94b" has successfully progressed.
        reason: NewReplicaSetAvailable
        status: "True"
        type: Progressing
      observedGeneration: 3
      readyReplicas: 1
      replicas: 1
      updatedReplicas: 1

    GPU node metrics (scraped twice, covering the state before and after the old Pod was deleted; a parsing sketch follows the second dump):

[root@cdnkencjxmp35-edge11 ~]# curl 10.35.211.17:31993/metrics
# HELP GPUDeviceCoreAllocated Device core allocated for a certain GPU
# TYPE GPUDeviceCoreAllocated gauge
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 0
# HELP GPUDeviceCoreLimit Device memory core limit for a certain GPU
# TYPE GPUDeviceCoreLimit gauge
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 100
# HELP GPUDeviceMemoryAllocated Device memory allocated for a certain GPU
# TYPE GPUDeviceMemoryAllocated gauge
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 2.147483648e+09
# HELP GPUDeviceMemoryLimit Device memory limit for a certain GPU
# TYPE GPUDeviceMemoryLimit gauge
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 2.5769803776e+10
# HELP GPUDeviceSharedNum Number of containers sharing this GPU
# TYPE GPUDeviceSharedNum gauge
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 1
# HELP nodeGPUMemoryPercentage GPU Memory Allocated Percentage on a certain GPU
# TYPE nodeGPUMemoryPercentage gauge
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 0.08333333333333333
# HELP nodeGPUOverview GPU overview on a certain node
# TYPE nodeGPUOverview gauge
nodeGPUOverview{devicecores="0",deviceidx="0",devicememorylimit="24576",devicetype="NVIDIA-NVIDIA GeForce RTX 3090",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",sharedcontainers="1",zone="vGPU"} 2.147483648e+09
# HELP vGPUCorePercentage vGPU core allocated from a container
# TYPE vGPUCorePercentage gauge
vGPUCorePercentage{containeridx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-rhh9t",podnamespace="default-2000174887",zone="vGPU"} 0
# HELP vGPUMemoryPercentage vGPU memory percentage allocated from a container
# TYPE vGPUMemoryPercentage gauge
vGPUMemoryPercentage{containeridx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-rhh9t",podnamespace="default-2000174887",zone="vGPU"} 0.08333333333333333
# HELP vGPUPodsDeviceAllocated vGPU Allocated from pods
# TYPE vGPUPodsDeviceAllocated gauge
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-rhh9t",podnamespace="default-2000174887",zone="vGPU"} 2.147483648e+09
[root@cdnkencjxmp35-edge11 ~]#
[root@cdnkencjxmp35-edge11 ~]# curl 10.35.211.17:31993/metrics
# HELP GPUDeviceCoreAllocated Device core allocated for a certain GPU
# TYPE GPUDeviceCoreAllocated gauge
GPUDeviceCoreAllocated{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 0
# HELP GPUDeviceCoreLimit Device memory core limit for a certain GPU
# TYPE GPUDeviceCoreLimit gauge
GPUDeviceCoreLimit{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 100
# HELP GPUDeviceMemoryAllocated Device memory allocated for a certain GPU
# TYPE GPUDeviceMemoryAllocated gauge
GPUDeviceMemoryAllocated{devicecores="0",deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 4.294967296e+09
# HELP GPUDeviceMemoryLimit Device memory limit for a certain GPU
# TYPE GPUDeviceMemoryLimit gauge
GPUDeviceMemoryLimit{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 2.5769803776e+10
# HELP GPUDeviceSharedNum Number of containers sharing this GPU
# TYPE GPUDeviceSharedNum gauge
GPUDeviceSharedNum{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 2
# HELP nodeGPUMemoryPercentage GPU Memory Allocated Percentage on a certain GPU
# TYPE nodeGPUMemoryPercentage gauge
nodeGPUMemoryPercentage{deviceidx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",zone="vGPU"} 0.16666666666666666
# HELP nodeGPUOverview GPU overview on a certain node
# TYPE nodeGPUOverview gauge
nodeGPUOverview{devicecores="0",deviceidx="0",devicememorylimit="24576",devicetype="NVIDIA-NVIDIA GeForce RTX 3090",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodeid="10.35.211.17",sharedcontainers="2",zone="vGPU"} 4.294967296e+09
# HELP vGPUCorePercentage vGPU core allocated from a container
# TYPE vGPUCorePercentage gauge
vGPUCorePercentage{containeridx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-rhh9t",podnamespace="default-2000174887",zone="vGPU"} 0
vGPUCorePercentage{containeridx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-vmzkx",podnamespace="default-2000174887",zone="vGPU"} 0
# HELP vGPUMemoryPercentage vGPU memory percentage allocated from a container
# TYPE vGPUMemoryPercentage gauge
vGPUMemoryPercentage{containeridx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-rhh9t",podnamespace="default-2000174887",zone="vGPU"} 0.08333333333333333
vGPUMemoryPercentage{containeridx="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-vmzkx",podnamespace="default-2000174887",zone="vGPU"} 0.08333333333333333
# HELP vGPUPodsDeviceAllocated vGPU Allocated from pods
# TYPE vGPUPodsDeviceAllocated gauge
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-rhh9t",podnamespace="default-2000174887",zone="vGPU"} 2.147483648e+09
vGPUPodsDeviceAllocated{containeridx="0",deviceusedcore="0",deviceuuid="GPU-bfc719f8-e2ae-9163-4d8e-79f5d1519806",nodename="10.35.211.17",podname="hefei-test1-f97ddd94b-vmzkx",podnamespace="default-2000174887",zone="vGPU"} 2.147483648e+09
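
To make the stale entry easier to spot, below is a minimal sketch (my own illustration, not part of HAMi) that scrapes the scheduler metrics endpoint shown above and prints every pod still accounted for in the vGPUPodsDeviceAllocated gauge; any pod listed here that no longer appears in kubectl get pods is one whose allocation the scheduler has not released. The endpoint address and metric/label names are taken from the dumps above, and the standard Prometheus expfmt text parser is assumed.

    package main

    import (
        "fmt"
        "log"
        "net/http"

        "github.com/prometheus/common/expfmt"
    )

    func main() {
        // Scheduler metrics endpoint from the dumps above.
        resp, err := http.Get("http://10.35.211.17:31993/metrics")
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()

        // Parse the Prometheus text exposition format.
        var parser expfmt.TextParser
        families, err := parser.TextToMetricFamilies(resp.Body)
        if err != nil {
            log.Fatal(err)
        }

        // Print every pod the scheduler still accounts for on each device.
        mf, ok := families["vGPUPodsDeviceAllocated"]
        if !ok {
            log.Fatal("metric vGPUPodsDeviceAllocated not found")
        }
        for _, m := range mf.GetMetric() {
            labels := map[string]string{}
            for _, lp := range m.GetLabel() {
                labels[lp.GetName()] = lp.GetValue()
            }
            fmt.Printf("pod=%s/%s device=%s allocated=%.0f bytes\n",
                labels["podnamespace"], labels["podname"],
                labels["deviceuuid"], m.GetGauge().GetValue())
        }
    }

Against the second dump above, this would list both hefei-test1-f97ddd94b-rhh9t and hefei-test1-f97ddd94b-vmzkx even though only one of them still exists in the cluster.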
Nimbus318 commented 2 weeks ago

[Progress update]

Debugged together over a direct chat:

  1. The scheduler filtered and scored nodes based on the requested GPU resources, successfully assigned a node and a GPU, and patched the allocation result into the Pod's annotations.
  2. The device plugin also decoded the annotations successfully, performed the Allocate-related operations, and released the node lock.
  3. Running nvidia-smi inside the container shows the expected GPU memory.
  4. The whole flow above was confirmed from the scheduler and device-plugin logs, and everything went through cleanly.

However, once the Pod is Running it no longer carries any hami-related annotations; the current suspicion is a conflict with another component in the cluster.
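
For reference, a quick way to confirm whether a Running Pod has lost its HAMi annotations is to dump its annotations and filter by key prefix, for example with kubectl get pod -o jsonpath='{.metadata.annotations}'. The client-go sketch below does the same check; it is only an illustration (not HAMi code), assumes the scheduler writes its allocation under keys starting with hami.io/, and reuses the pod/namespace names from the metrics above, so adjust both for your environment and HAMi version.

    package main

    import (
        "context"
        "fmt"
        "log"
        "strings"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Build a client from the local kubeconfig (~/.kube/config).
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            log.Fatal(err)
        }
        client, err := kubernetes.NewForConfig(cfg)
        if err != nil {
            log.Fatal(err)
        }

        // Pod and namespace names taken from the metrics dumps above.
        pod, err := client.CoreV1().Pods("default-2000174887").
            Get(context.TODO(), "hefei-test1-f97ddd94b-vmzkx", metav1.GetOptions{})
        if err != nil {
            log.Fatal(err)
        }

        // Assumed prefix for HAMi scheduler annotations; adjust for your version.
        found := false
        for k, v := range pod.Annotations {
            if strings.HasPrefix(k, "hami.io/") {
                fmt.Printf("%s = %s\n", k, v)
                found = true
            }
        }
        if !found {
            fmt.Println("no hami.io/* annotations left on the pod")
        }
    }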

PhilixHe commented 2 weeks ago

[Progress update]

Debugged together over a direct chat:

  1. The scheduler filtered and scored nodes based on the requested GPU resources, successfully assigned a node and a GPU, and patched the allocation result into the Pod's annotations.
  2. The device plugin also decoded the annotations successfully, performed the Allocate-related operations, and released the node lock.
  3. Running nvidia-smi inside the container shows the expected GPU memory.
  4. The whole flow above was confirmed from the scheduler and device-plugin logs, and everything went through cleanly.

However, once the Pod is Running it no longer carries any hami-related annotations; the current suspicion is a conflict with another component in the cluster.

The problem above has been resolved. 🎉

The root cause matches @Nimbus318's suspicion: another component in the cluster also modifies annotations while the GPU Pod is being created, which overwrote the HAMi annotations; as a result, when the GPU Pod was deleted, hami-scheduler never reclaimed its GPU resource quota.
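
As a purely hypothetical sketch of how such a conflict can arise (the controller and the example.com annotation key below are made up, not the actual offending component): a controller that rebuilds a Pod's annotation map from its own template and calls Update silently drops the keys the HAMi scheduler patched in after scheduling, while a merge Patch that carries only the controller's own key leaves them intact.

    // Hypothetical sketch of the failure mode, written against client-go.
    package demo

    import (
        "context"
        "encoding/json"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/types"
        "k8s.io/client-go/kubernetes"
    )

    // clobber replaces the whole annotation map before calling Update, so any
    // hami.io/* annotations added by the scheduler after Pod creation are lost.
    func clobber(ctx context.Context, c kubernetes.Interface, pod *corev1.Pod) error {
        pod.Annotations = map[string]string{ // rebuilt from the component's own template
            "example.com/managed-by": "some-controller",
        }
        _, err := c.CoreV1().Pods(pod.Namespace).Update(ctx, pod, metav1.UpdateOptions{})
        return err
    }

    // merge adds/updates only the controller's own key via a merge patch,
    // leaving every other annotation (including HAMi's) untouched.
    func merge(ctx context.Context, c kubernetes.Interface, pod *corev1.Pod) error {
        patch, err := json.Marshal(map[string]interface{}{
            "metadata": map[string]interface{}{
                "annotations": map[string]string{
                    "example.com/managed-by": "some-controller",
                },
            },
        })
        if err != nil {
            return err
        }
        _, err = c.CoreV1().Pods(pod.Namespace).Patch(ctx, pod.Name,
            types.MergePatchType, patch, metav1.PatchOptions{})
        return err
    }

Components that own only a few annotation keys are generally safer patching those keys than replacing the whole metadata.annotations map.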

Many thanks to community member @Nimbus318 for the patient help in locating and resolving this issue. Thanks again!