aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

GPU workload pod failed to schedule due to low-priority gpu placeholder daemonset #6507

Closed · WxFang closed 1 month ago

WxFang commented 1 month ago

Description

Observed Behavior: We're trying to migrate our GPU node pool (p5.48xlarge) to Karpenter, but workload pods stay pending and scheduling-failure events are emitted. Each node should run 6 daemonset pods that consume very little CPU/RAM, and each workload pod requests 8 GPUs, so a node can hold at most 7 pods in total (6 daemonset pods plus 1 workload pod). But in the event log below, the workload pod's GPU request appears to be double counted:

    daemonset overhead={"cpu":"191250m","memory":"1995268435456","nvidia.com/gpu":"8","pods":"7"},
    no instance type satisfied resources {"cpu":"271250m","memory":"1995268437456","nvidia.com/gpu":"16","pods":"8"}
    and requirements karpenter.sh/capacity-type In [on-demand], karpenter.sh/nodepool In [gpu-p5-48-llm],
    kubernetes.io/arch In [amd64], kubernetes.io/os In [linux],
    node.kubernetes.io/instance-type In [p5.48xlarge], nodepool In [gpu-p5-48-llm],
    nvidia.com/gpu In [true] (no instance type has enough resources)
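Spelling out the arithmetic in that event (a p5.48xlarge exposes 8 NVIDIA H100 GPUs): the simulated daemonset overhead already claims all 8 GPUs, so adding the workload pod's request can never fit on any allowed instance type:

    cpu:            191250m       + 80000m = 271250m
    memory:         1995268435456 + 2000   = 1995268437456
    nvidia.com/gpu: 8             + 8      = 16   (> 8 available)
    pods:           7             + 1      = 8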

Expected Behavior: the daemonset overhead should not include the 8 GPUs requested by the low-priority placeholder daemonset, and the counted daemonset pods should be 6.

Reproduction Steps (Please include YAML):

Workload pod container resources:

    resources:
      limits:
        nvidia.com/gpu: "8"
      requests:
        cpu: "80"
        memory: 2k
        nvidia.com/gpu: "8"

NodePool spec:

spec:
  disruption:
    budgets:
    - nodes: 10%
    consolidateAfter: Never
    consolidationPolicy: WhenUnderutilized
    expireAfter: Never
  limits:
    cpu: "1344"
    nvidia.com/gpu: "56"
  template:
    metadata:
      annotations:
        karpenter.sh/do-not-disrupt: "true"
    spec:
      nodeClassRef:
        name: gpu-p5-48-llm
      requirements:
      - key: node.kubernetes.io/instance-type
        operator: In
        values:
        - p5.48xlarge
      - key: nvidia.com/gpu
        operator: In
        values:
        - "true"
      - key: karpenter.sh/capacity-type
        operator: In
        values:
        - on-demand
      - key: nodepool
        operator: In
        values:
        - gpu-p5-48-llm
      - key: kubernetes.io/arch
        operator: In
        values:
        - amd64
      - key: kubernetes.io/os
        operator: In
        values:
        - linux
      taints:
      - effect: NoSchedule
        key: llm
        value: "true"
      - effect: NoSchedule
        key: nvidia.com/gpu
        value: "true"

Versions:

WxFang commented 1 month ago

Update on the context: we have also deployed a daemonset called cost-placeholder. The idea is to run low-priority placeholder pods so that idle capacity shows up in cost attribution. Its spec looks like this:

    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nodepool
                operator: In
                values:
                - gpu-p5-48-llm

      resources:
          limits:
            cpu: "191"
            memory: 1995G
            nvidia.com/gpu: "8"
          requests:
            cpu: "191"
            memory: 1995G
            nvidia.com/gpu: "8"
      priorityClassName: low-priority
      restartPolicy: Always
      schedulerName: default-scheduler

This hacky setup works fine with the Cluster Autoscaler (CAS).
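For reference, the issue doesn't show the low-priority PriorityClass the daemonset uses; a sketch with assumed values might look like:

    apiVersion: scheduling.k8s.io/v1
    kind: PriorityClass
    metadata:
      name: low-priority
    value: -1000000   # assumed: low enough that any real workload preempts the placeholder
    globalDefault: false
    preemptionPolicy: PreemptLowerPriority
    description: Placeholder pods for idle-cost attribution.

Presumably this works with CAS because the Cluster Autoscaler treats pods whose priority is below its --expendable-pods-priority-cutoff (default -10) as expendable and ignores them when deciding whether to scale up, whereas Karpenter simulates scheduling for every pod regardless of priority.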

jmdeal commented 1 month ago

I'm going to close this issue out here; could you open it as a feature request in the upstream repo, kubernetes-sigs/karpenter? This is currently intended behavior: Karpenter attempts to schedule all pods, including those with low priority.