aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Karpenter provisions multiple duplicate nodeclaims / nodes for a single pod, single GPU workload #6355

Open jcmcken opened 3 weeks ago

jcmcken commented 3 weeks ago

Description

Observed Behavior:

  1. Start with zero NVIDIA GPU nodes in the cluster.
  2. Configure a node pool to automatically provision GPU nodes on request. (See config below)
  3. Launch a CUDA vector add sample workload that requests 1 GPU (in our case, we launch it as a Job -- see config below)
  4. Observe the Karpenter logs. You'll notice that it finds provisionable pods multiple times for the exact same pod. It also creates multiple nodeclaims and launches multiple nodes. (In tests I've run, it tends to create 3 nodeclaims and provision 2 actual instances, all within the same AZ; see the watch commands sketched after this list.)
  5. After a while, the Job (step 3) completes.
  6. Because of consolidation, all of the multiple nodes and nodeclaims related to this GPU workload get cleaned up.
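
A quick way to watch the duplicates appear while the Job is pending (a rough sketch; it assumes the karpenter.sh/v1beta1 NodeClaim resources from the config below and that Karpenter runs as a deployment named karpenter in the karpenter namespace -- adjust to your install):

# Watch NodeClaims and nodes labeled for the gpu-pool NodePool
kubectl get nodeclaims -l karpenter.sh/nodepool=gpu-pool -w
kubectl get nodes -l karpenter.sh/nodepool=gpu-pool -w

# In another terminal, follow the controller logs for the repeated
# "found provisionable pods" entries
kubectl logs -n karpenter deployment/karpenter -f | grep "found provisionable pods"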

It seems to provision multiple instances, and one "wins" and receives the workload, after which the duds get cleaned up. It's almost as if, when the workload hasn't scheduled onto the node within a few seconds, Karpenter tries again as though it thinks the node is bad. Yet the logs don't say anything like that, and it doesn't try to terminate the extra nodes until the Job is complete. Or else it's running its reconciliation loop in multiple "threads" or contexts that aren't communicating properly, duplicating the provisioning work.

I've also tried enabling DEBUG logging, but I don't see anything particularly useful.

Expected Behavior:

Only a single GPU nodeclaim and node gets provisioned.

Reproduction Steps (Please include YAML):

Workload:

apiVersion: batch/v1
kind: Job
metadata:
  name: vector-add
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: vector-add
        image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda-latest
        imagePullPolicy: Always
        command: ["/cuda-samples/sample"]
        resources:
          limits:
            cpu: "1000m"
            memory: "1Gi"
            nvidia.com/gpu: 1
          requests:
            cpu: "500m"
            memory: "500Mi"
            nvidia.com/gpu: 1

Node pool:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-pool
spec:
  template:
    spec:
      # See https://karpenter.sh/docs/reference/instance-types/
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values:
          - g
          - p
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values:
          - xlarge
          - 2xlarge
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values:
          - t4   # g4
          - a10g # g5
          - l4   # g6
          - v100 # p3
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
      taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
      nodeClassRef:
        name: gpu-pool
  limits:
    cpu: 100
    memory: 500Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # 30 * 24h = 720h
    budgets:
    - nodes: "1"

Node class:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: gpu-pool
spec:
  amiFamily: Bottlerocket
  amiSelectorTerms:
  - id: ami-00cfbfa2d5b5c2711
  instanceProfile: "<sanitized>"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "sanitized"
  securityGroupSelectorTerms:
    - tags:
        "aws:eks:cluster-name": "sanitized"
  userData: |
    [settings.host-containers.admin]
    enabled = true
    [settings.kernel.sysctl]
    "net.ipv4.tcp_keepalive_intvl" = "75"
    "net.ipv4.tcp_keepalive_probesc" =  "9"
    "net.ipv4.tcp_keepalive_time" = "300"
    [metrics]
    send-metrics =  false
    motd =  "Hello, eksctl!"
    [settings.pki.bundle1]
    sanitized
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 1
    httpTokens: required

  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 4Gi
        volumeType: gp3
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 225Gi
        volumeType: gp3

Versions:

jcmcken commented 3 weeks ago

As a side note, this behavior also occurs if I scale Karpenter down to a single pod, so it doesn't seem related to having multiple Karpenter replicas.

jmdeal commented 3 weeks ago

Can you share your logs? Seeing multiple instances of "found provisionable pods" is expected, but Karpenter schedules against in-flight nodes as well as existing nodes in the cluster so this shouldn't result in duplicates.

jcmcken commented 3 weeks ago

I attached a stern log, starting from the first time "found provisionable pods" appears in the log for a particular pod. I sanitized a bunch of IDs in this log just to be safe.

In the logs, 3 separate nodes get provisioned. This is using the Job workload I put in the OP.

stern.log

jmdeal commented 2 weeks ago

Interesting, based on the logs it looks like Karpenter no longer believed the pod would schedule to the node once it had registered, which triggered the provisioner to create a new NodeClaim. Are there any components adding a taint to the node that the Job does not tolerate, which is later removed? That would explain this behavior.

jcmcken commented 2 weeks ago

That might be it, then. Yes, we use Karpenter's startup taint capability because we need things like CNIs, CSIs, log forwarding, and security tool daemonsets to be started and healthy on the node before any pods get scheduled. We have something running in the cluster that checks the status of all those daemonsets and then removes this taint once they become healthy.

It's similar to the use case described here, except we're waiting for a few more workloads, not just Cilium.

Prior to this tooling, the node would become Ready, actual workload pods (i.e. not cluster facilities like CSI, CNI, etc.) would schedule to the new node, and they would enter a fail loop (with backoffs) since CSI, etc. wasn't ready on the node yet. This made rolling the cluster nodes much slower, since we had to wait for all of these failed pods to finish reconciling. With the startup taint approach, we don't even allow the pods to schedule until our business rules are satisfied.
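
For reference, the removal our tool performs is just an ordinary taint deletion once everything reports healthy, roughly like this (the taint key here is a placeholder, not our real one):

# The trailing "-" deletes the taint with that key and effect from the node
kubectl taint nodes <node-name> example.com/startup-not-ready:NoSchedule-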

jmdeal commented 2 weeks ago

Got it, I didn't see any startup taints added to your NodePool. Are you still running into this problem with those startup taints defined? Without them defined, Karpenter doesn't realize those taints will be removed and won't consider the pod compatible.

jcmcken commented 2 weeks ago

Sorry, you're right. This GPU node pool doesn't have the startup taints; I was thinking of our general workload pool in this cluster, which does have them. I just double-checked gpu-pool for this cluster (the config pasted in the OP) and it does not have the startupTaints field defined.

Just to be clear, Karpenter adds the taints when it provisions the nodes. Right now, our custom tool runs as a CronJob every minute. It does the following:

jcmcken commented 2 weeks ago

I tried adding the startup taints to the GPU pool just to see if it performed any differently. It doesn't seem so.

One thing I noticed is that new nodes seem to come up every 20-30s until the pod successfully schedules. It seems relatively predictable. I'm not sure if there's a duration like that hard-coded somewhere.

jmdeal commented 2 weeks ago

What I had noticed was that new nodes were created after a previously provisioned node had registered. This indicates to me that, at that point, Karpenter no longer believed the pod could schedule to the node. Are you able to check what the node looks like when it's first created (e.g. from the body of the CREATE request)?

jcmcken commented 2 weeks ago

Is there anything in particular I should look for? From observation, I see a node come up within a few seconds, starting in a NotReady state. After another few seconds it goes from NotReady to Ready. (We're using Bottlerocket, so it boots and becomes ready very quickly.) But at that point, the node is still not technically ready from the point of view of our startup taint / custom process, so it will stay Ready, but with the taint, for between 30 and 90 seconds. This delay comes from some commercial and other tools we're using that can take a while to fully pass their status checks.

From what you're suggesting, it sounds as if Karpenter considers a Ready node that the pod can't schedule onto to be bad in some way, so it kicks off another node. If so, is there any way to configure that wait period? I didn't notice any relevant settings browsing the docs, but I might've missed them.

I'll try to capture the state of the Node objects in more detail as it progresses from boot to ready to "really ready"

jmdeal commented 2 weeks ago

There's no explicit wait period, but if a node comes up and doesn't match what Karpenter expects (e.g. there are startup taints that weren't defined on the NodePool), Karpenter will no longer schedule the pending pod against that node in its scheduling simulation, resulting in a new NodeClaim being created. It's hard to say exactly what the cause is without knowing what the node looks like once it registers and through initialization. API server audit logs are honestly the best mechanism to diagnose this, but if things aren't changing too quickly, a kubectl get in that in-between period could do the trick.
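
For example, something like this (plain kubectl, nothing Karpenter-specific) against the node name from the NodeClaim should show whether an unexpected taint appears between registration and initialization:

# Poll the node's taints every few seconds as it goes from registration to Ready
watch -n 5 "kubectl get node <node-name> -o jsonpath='{.spec.taints}'"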

jcmcken commented 2 weeks ago

There shouldn't be any taints on the nodes that aren't defined in the node pool, but I'll check. I'm not sure how we would even configure that -- it's Karpenter launching these nodes, after all. We don't have any custom boot process for these nodes outside of the Karpenter configs.

jcmcken commented 2 weeks ago

So I do see additional taints present, none of which we add explicitly. It looks like the Kubernetes node controller adds some automatically, and Cilium adds one automatically. For example, if you look here, I see node.kubernetes.io/not-ready and node.cloudprovider.kubernetes.io/uninitialized present. I also see node.cilium.io/agent-not-ready present. We don't explicitly configure these to be added, but maybe it's the default configuration for these components.

I guess, do we need to add these to our NodePool configs to make Karpenter aware of them? I just wonder if that makes sense. For example, normally (I'm guessing) the node controller controls the entire lifecycle (adding and deleting) of node.kubernetes.io/not-ready, so it seems strange to tell Karpenter to also add node.kubernetes.io/not-ready. I hope there wouldn't be a race condition where the node controller adds and then removes a given taint, and then Karpenter adds it back and it remains indefinitely. I'm not sure that's even a worry; I'm not entirely sure how all of this works.

jcmcken commented 2 weeks ago

For Cilium, I see this Helm value. But we don't set this. We use the default of false. I also checked the deployed configmap and it's definitely not set. Curious

EDIT: Sorry, wrong setting. It's been a long day. This is the relevant setting, so I guess this defaults to true.

Anyway, I'll try messing with the startup taints to see if there's any different behavior.

jmdeal commented 2 weeks ago

You shouldn't need to account for the cloud provider or node lifecycle taints, but you would for the Cilium taint. This is actually the motivating example we use in our docs for the startupTaints field. If that taint is being added without the startup taint in the NodePool, that likely explains the behavior.
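
Roughly, that would look like the following on the gpu-pool NodePool from the OP (only the startupTaints addition is shown; match the value and effect to the taint Cilium actually applies on your nodes, e.g. NoExecute vs. NoSchedule depending on your Cilium configuration):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-pool
spec:
  template:
    spec:
      # ...existing requirements, taints, and nodeClassRef stay as-is...
      startupTaints:
      - key: node.cilium.io/agent-not-ready
        value: "true"
        effect: NoExecute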

jcmcken commented 2 weeks ago

Yep! It looks like that was it. It's strange that all the docs talk as if you need to add these taints on your own (for example, these Cilium docs), but it turns out Cilium does it for you. Well, glad it turned out to be something simple.

I wonder if there should be an improvement to the logs to indicate what's happening, because even with the level set to debug, it's not really explanatory.

jcmcken commented 2 weeks ago

Is there any interaction between taints and startupTaints in the NodePool? What I'm observing now is that my static GPU taints are no longer added when I set startupTaints. This taint from my NodePool (see OP) no longer appears on nodes:

      taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule

EDIT: Nevermind, there's something else going on unrelated to Karpenter.


Another issue I'm noticing is that altering the startup taints appears to cause Karpenter to disrupt all the nodes in the node pool. That seems like it might be unintended? I didn't change the taints key at all, just startupTaints. (EDIT: Looks like there's an issue for this already.)

jcmcken commented 1 week ago

I think my issue is resolved. The cause was the startup taint that we didn't realize Cilium was adding. Adding a matching entry to our NodePool fixed the issue.

I'm not sure I want to close this issue necessarily, but I suppose I would be fine with it. If there were some improvements to the log messages when it runs into this behavior, that would be ideal. But otherwise having this conversation in the issue history is fine too