aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Karpenter provisions multiple duplicate nodeclaims / nodes for a single pod, single GPU workload #6355

Open jcmcken opened 3 weeks ago

jcmcken commented 3 weeks ago

Description

Observed Behavior:

  1. Start with zero NVIDIA GPU nodes in the cluster.
  2. Configure a node pool to automatically provision GPU nodes on request. (See config below)
  3. Launch a CUDA vector add sample workload that requests 1 GPU (in our case, we launch it as a Job -- see config below)
  4. Observe the Karpenter logs. You'll notice that it finds provisionable pods multiple times for the exact same pod. It also creates multiple nodeclaims and launches multiple nodes. (In tests I've run, it tends to create 3 nodeclaims and provision 2 actual instances, all within the same AZ; see the watch commands sketched after this list.)
  5. After a while, the Job (step 3) completes.
  6. Because of consolidation, all of the multiple nodes and nodeclaims related to this GPU workload get cleaned up.
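
A quick way to watch the duplicates appear while the Job is pending (a rough sketch; it assumes the karpenter.sh/v1beta1 NodeClaim resources from the config below and that Karpenter runs as a deployment named karpenter in the karpenter namespace -- adjust to your install):

# Watch NodeClaims and nodes labeled for the gpu-pool NodePool
kubectl get nodeclaims -l karpenter.sh/nodepool=gpu-pool -w
kubectl get nodes -l karpenter.sh/nodepool=gpu-pool -w

# In another terminal, follow the controller logs for the repeated
# "found provisionable pods" entries
kubectl logs -n karpenter deployment/karpenter -f | grep "found provisionable pods"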

It seems to provision multiple instances, and one "wins" and receives the workload, after which the duds get cleaned up. It's almost as if, when the workload hasn't scheduled onto the node within a few seconds, Karpenter tries again as though it thinks the node is bad. Yet the logs don't say anything like that, and it doesn't try to terminate the extra nodes until the Job is complete. Or else it's running its reconciliation loop in multiple "threads" or contexts that aren't communicating properly, duplicating the provisioning work.

I've also tried enabling DEBUG logging, but I don't see anything particularly useful.

Expected Behavior:

Only a single GPU nodeclaim and node gets provisioned.

Reproduction Steps (Please include YAML):

Workload:

apiVersion: batch/v1
kind: Job
metadata:
  name: vector-add
spec:
  template:
    spec:
      restartPolicy: OnFailure
      containers:
      - name: vector-add
        image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda-latest
        imagePullPolicy: Always
        command: ["/cuda-samples/sample"]
        resources:
          limits:
            cpu: "1000m"
            memory: "1Gi"
            nvidia.com/gpu: 1
          requests:
            cpu: "500m"
            memory: "500Mi"
            nvidia.com/gpu: 1

Node pool:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-pool
spec:
  template:
    spec:
      # See https://karpenter.sh/docs/reference/instance-types/
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values:
          - g
          - p
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values:
          - xlarge
          - 2xlarge
        - key: karpenter.k8s.aws/instance-gpu-name
          operator: In
          values:
          - t4   # g4
          - a10g # g5
          - l4   # g6
          - v100 # p3
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
        - key: "topology.kubernetes.io/zone"
          operator: In
          values: ["us-east-1a", "us-east-1b", "us-east-1c"]
      taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
      nodeClassRef:
        name: gpu-pool
  limits:
    cpu: 100
    memory: 500Gi
  disruption:
    consolidationPolicy: WhenUnderutilized
    expireAfter: 720h # 30 * 24h = 720h
    budgets:
    - nodes: "1"

Node class:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: gpu-pool
spec:
  amiFamily: Bottlerocket
  amiSelectorTerms:
  - id: ami-00cfbfa2d5b5c2711
  instanceProfile: "<sanitized>"
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: "sanitized"
  securityGroupSelectorTerms:
    - tags:
        "aws:eks:cluster-name": "sanitized"
  userData: |
    [settings.host-containers.admin]
    enabled = true
    [settings.kernel.sysctl]
    "net.ipv4.tcp_keepalive_intvl" = "75"
    "net.ipv4.tcp_keepalive_probesc" =  "9"
    "net.ipv4.tcp_keepalive_time" = "300"
    [metrics]
    send-metrics =  false
    motd =  "Hello, eksctl!"
    [settings.pki.bundle1]
    sanitized
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 1
    httpTokens: required

  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 4Gi
        volumeType: gp3
    - deviceName: /dev/xvdb
      ebs:
        volumeSize: 225Gi
        volumeType: gp3

Versions:

jcmcken commented 3 weeks ago

As a side note, this behavior also occurs if I scale Karpenter down to a single pod, so it doesn't seem related to having multiple Karpenter replicas.

jmdeal commented 3 weeks ago

Can you share your logs? Seeing multiple instances of "found provisionable pods" is expected, but Karpenter schedules against in-flight nodes as well as existing nodes in the cluster so this shouldn't result in duplicates.

jcmcken commented 3 weeks ago

I attached a stern log, starting from the first time "found provisionable pods" appears in the log for a particular pod. I sanitized a bunch of IDs in this log just to be safe.

In the logs, 3 separate nodes get provisioned. This is using the Job workload I put in the OP.

stern.log

jmdeal commented 2 weeks ago

Interesting, based on the logs it looks like Karpenter no longer believed the pod would schedule to the node once it had registered, which triggered the provisioner to create a new NodeClaim. Are there any components adding a taint to the node that the Job does not tolerate, which is later removed? That would explain this behavior.

jcmcken commented 2 weeks ago

That might be it, then. Yes, we use Karpenter's startup taint capability because we need things like CNIs, CSIs, log forwarding, and security tool daemonsets to be started and healthy on the node before any pods get scheduled. We have something running in the cluster that checks the status of all those daemonsets and then removes this taint once they become healthy.

It's similar to the use case described here, except we're waiting for a few more workloads, not just Cilium.

Prior to this tooling, the node would become Ready, actual workload pods (i.e. not cluster facilities like CSI, CNI, etc.) would schedule to the new node, and they would enter a fail loop (with backoffs) since CSI, etc. wasn't ready on the node yet. This made rolling the cluster nodes much slower, since we had to wait for all of these failed pods to finish reconciling. With the startup taint approach, we don't even allow the pods to schedule until our business rules are satisfied.
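
For reference, the removal our tool performs is just an ordinary taint deletion once everything reports healthy, roughly like this (the taint key here is a placeholder, not our real one):

# The trailing "-" deletes the taint with that key and effect from the node
kubectl taint nodes <node-name> example.com/startup-not-ready:NoSchedule-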

jmdeal commented 2 weeks ago

Got it, I didn't see any startup taints added to your NodePool. Are you still running into this problem with those startup taints defined? Without them defined, Karpenter doesn't realize those taints will be removed and won't consider the pod compatible.

jcmcken commented 2 weeks ago

Sorry, you're right. This GPU node pool doesn't have the startup taints; I was thinking of our general workload pool in this cluster, which does have them. I just double-checked gpu-pool for this cluster (the config pasted in the OP) and it does not have the startupTaints field defined.

Just to be clear, Karpenter adds the taints when it provisions the nodes. Right now, our custom tool runs as a CronJob every minute. It does the following:

jcmcken commented 2 weeks ago

I tried adding the startup taints to the GPU pool just to see if it performed any differently. It doesn't seem so.

One thing I noticed is that new nodes seem to come up every 20-30s until the pod successfully schedules. It seems relatively predictable. I'm not sure if there's a duration like that hard-coded somewhere.

jmdeal commented 2 weeks ago

What I had noticed was that new nodes were created after a previously provisioned node had registered. This indicates to me that, at that point, Karpenter no longer believed the pod could schedule to the node. Are you able to check what the node looks like when it's first created (e.g. from the body of the CREATE request)?

jcmcken commented 2 weeks ago

Is there anything in particular I should look for? From observation, I see a node come up within a few seconds, starting in a NotReady state. After another few seconds it goes from NotReady to Ready. (We're using Bottlerocket, so it boots and becomes ready very quickly.) But at that point, the node is still not technically ready from the point of view of our startup taint / custom process, so it will stay Ready, but with the taint, for between 30 and 90 seconds. This delay comes from some commercial and other tools we're using that can take a while to fully pass their status checks.

From what you're suggesting, it sounds as if Karpenter considers a Ready node that the pod can't schedule onto to be bad in some way, so it kicks off another node. If so, is there any way to configure that wait period? I didn't notice any relevant settings browsing the docs, but I might've missed them.

I'll try to capture the state of the Node objects in more detail as it progresses from boot to ready to "really ready"

jmdeal commented 2 weeks ago

There's no explicit wait period, but if a node comes up and doesn't match what Karpenter expects (e.g. there are startup taints that weren't defined on the NodePool), Karpenter will no longer schedule the pending pod against that node in its scheduling simulation, resulting in a new NodeClaim being created. It's hard to say exactly what the cause is without knowing what the node looks like once it registers and through initialization. API server audit logs are honestly the best mechanism to diagnose this, but if things aren't changing too quickly, a kubectl get in that in-between period could do the trick.
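
For example, something like this (plain kubectl, nothing Karpenter-specific) against the node name from the NodeClaim should show whether an unexpected taint appears between registration and initialization:

# Poll the node's taints every few seconds as it goes from registration to Ready
watch -n 5 "kubectl get node <node-name> -o jsonpath='{.spec.taints}'"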

jcmcken commented 2 weeks ago

There shouldn't be any taints on the nodes that aren't defined in the node pool, but I'll check. I'm not sure how we would even configure that -- it's Karpenter launching these nodes, after all. We don't have any custom boot process for these nodes outside of the Karpenter configs.

jcmcken commented 2 weeks ago

So I do see additional taints present, none of which we add explicitly. It looks like the Kubernetes node controller adds some automatically, and Cilium adds one automatically. For example, if you look here, I see node.kubernetes.io/not-ready and node.cloudprovider.kubernetes.io/uninitialized present. I also see node.cilium.io/agent-not-ready present. We don't explicitly configure these to be added, but maybe it's the default configuration for these components.

I guess, do we need to add these to our NodePool configs to make Karpenter aware of them? I just wonder if that makes sense. For example, normally (I'm guessing) the node controller controls the entire lifecycle (adding and deleting) of node.kubernetes.io/not-ready, so it seems strange to tell Karpenter to also add node.kubernetes.io/not-ready. I hope there wouldn't be a race condition where the node controller adds and then removes a given taint, and then Karpenter adds it back and it remains indefinitely. I'm not sure that's even a worry; I'm not entirely sure how all of this works.

jcmcken commented 2 weeks ago

For Cilium, I see this Helm value. But we don't set this. We use the default of false. I also checked the deployed configmap and it's definitely not set. Curious

EDIT: Sorry, wrong setting. It's been a long day. This is the relevant setting, so I guess this defaults to true.

Anyway, I'll try messing with the startup taints to see if there's any different behavior.

jmdeal commented 2 weeks ago

You shouldn't need to account for the cloud provider or node lifecycle taints, but you would for the Cilium taint. This is actually the motivating example we use in our docs for the startupTaints field. If that taint is being added without the startup taint in the NodePool, that likely explains the behavior.
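
Roughly, that would look like the following on the gpu-pool NodePool from the OP (only the startupTaints addition is shown; match the value and effect to the taint Cilium actually applies on your nodes, e.g. NoExecute vs. NoSchedule depending on your Cilium configuration):

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-pool
spec:
  template:
    spec:
      # ...existing requirements, taints, and nodeClassRef stay as-is...
      startupTaints:
      - key: node.cilium.io/agent-not-ready
        value: "true"
        effect: NoExecute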

jcmcken commented 2 weeks ago

Yep! It looks like that was it. It's strange that all the docs talk as if you need to add these taints on your own (for example, these Cilium docs), but it turns out Cilium does it for you. Well, glad it turned out to be something simple.

I wonder if there should be an improvement to the logs to indicate what's happening, because even with the level set to debug, it's not really explanatory.

jcmcken commented 2 weeks ago

Is there any interaction between taints and startupTaints in the NodePool? What I'm observing now is that my static GPU taints are no longer added when I set startupTaints. This taint from my NodePool (see OP) no longer appears on nodes:

      taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule

EDIT: Nevermind, there's something else going on unrelated to Karpenter.


Another issue I'm noticing is that altering the startup taints appears to cause Karpenter to disrupt all the nodes in the node pool. That seems like it might be unintended? I didn't change the taints key at all, just startupTaints. (EDIT: Looks like there's an issue for this already.)

jcmcken commented 1 week ago

I think my issue is resolved. The cause was the startup taint that we didn't realize Cilium was adding. Adding a matching entry to our NodePool fixed the issue.

I'm not sure I want to close this issue necessarily, but I suppose I would be fine with it. If there were some improvements to the log messages when it runs into this behavior, that would be ideal. But otherwise having this conversation in the issue history is fine too