jcmcken opened 3 weeks ago
As a side note, this behavior also occurs if I scale Karpenter down to a single pod, so it doesn't seem related to having multiple Karpenter replicas.
Can you share your logs? Seeing multiple instances of "found provisionable pods" is expected, but Karpenter schedules against in-flight nodes as well as existing nodes in the cluster so this shouldn't result in duplicates.
I attached a `stern` log, starting from the first time "found provisionable pods" appears in the log for a particular pod. I sanitized a bunch of IDs in this log just to be safe.
In the logs, 3 separate nodes get provisioned. This is using the `Job` workload I put in the OP.
Interesting, based on the logs it looks like Karpenter no longer believed the pod would schedule to the node once it had registered, triggering the provisioner to create a new NodeClaim. Are there any components adding a taint to the node that the job does not tolerate that's later removed? That would explain this behavior.
That might be it then. Yes, we use Karpenter's startup taint capability because we need things like CNIs, CSIs, log forwarding, and security tool daemonsets to be started and healthy on the node before any pods get scheduled. We have something running in the cluster that checks the status of all the daemonsets and removes this taint once they become healthy.
It's similar to the use case described here, except we're waiting for a few more workloads, not just Cilium.
Prior to this tooling, the node would become `Ready`, actual workload pods (i.e. not cluster facilities like CSI, CNI, etc.) would schedule to the new node, and they would enter a fail loop (with backoffs) since CSI, etc. wasn't ready on the node. This made rolling the cluster nodes much slower, since we had to wait for all of these failed pods to finish reconciling. With the startup taint approach, we don't even allow the pods to schedule until our business rules are satisfied.
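For anyone following along, a NodePool carrying such a startup taint might look like the following minimal sketch. This is an assumption-laden illustration, not config from this thread: the `karpenter.sh/v1beta1` schema matches the 0.36.x chart mentioned later, but the pool name and the `example.com/node-not-validated` taint key are hypothetical.

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: general-workload   # hypothetical pool name
spec:
  template:
    spec:
      # Karpenter applies startupTaints to nodes it launches and assumes
      # some external process will remove them later, so pending pods are
      # still treated as compatible with the node in its scheduling
      # simulation even while the taint is present.
      startupTaints:
        - key: example.com/node-not-validated   # hypothetical key
          value: "true"
          effect: NoSchedule
```

An external controller then removes the taint once the required daemonsets report healthy, at which point ordinary workloads can land on the node.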
Got it, I didn't see any startup taints added to your NodePool. Are you still running into this problem with those startup taints defined? Without them defined, Karpenter doesn't realize those taints will be removed and won't consider the pod compatible.
Sorry, you're right. This GPU node pool doesn't have the startup taints; I was thinking of our general workload pool in this cluster, which has them. I just double checked `gpu-pool` for this cluster (the config pasted in the OP) and it does not have the `startupTaints` field defined.
Just to be clear, Karpenter adds the taints when it provisions the nodes. Right now, our custom tool runs as a `CronJob` every minute. It does the following:
I tried adding the startup taints to the GPU pool just to see if it performed any differently. It doesn't seem so.
One thing I noticed is the new nodes seem to come up every 20-30s until the pod successfully schedules. It seems to be relatively predictable. I'm not sure if there's a duration like that hard-coded somewhere.
What I had noticed was that new nodes were created after a previously provisioned node had registered. This indicates to me that at that point Karpenter no longer believed the pod could schedule to the node. Are you able to check what the node looks like when it's first created (e.g. from the body of the CREATE request)?
Is there anything in particular I should look for? From observation, I see a node come up within a few seconds, starting in a `NotReady` state. After another few seconds it goes from `NotReady` to `Ready`. (We're using Bottlerocket, so it boots and becomes ready very quickly.) But at that point, the node is still not technically ready from the point of view of our startup taint / custom process, so it will stay `Ready`, but with the taint, for between 30 and 90 seconds. This delay comes from some commercial and other tools we're using that can take some time to fully pass their status checks.
From what you're suggesting, it sounds as if Karpenter considers a `Ready` node that the pod can't schedule to to be bad in some way, so it kicks off another node. If so, would there be any way to configure that wait period? I didn't notice any relevant settings browsing the docs, but I might've missed them.
I'll try to capture the state of the `Node` objects in more detail as they progress from boot to ready to "really ready".
There's no explicit wait period, but if a node comes up and doesn't match what Karpenter expects (e.g. there are startup taints that weren't defined on the NodePool), Karpenter will no longer schedule the pending pod against that node in its scheduling simulation, resulting in a new nodeclaim being created. It's hard to say exactly what the cause would be without knowing what the node looks like once it registers and through initialization. API server audit logs are honestly the best mechanism to diagnose this, but if it's not changing too quickly, a `kubectl get` in that in-between period could do the trick.
There should not be any taints on the nodes that aren't defined in the node pool, but I'll check. I'm not sure how you would even configure that -- it's Karpenter launching these nodes, after all. We don't have any custom boot process for these nodes outside of the Karpenter configs.
So I do see additional taints present, none of which we add explicitly. It looks like the Kubernetes node controller adds some automatically, and Cilium adds one automatically. For example, if you look here, I see `node.kubernetes.io/not-ready` and `node.cloudprovider.kubernetes.io/uninitialized` present. I also see `node.cilium.io/agent-not-ready` present. We don't explicitly configure these to be added, but maybe it's the default configuration for these components.
I guess, do we need to add these into our `NodePool` configs to make Karpenter aware of them? I just wonder if it makes sense. For example, normally (I'm guessing) the node controller controls the entire lifecycle (adding and deleting) of `node.kubernetes.io/not-ready`, so it seems strange to tell Karpenter to also add it. I hope there wouldn't be a race condition where the node controller adds and then removes a given taint, and then Karpenter adds it back and it remains indefinitely. I'm not sure this is even a worry; I'm not entirely sure how all of this works.
For Cilium, I see this Helm value. But we don't set this; we use the default of `false`. I also checked the deployed configmap and it's definitely not set. Curious.

EDIT: Sorry, wrong setting. It's been a long day. This is the relevant setting, so I guess this defaults to `true`.
Anyways, I'll try messing with the startup taints to see if there's any different behavior.
You shouldn't need to add tolerations for the cloudprovider or node lifecycle taints, but you would for the Cilium taint. This is actually the motivating example we use in our docs for the `startupTaints` field. If this taint is being added without the startup taint in the NodePool, that likely explains the behavior.
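As a sketch of that fix, the Cilium readiness taint can be declared as a startup taint on the NodePool. This mirrors the Cilium example in Karpenter's own docs; the `value` and `effect` shown are what the `node.cilium.io/agent-not-ready` taint commonly uses, but verify against what actually appears on your nodes.

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-pool
spec:
  template:
    spec:
      # Declaring the taint here tells Karpenter it will appear at
      # startup and later be removed, so the pending pod still counts
      # as schedulable against the in-flight node and no duplicate
      # nodeclaim gets created.
      startupTaints:
        - key: node.cilium.io/agent-not-ready
          value: "true"          # confirm against your cluster
          effect: NoExecute      # confirm against your cluster
```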
Yep! It looks like that was it. It's strange that all the docs talk as if you need to add these taints on your own, but it turns out Cilium does it for you -- for example, these Cilium docs. Well, glad it turned out to be something simple.
I wonder if there should be an improvement to the logs to indicate what's happening, because even with the level set to debug, it's not really explanatory.
Is there any interaction between `taints` and `startupTaints` in `NodePool`? What I'm observing now is that my static GPU taints are no longer added when I set `startupTaints`. This taint from my `NodePool` (see OP) no longer appears on nodes:
```yaml
taints:
  - key: nvidia.com/gpu
    value: "true"
    effect: NoSchedule
```
EDIT: Nevermind, there's something else going on unrelated to Karpenter.
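For reference, `taints` and `startupTaints` are sibling fields under `spec.template.spec` and aren't expected to interfere with each other. A minimal sketch of the two side by side (the Cilium taint values are illustrative and should be checked against the cluster):

```yaml
spec:
  template:
    spec:
      # Permanent taints: applied at launch and never removed.
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
      # Startup taints: applied at launch, expected to be removed
      # later by an external component (Cilium in this case).
      startupTaints:
        - key: node.cilium.io/agent-not-ready
          value: "true"
          effect: NoExecute
```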
Another issue I'm noticing is that altering the startup taints appears to cause Karpenter to disrupt all the nodes in the node pool. That seems like it might be unintended? I didn't change the `taints` key at all, just `startupTaints`. (EDIT: Looks like there's an issue for this already.)
I think my issue is resolved. The cause was the Cilium startup taint, which we didn't realize Cilium was adding. Adding it to our NodePool to match fixed the issue.
I'm not sure I want to close this issue necessarily, but I suppose I would be fine with it. If there were some improvements to the log messages when Karpenter runs into this behavior, that would be ideal. But otherwise, having this conversation in the issue history is fine too.
Description
Observed Behavior:
Deploying a GPU workload (a `Job` -- see config below) causes Karpenter to provision multiple nodeclaims/nodes for a single pending pod, and the extras are not cleaned up until the `Job` (step 3) completes.

It seems to provision multiple instances, and one "wins", receiving the workload, after which the duds get cleaned up. It's almost as if, when the workload hasn't scheduled onto the node within a few seconds, it wants to try again as if it thinks the node is bad. Yet it doesn't say anything like that in the logs, and it doesn't try to terminate anything until the `Job` is complete. Or else it's running its reconciliation loop in multiple "threads" or contexts that aren't communicating properly, duplicating the provisioning work.

I've also tried enabling DEBUG logging, but I don't see anything particularly useful.
Expected Behavior:
Only a single GPU nodeclaim and node gets provisioned.
Reproduction Steps (Please include YAML):
Workload:
Node pool:
Node class:
Versions:
Chart Version: 0.36.1
Platform: EKS
Kubernetes Version (`kubectl version`): v1.29.4-eks-036c24b
OS: Bottlerocket 1.20.1 (aws-k8s-1.29), Kernel version: 6.1.90
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment