anshusingh64 opened 8 months ago
To confirm, were the workloads running on the cluster consistent throughout the upgrade period? You say workloads were affected; is this because Karpenter wasn't provisioning enough capacity for the pods, or because of the change to a different instance type? When you say there were instance types with no name, how were you gathering those metrics? Does this represent NodeClaims that failed to launch an instance? There's a fair bit more information we need here to diagnose this, I think. Are you able to provide any of your Karpenter resources (provisioners, node templates, node claims, etc.) as well as logs from the upgrade event?
Hi @jmdeal
> were the workloads running on the cluster consistent throughout the upgrade period?

Yes, the workloads were consistent throughout the node-reduction window. The very first graph also shows that the node count began to recover after the rollback.
> workloads were affected, is this because Karpenter wasn't provisioning enough capacity for the pods or because

The main cause appears to be Karpenter not provisioning enough capacity for the pods (see the node count graph).
> When you say there were instance types with no name how were you gathering those metrics?

During our analysis of the instance types with no name, we found that when an instance is deprovisioned within a few seconds of its launch, its instance type shows up as N/A. It therefore seems that a significant number of instances were deprovisioned or consolidated shortly after launch during that timeframe.
> Does this represent NodeClaims that failed to launch an instance?

We're not using NodeClaims as of now.
> Are you able to provide any of your Karpenter resources (provisioners, node templates, node claims, etc) as well as logs from the upgrade event?

Here is one of the provisioners:
```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  annotations:
    karpenter.sh/provisioner-hash: 'xxxxxxxxxxxxxxx10'
  name: xxxx-provisioner
spec:
  consolidation:
    enabled: true
  kubeletConfiguration:
    containerRuntime: containerd
    imageGCHighThresholdPercent: 70
    imageGCLowThresholdPercent: 50
  labels:
    karpenter-provisioner-type: xxxxx
  providerRef:
    name: xxxx-provisioner-template
  requirements:
    - key: kubernetes.io/os
      operator: In
      values:
```
I don't have logs from the upgrade event, but the upgrade itself went completely fine; the actual issue started after the upgrade.
By the way, we've upgraded Karpenter to v0.31.3 again with consolidation disabled, and things seem good so far. Could you please check whether there are any consolidation-related changes between v0.29.0 and v0.31.3 that might have triggered the issue?
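For reference, disabling consolidation in our v1alpha5 Provisioner amounts to a spec like the following. This is only a sketch with the placeholder name from the example above; the `ttlSecondsAfterEmpty` value is an assumption, shown as the alternative cleanup mechanism for empty nodes when consolidation is off:

```yaml
# karpenter.sh/v1alpha5 Provisioner with consolidation turned off (sketch)
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: xxxx-provisioner     # placeholder name, as above
spec:
  consolidation:
    enabled: false           # stop Karpenter from consolidating running nodes
  ttlSecondsAfterEmpty: 30   # assumed value: still reap nodes once fully empty
```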
Also, while going through the release notes I came across this PR, which says Karpenter no longer waits for all pods to be scheduled before consolidating a node. However, I'm unsure which criteria it checks, or whether a violation is occurring somewhere, leading to consolidation even when it isn't required and affecting the workloads.
Description
Observed Behavior:
After upgrading the Karpenter chart from 0.29.0 to 0.31.3, we saw a decrease in the total number of nodes provisioned by Karpenter, which persisted for around 6 days until we rolled back to 0.29.0; this affected the workloads running on our cluster. Following the rollback, things returned to their pre-upgrade state.
Some other unusual fluctuations were also observed during the incident:
- A few new instance types appeared in that window
- The count of instance types with no name (N/A) increased drastically
- Counts of some instance types, like r7i.48xlarge and r6a.48xlarge, dropped drastically, while others, like r7a.48xlarge, increased
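As a side note, one way to rule out this kind of instance-type drift while debugging is to pin the allowed types in the Provisioner requirements. A sketch only; the instance-type list here is illustrative, taken from the types mentioned above:

```yaml
# Sketch: constrain the Provisioner to specific instance types so an
# upgrade can't shift workloads onto different hardware.
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["r6a.48xlarge", "r7i.48xlarge", "r7a.48xlarge"]
```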
Expected Behavior: The total node count and the counts of the different instance types should have shown a similar trend even after the upgrade.
Versions:
- Kubernetes Version (`kubectl version`): 1.25