aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Reduction in total number of nodes after upgrading to 0.31.3 #5491

Open anshusingh64 opened 8 months ago

anshusingh64 commented 8 months ago

Description

Observed Behavior:

After upgrading the Karpenter chart from version 0.29.0 to 0.31.3, we saw a decrease in the total number of nodes provisioned by Karpenter. This persisted for around 6 days, affecting the workloads running on our cluster, until we rolled back to 0.29.0. Following the rollback, things returned to normal, reverting to the pre-upgrade state.

[Graph: total node count dropping after the upgrade and recovering after the rollback]

Some other unusual fluctuations were also observed during the incident, such as shifts in the mix of instance types being provisioned and a number of instances reported with no instance type name.

Expected Behavior: The total node count and the count of different instance types should have shown a similar trend even after the upgrade.

Versions:

jmdeal commented 8 months ago

To confirm, were the workloads running on the cluster consistent throughout the upgrade period? You say workloads were affected; is this because Karpenter wasn't provisioning enough capacity for the pods, or because of the change to a different instance type? When you say there were instance types with no name, how were you gathering those metrics? Does this represent NodeClaims that failed to launch an instance? There's a fair bit more information we need here to diagnose this, I think. Are you able to provide any of your Karpenter resources (provisioners, node templates, node claims, etc.) as well as logs from the upgrade event?

anshusingh64 commented 8 months ago

Hi @jmdeal

> were the workloads running on the cluster consistent throughout the upgrade period?

Yes, the workloads were consistent throughout the node reduction window. We can also see from the very first graph that the node count began to recover after the rollback.

> workloads were affected, is this because Karpenter wasn't provisioning enough capacity for the pods or because

The main cause seems to be Karpenter not provisioning enough capacity for the pods (based on the node count graph).

> When you say there were instance types with no name, how were you gathering those metrics?

While analysing the instance types with no name, we found that when an instance is deprovisioned within a few seconds of its launch, it shows up as N/A. So it seems that a significant number of instances were deprovisioned or consolidated shortly after launch during that timeframe.

> Does this represent NodeClaims that failed to launch an instance?

We're not using NodeClaims as of now.

> Are you able to provide any of your Karpenter resources (provisioners, node templates, node claims, etc.) as well as logs from the upgrade event?

Here is one of the provisioners:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  annotations:
    karpenter.sh/provisioner-hash: 'xxxxxxxxxxxxxxx10'
  name: xxxx-provisioner
spec:
  consolidation:
    enabled: true
  kubeletConfiguration:
    containerRuntime: containerd
    imageGCHighThresholdPercent: 70
    imageGCLowThresholdPercent: 50
  labels:
    karpenter-provisioner-type: xxxxx
  providerRef:
    name: xxxx-provisioner-template
  requirements:
```

anshusingh64 commented 8 months ago

By the way, we've upgraded Karpenter to v0.31.3 again with consolidation disabled, and things seem good so far. Could you please verify whether there are any consolidation-related changes between v0.29.0 and v0.31.3 that might have triggered the issue?
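
For reference, a minimal sketch of what that change looks like against the v1alpha5 Provisioner API, reusing the redacted names from the resource shared above (not necessarily our exact manifest):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: xxxx-provisioner             # redacted name carried over from the example above
spec:
  consolidation:
    enabled: false                   # consolidation switched off for this test
  kubeletConfiguration:
    containerRuntime: containerd
    imageGCHighThresholdPercent: 70
    imageGCLowThresholdPercent: 50
  labels:
    karpenter-provisioner-type: xxxxx
  providerRef:
    name: xxxx-provisioner-template
  # requirements omitted here; unchanged from the provisioner shared earlier
```

With consolidation off, voluntary deprovisioning should mainly come from ttlSecondsAfterEmpty / ttlSecondsUntilExpired (neither is set here), which helps isolate consolidation as the likely cause.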

Also, while going through the release notes I came across this PR, which says we no longer wait for all pods to be scheduled before consolidating a node. However, I'm unsure what criteria it checks, or whether a violation is occurring somewhere that leads to consolidation even when it isn't required, affecting the workloads.