aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Reduction in total number of nodes after upgrading to 0.31.3 #5491

Open anshusingh64 opened 8 months ago

anshusingh64 commented 8 months ago

Description

Observed Behavior:

After upgrading the Karpenter chart from version 0.29.0 to 0.31.3, we saw a decrease in the total number of nodes provisioned by Karpenter. This persisted for around 6 days, affecting the workloads running on our cluster, until we rolled back to 0.29.0. Following the rollback, things returned to normal, reverting to the pre-upgrade state.

[Graph: total node count dropping after the upgrade and recovering after the rollback]

Some other unusual fluctuations were also observed during the incident, such as shifts in the mix of instance types being provisioned and a number of instances reported with no instance type name.

Expected Behavior: The total node count and the count of different instance types should have shown a similar trend even after the upgrade.

Versions:

jmdeal commented 8 months ago

To confirm, were the workloads running on the cluster consistent throughout the upgrade period? You say workloads were affected; is this because Karpenter wasn't provisioning enough capacity for the pods, or because of the change to a different instance type? When you say there were instance types with no name, how were you gathering those metrics? Does this represent NodeClaims that failed to launch an instance? There's a fair bit more information we need here to diagnose this, I think. Are you able to provide any of your Karpenter resources (provisioners, node templates, node claims, etc.) as well as logs from the upgrade event?

anshusingh64 commented 8 months ago

Hi @jmdeal

> were the workloads running on the cluster consistent throughout the upgrade period?

Yes, the workloads were consistent throughout the node reduction window. We can also see from the very first graph that the node count began to recover after the rollback.

> workloads were affected, is this because Karpenter wasn't provisioning enough capacity for the pods or because

The main cause seems to be Karpenter not provisioning enough capacity for the pods (based on the node count graph).

> When you say there were instance types with no name, how were you gathering those metrics?

While analysing the instance types with no name, we found that when an instance is deprovisioned within a few seconds of its launch, it shows up as N/A. So it seems that a significant number of instances were deprovisioned or consolidated shortly after launch during that timeframe.

> Does this represent NodeClaims that failed to launch an instance?

We're not using NodeClaims as of now.

> Are you able to provide any of your Karpenter resources (provisioners, node templates, node claims, etc.) as well as logs from the upgrade event?

Here is one of the provisioners:

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  annotations:
    karpenter.sh/provisioner-hash: 'xxxxxxxxxxxxxxx10'
  name: xxxx-provisioner
spec:
  consolidation:
    enabled: true
  kubeletConfiguration:
    containerRuntime: containerd
    imageGCHighThresholdPercent: 70
    imageGCLowThresholdPercent: 50
  labels:
    karpenter-provisioner-type: xxxxx
  providerRef:
    name: xxxx-provisioner-template
  requirements:
```

anshusingh64 commented 8 months ago

By the way, we've upgraded Karpenter to v0.31.3 again with consolidation disabled, and things seem good so far. Could you please verify whether there are any consolidation-related changes between v0.29.0 and v0.31.3 that might have triggered the issue?
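
For reference, a minimal sketch of what that change looks like against the v1alpha5 Provisioner API, reusing the redacted names from the resource shared above (not necessarily our exact manifest):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: xxxx-provisioner             # redacted name carried over from the example above
spec:
  consolidation:
    enabled: false                   # consolidation switched off for this test
  kubeletConfiguration:
    containerRuntime: containerd
    imageGCHighThresholdPercent: 70
    imageGCLowThresholdPercent: 50
  labels:
    karpenter-provisioner-type: xxxxx
  providerRef:
    name: xxxx-provisioner-template
  # requirements omitted here; unchanged from the provisioner shared earlier
```

With consolidation off, voluntary deprovisioning should mainly come from ttlSecondsAfterEmpty / ttlSecondsUntilExpired (neither is set here), which helps isolate consolidation as the likely cause.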

Also, while going through the release notes I came across this PR, which says we no longer wait for all pods to be scheduled before consolidating a node. However, I'm unsure what criteria it checks, or whether a violation is occurring somewhere that leads to consolidation even when it isn't required, affecting the workloads.