aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Karpenter consolidation rate is very high after upgrade #7344

Open sushama-kothawale opened 3 weeks ago

sushama-kothawale commented 3 weeks ago

Description

Observed Behavior:

> helm history karpenter -n karpenter 
> REVISION  UPDATED                     STATUS      CHART               APP VERSION DESCRIPTION     
> 1         Wed Nov 22 09:57:48 2023    superseded  karpenter-v0.31.0   0.31.0      Install complete
> 2         Tue Jan 30 10:57:11 2024    superseded  karpenter-v0.31.3   0.31.3      Upgrade complete
> 3         Fri Feb  2 00:47:37 2024    superseded  karpenter-v0.32.5   0.32.5      Upgrade complete
> 4         Sun Aug  4 21:13:37 2024    superseded  karpenter-0.35.4    0.35.4      Upgrade complete
> 5         Sun Aug  4 21:27:40 2024    deployed    karpenter-0.36.2    0.36.2      Upgrade complete

After Karpenter was upgraded to version 0.36.2, we observed a large number of node restarts, i.e. the consolidation rate is very high. Because of this, new nodes are spinning up very quickly in our production environment. Please refer to the command below:

kubectl get nodes --output=json | jq '[.items[] | select(.metadata.creationTimestamp | fromdateiso8601 > (now - 86400))] | length'
30

This means 30 nodes were newly spun up within the last 24 hours, and all workloads were shifted to these new nodes. We are seeing a very high cost spike due to this: as the new nodes come up, every application pulls its ECR images again on those nodes, which adds an extra $3-4K of cost.

Expected Behavior: the consolidation rate should be normal; up to version 0.32.5 we had not observed this issue.

Reproduction Steps (Please include YAML): Attaching the Karpenter deployment YAML and one of the nodepool configs. loki-nodepool.txt karpenter-deployment.txt

Versions:

sushama-kothawale commented 2 weeks ago

@jonathan-innis @engedaam @njtran Can someone please check this? This is really impacting our AWS costs.

Similar behaviour is observed for EKS version 1.30 and Karpenter version 0.37.2.

jigisha620 commented 2 weeks ago

@sushama-kothawale Can you share karpenter controller logs from when this happened?

jigisha620 commented 2 weeks ago

Also wondering if you upgraded directly from v0.32.5 to 0.36.2? I would definitely not recommend doing that without going through every intermediate version, so more like 32 -> 33 -> 34 -> 35 -> 36. If you followed the correct path, my next question would be: why do you not expect to see the node churn that you are seeing?

frimik commented 2 weeks ago

These findings are, from my point of view, a work in progress, so take them with a grain of salt and strictly for discussion purposes.

Karpenter v1.0.6 here. Karpenter is working "fine". It does all of its things, but it is churning nodes with certain cluster and nodepool spec combinations.

One finding so far: if you think you will achieve stability by setting a higher value for consolidateAfter, you may actually be fooled in two ways:

  1. Consolidation becomes less efficient over time, because the chance of the "correct" nodes being consolidatable at any point in time is lower. More nodes slowly drift into low utilization, and pods coming and going will randomly keep a node from being consolidatable (it could be a HorizontalPodAutoscaler adding pods, CronJobs creating and removing pods, and so on).
  2. The chance of a high number of nodes being "almost consolidatable" is now larger, so you may end up in a scenario where you keep "randomly" churning through many of your nodes over time, because there are now several underutilized nodes as the binpacking gets worse and worse.

So, setting consolidateAfter to a low value such as 30s or 1m increases the chance of correct consolidation, packing things tighter across all nodes. What you should probably end up with is only one or a few "tail" nodes that are packed worse than the others; how badly that node gets packed, and therefore how often that node keeps getting consolidated, may depend on your nodepool spec (is a smaller node even available to be picked? do you have CronJobs that keep creating a scale-up followed by a scale-down?). My experience so far is that a shorter consolidateAfter helps because Karpenter can pack most nodes better, creating a more predictable "tail" of nodes, but I kind of wish it would do less flapping on the tail end of the nodes, so to speak.
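For reference, a minimal sketch of where consolidateAfter lives in a NodePool, assuming the karpenter.sh/v1 API; the name, requirements, and nodeClassRef are illustrative placeholders, not taken from any real cluster:

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: example                     # illustrative name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default               # illustrative; reference your own EC2NodeClass
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s           # short value, per the reasoning above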

TL;DR -

sushama-kothawale commented 2 weeks ago

@jigisha620 Firstly, thanks for looking into this quickly. To answer your question: it's not a direct upgrade; sharing the Karpenter upgrade history here. Whenever we upgrade the k8s cluster, we upgrade Karpenter to the corresponding EKS-compatible version.

helm history karpenter -n karpenter
REVISION    UPDATED                     STATUS      CHART               APP VERSION DESCRIPTION                                                    
1           Wed Nov 22 09:57:48 2023    superseded  karpenter-v0.31.0   0.31.0      Install complete                                               
2           Tue Jan 30 10:57:11 2024    superseded  karpenter-v0.31.3   0.31.3      Upgrade complete                                               
3           Fri Feb  2 00:47:37 2024    superseded  karpenter-v0.32.5   0.32.5      Upgrade complete                                               
4           Sun Aug  4 21:13:37 2024    superseded  karpenter-0.35.4    0.35.4      Upgrade complete                                               
5           Sun Aug  4 21:27:40 2024    superseded  karpenter-0.36.2    0.36.2      Upgrade complete                                               
6           Sun Nov 10 01:44:43 2024    failed      karpenter-0.37.2    0.37.2      Upgrade "karpenter" failed: timed out waiting for the condition
7           Sun Nov 10 02:00:52 2024    failed      karpenter-0.37.2    0.37.2      Upgrade "karpenter" failed: timed out waiting for the condition
8           Sun Nov 10 02:13:22 2024    deployed    karpenter-0.37.2    0.37.2      Upgrade complete      

In the last 24 hours, we've observed a significant node churn, with over 30 nodes being replaced in our ~80-node cluster. There have been no new deployments or changes in resource requirements in production, so it’s unclear why nodes are cycling at this rate.

As I'm new to Karpenter and still familiarizing myself with this setup, I’d appreciate assistance in investigating this issue. High node churn leads to new nodes being launched frequently, which in turn increases costs for AWS services—for instance, each new node pulls images from ECR, resulting in higher ECR expenses, as well as additional networking costs.

Any guidance or recommendations on how to stabilize node behavior and reduce unnecessary churn would be valuable.

Attaching the Karpenter logs here; there are 2 pods running for Karpenter. 1st pod's logs:

{"level":"INFO","time":"2024-11-10T02:12:07.674Z","logger":"controller","message":"webhook disabled","commit":"6e9d95f"}
{"level":"INFO","time":"2024-11-10T02:12:07.674Z","logger":"controller.controller-runtime.metrics","message":"Starting metrics server","commit":"6e9d95f"}
{"level":"INFO","time":"2024-11-10T02:12:07.675Z","logger":"controller","message":"starting server","commit":"6e9d95f","name":"health probe","addr":"[::]:8081"}
{"level":"INFO","time":"2024-11-10T02:12:07.675Z","logger":"controller.controller-runtime.metrics","message":"Serving metrics server","commit":"6e9d95f","bindAddress":":8000","secure":false}
{"level":"INFO","time":"2024-11-10T02:12:08.576Z","logger":"controller","message":"attempting to acquire leader lease karpenter/karpenter-leader-election...","commit":"6e9d95f"}

Attaching the other one: karpenter-pod2.txt

jortkoopmans commented 2 weeks ago

> In the last 24 hours, we've observed a significant node churn, with over 30 nodes being replaced in our ~80-node cluster. There have been no new deployments or changes in resource requirements in production, so it’s unclear why nodes are cycling at this rate.

@sushama-kothawale: One specific possible reason for high churn is if you're using the latest AL2023 AMI, which was released very recently (and rolled out to various regions/zones): https://github.com/awslabs/amazon-eks-ami/releases/tag/v20241109 . This was the case for my clusters today, at least. Note that the EKS/kubelet version is the same, but the AMI version will typically be listed in the node's 'System Info > OS Image'. (EDIT: it's not the same as the release tag; it's Amazon Linux 2023.6.20241031 now.)

EDIT2: You should see this reported with reason 'drifted' in your Karpenter metrics.
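If drift from new AMI releases does turn out to be the driver, one way to cap it (on the karpenter.sh/v1 API only; as far as I know, budgets by reason do not exist on 0.37/v1beta1) is a reason-scoped disruption budget. A minimal sketch of just that stanza, with illustrative values:

  disruption:
    budgets:
      # Block drift-driven disruption entirely; other reasons (Empty, Underutilized)
      # remain governed by the default budget or other entries.
      - nodes: "0"
        reasons: ["Drifted"]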

sushama-kothawale commented 2 weeks ago

@jortkoopmans In our setup this is quite stable, as we control the amiSelectorTerms in the EC2NodeClass config, so AMIs are updated periodically/manually. Node churn due to AMI change/drift is therefore not the case for us.
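For reference, the way we pin AMIs looks roughly like the sketch below, using the karpenter.k8s.aws/v1beta1 schema that 0.37 serves; the name, AMI ID, role, and discovery tags are illustrative placeholders, not our actual values:

apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default                          # illustrative name
spec:
  amiFamily: AL2023
  amiSelectorTerms:
    - id: ami-0123456789abcdef0          # explicit AMI ID; nodes only drift when this value is changed
  role: KarpenterNodeRole-example        # illustrative node IAM role
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster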

thomaspeitz commented 2 weeks ago

(screenshot attached)

We see a lot more scaling as well, which did not start when the upgrade finished but with the start of a new week, which is strange. Maybe due to cron?

sushama-kothawale commented 2 weeks ago

@jigisha620 Could you please review this? I’ve shared the logs and all relevant information above. Let me know if you need any more details.

shaikmoeed commented 1 week ago

@sushama-kothawale Have you found any solution/workaround? We are facing a similar issue.

thomaspeitz commented 1 week ago

How we worked around it: allowing Karpenter to scale down for only one minute every 30 minutes. Stable enough that we enabled Karpenter on all envs again.

For folks who are new to budgets: this allows scaling down one node at a time for one minute at the top of the hour, then blocks scale-down for 29 minutes (by allowing zero nodes to be disrupted), then allows one node at a time for one minute again at half past, then blocks it again for 29 minutes.

This way it works fine, but that should not be the end of the improvements here. We consider the Karpenter upgrade with this budget tuning done for now and will wait for the next release to iterate on it. The budgets we use look like this:

  disruption:
    budgets:
      - nodes: "1"
        schedule: "0 * * * *"
        duration: 1m0s
      - nodes: "0"
        schedule: "1 * * * *"
        duration: 29m0s
      - nodes: "1"
        schedule: "30 * * * *"
        duration: 1m0s
      - nodes: "0"
        schedule: "31 * * * *"
        duration: 29m0s

sushama-kothawale commented 4 days ago

Thanks @thomaspeitz for sharing this. I have applied the disruption budgets below in the nodepool, as we want to delete/consolidate nodes only in a particular window, i.e. on an hourly basis.

  disruption:
    budgets:
      - duration: 5m
        nodes: 10%
        schedule: '@hourly'
    consolidateAfter: 2m

Since I applied these changes, I have not seen a single disruption event, i.e. disrupting via consolidation delete. It's been 5+ hours now. Can you please help identify whether I am doing something wrong in the config? Karpenter version: karpenter-0.37.2.

Attaching the nodepool config here. nodepool-config-budget.txt

How can I check whether the configuration was picked up correctly so that consolidation works as expected?

thomaspeitz commented 4 days ago

consolidationPolicy: WhenEmptyOrUnderutilized is supported with v1 of Karpenter. consolidationPolicy: WhenUnderutilized | WhenEmpty is supported with 0.37 of Karpenter.

The newest 0.37 releases probably accept all of these values in the spec, but the code may not act on them; otherwise downgrades/upgrades of the software would not work.

Maybe this is already the issue here.
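To illustrate the difference, here is a sketch of just the disruption stanza under the two API versions; the values are illustrative, and I am going from memory on the v1beta1 validation (consolidateAfter being restricted to WhenEmpty there):

# karpenter.sh/v1beta1 NodePool (Karpenter 0.37):
  disruption:
    consolidationPolicy: WhenUnderutilized   # or WhenEmpty; consolidateAfter is only accepted with WhenEmpty
    budgets:
      - nodes: 10%
        schedule: '@hourly'
        duration: 5m

# karpenter.sh/v1 NodePool (Karpenter 1.x):
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 2m                     # honored together with WhenEmptyOrUnderutilized
    budgets:
      - nodes: 10%
        schedule: '@hourly'
        duration: 5m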

sushama-kothawale commented 3 days ago

Thanks @thomaspeitz for the reply! Seems like the same case for me. Karpenter version 0.37 supports the v1 API, so I updated the nodepool config with v1 as the apiVersion and applied the disruption budgets above, but it is not working as expected. Maybe it is a Karpenter version issue; I will try upgrading now, hoping I will not hit issues during the upgrade since I have already updated the nodepool and ec2nodeclass apiVersion to v1.