sushama-kothawale opened 3 weeks ago
@jonathan-innis @engedaam @njtran Can someone please check this? This is really impacting our AWS costs.
Similar behaviour is observed for EKS version 1.30 and Karpenter version 0.37.2.
@sushama-kothawale Can you share karpenter controller logs from when this happened?
Also wondering if you directly upgraded from v0.32.5 to 0.36.2? I would definitely not recommend doing that without going through every other minor version, so more like 0.32 -> 0.33 -> 0.34 -> 0.35 -> 0.36.
If you followed the correct path, my next question would be: why do you not expect to see the node churn that you are seeing?
I guess these findings from my point of view are a work in progress, so take them with a grain of salt and strictly for discussion purposes.
Karpenter v1.0.6 here. Karpenter is working "fine". It does all of its things ... but it is churning nodes in certain clusters and nodepool spec combinations ...
One finding so far: if you think you will achieve stability by setting a higher value for `consolidateAfter`, you may actually be fooled, ending up with more `UnderUtilized` nodes as the "binpacking" gets worse and worse over time. So, setting `consolidateAfter` to a low value, such as `consolidateAfter: 30s` or `1m`, will increase the chances of CORRECT consolidation, thus packing things tighter across ALL nodes. What you probably should end up with is only one or a few "tail" nodes that are packed worse than the others, and how badly that node gets packed, and thus how often THAT node keeps getting consolidated, may depend on your nodepool spec (is a smaller node even available to be picked? do you have cronjobs that keep creating a scale-up, followed by a scale-down?). My experience so far is that a shorter `consolidateAfter` helps because Karpenter can pack most nodes better, thus creating a more predictable "tail" of nodes... but I kind of wish it could be doing LESS flapping on the tail end of the nodes, so to speak.
TL;DR: a higher `consolidateAfter` value makes consolidation more difficult and more random.

@jigisha620 Firstly, thanks for looking into this quickly. To answer your questions: it's not a direct upgrade; sharing the Karpenter upgrade history here. Whenever we upgrade the k8s cluster, we upgrade Karpenter to the EKS-compatible version.
helm history karpenter -n karpenter
REVISION UPDATED STATUS CHART APP VERSION DESCRIPTION
1 Wed Nov 22 09:57:48 2023 superseded karpenter-v0.31.0 0.31.0 Install complete
2 Tue Jan 30 10:57:11 2024 superseded karpenter-v0.31.3 0.31.3 Upgrade complete
3 Fri Feb 2 00:47:37 2024 superseded karpenter-v0.32.5 0.32.5 Upgrade complete
4 Sun Aug 4 21:13:37 2024 superseded karpenter-0.35.4 0.35.4 Upgrade complete
5 Sun Aug 4 21:27:40 2024 superseded karpenter-0.36.2 0.36.2 Upgrade complete
6 Sun Nov 10 01:44:43 2024 failed karpenter-0.37.2 0.37.2 Upgrade "karpenter" failed: timed out waiting for the condition
7 Sun Nov 10 02:00:52 2024 failed karpenter-0.37.2 0.37.2 Upgrade "karpenter" failed: timed out waiting for the condition
8 Sun Nov 10 02:13:22 2024 deployed karpenter-0.37.2 0.37.2 Upgrade complete
In the last 24 hours, we've observed significant node churn, with over 30 nodes being replaced in our ~80-node cluster. There have been no new deployments or changes in resource requirements in production, so it’s unclear why nodes are cycling at this rate.
As I'm new to Karpenter and still familiarizing myself with this setup, I’d appreciate assistance in investigating this issue. High node churn leads to new nodes being launched frequently, which in turn increases costs for AWS services—for instance, each new node pulls images from ECR, resulting in higher ECR expenses, as well as additional networking costs.
Any guidance or recommendations on how to stabilize node behavior and reduce unnecessary churn would be valuable.
Attaching the Karpenter logs here; there are 2 pods running for Karpenter. 1st pod's logs:
`{"level":"INFO","time":"2024-11-10T02:12:07.674Z","logger":"controller","message":"webhook disabled","commit":"6e9d95f"}
{"level":"INFO","time":"2024-11-10T02:12:07.674Z","logger":"controller.controller-runtime.metrics","message":"Starting metrics server","commit":"6e9d95f"}
{"level":"INFO","time":"2024-11-10T02:12:07.675Z","logger":"controller","message":"starting server","commit":"6e9d95f","name":"health probe","addr":"[::]:8081"}
{"level":"INFO","time":"2024-11-10T02:12:07.675Z","logger":"controller.controller-runtime.metrics","message":"Serving metrics server","commit":"6e9d95f","bindAddress":":8000","secure":false}
{"level":"INFO","time":"2024-11-10T02:12:08.576Z","logger":"controller","message":"attempting to acquire leader lease karpenter/karpenter-leader-election...","commit":"6e9d95f"}`
Attaching the other one karpenter-pod2.txt
@sushama-kothawale One specific possible reason for high churn is if you're using the latest AL2023 AMI, which was released very recently (and rolled out to various regions/zones): https://github.com/awslabs/amazon-eks-ami/releases/tag/v20241109 . This was the case for my clusters today, at least. Note that the EKS/kubelet version is the same, but typically the version will be listed in the node's 'System Info > OS Image'. EDIT: it's not the same as the release tag; it's Amazon Linux 2023.6.20241031 now.
EDIT2: You should be seeing 'drifted' as the reason in your Karpenter metrics.
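One quick way to see whether nodes picked up a new OS image (a generic kubectl check, not specific to any setup):

# Compare the OS image and creation time reported by each node; a new AL2023
# release showing up only on the newest nodes points at AMI drift.
kubectl get nodes -o custom-columns=NAME:.metadata.name,OSIMAGE:.status.nodeInfo.osImage,CREATED:.metadata.creationTimestamp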
@jortkoopmans In our setup it's quite stable, as we control the AMI selectors in the EC2NodeClass config, so AMIs are updated periodically/manually. Node churn due to AMI change/drift is not the case for us.
We see a lot more scaling as well, which did not start when the upgrade finished but with the start of a new week, which is strange. Maybe due to cron?
@jigisha620 Could you please review this? I’ve shared the logs and all relevant information above. Let me know if you need any more details.
@sushama-kothawale Have you found any solution/workaround? We are facing a similar issue.
How we worked around it: allowing Karpenter to scale down only for one minute every 30 minutes. Stable enough that we enabled Karpenter on all envs again.
For folks who are new to budgets: the config below allows scaling down 1 node at a time for one minute at the top of the hour, then blocks scale-down for 29 minutes (by allowing 0 nodes to be scaled down), then allows one node at a time for 1 minute again at half past, then blocks it again for 29 minutes.
This way it works fine, but that should not be the end of the improvements here. We consider the Karpenter upgrade, together with this budget tuning, done for now and are waiting for the next release to iterate on it.
- nodes: "1"
schedule: "0 * * * *"
duration: 1m0s
- nodes: "0"
schedule: "1 * * * *"
duration: 29m0s
- nodes: "1"
schedule: "30 * * * *"
duration: 1m0s
- nodes: "0"
schedule: "31 * * * *"
duration: 29m0s
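For context, this is roughly where those budget entries sit in a full NodePool spec (a sketch with placeholder names, assuming the v1 API; the consolidateAfter value is only illustrative):

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default                      # placeholder name
spec:
  template:
    spec:
      nodeClassRef:                  # placeholder EC2NodeClass reference
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m             # illustrative, not a recommendation
    budgets:
      - nodes: "1"
        schedule: "0 * * * *"
        duration: 1m0s
      - nodes: "0"
        schedule: "1 * * * *"
        duration: 29m0s
      # ...plus the matching "30 * * * *" / "31 * * * *" pair from above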
Thanks @thomaspeitz for sharing this. I have applied the below disruption budgets in the nodepool, as we have to delete/consolidate nodes only in a particular window, i.e. on an hourly basis.
disruption:
  budgets:
    - duration: 5m
      nodes: 10%
      schedule: '@hourly'
  consolidateAfter: 2m
Since I applied these changes I do not see a single disruption event, i.e. 'disrupting via consolidation delete'. It's been 5+ hours now. Can you please help identify if I am doing something wrong in the config? Karpenter version: karpenter-0.37.2.
Attaching the nodepool config here. nodepool-config-budget.txt
How can I check whether the configuration was picked up correctly so that consolidation works as expected?
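A generic way to check what the controller actually sees (just a sketch; the `karpenter` namespace and deployment name are assumptions based on the helm history above, and the nodepool name is a placeholder):

# 1. Show the NodePool as stored server-side, including any defaulted fields.
kubectl get nodepool default -o yaml
# 2. Grep the controller logs for disruption decisions; consolidation shows up
#    as messages like "disrupting via consolidation delete".
kubectl -n karpenter logs deploy/karpenter | grep -i disrupt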
`consolidationPolicy: WhenEmptyOrUnderutilized` is supported with v1 of Karpenter.
`consolidationPolicy: WhenUnderutilized | WhenEmpty` is supported with 0.37 of Karpenter.
The newest 0.37 release should probably accept both in the spec, but the code maybe not; otherwise downgrades/upgrades of the software would not work.
Maybe this is already the issue here.
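Side by side, the difference is just the policy value (a sketch; the rest of the disruption block is omitted):

# Karpenter 1.x (karpenter.sh/v1)
disruption:
  consolidationPolicy: WhenEmptyOrUnderutilized
# Karpenter 0.37.x (karpenter.sh/v1beta1)
disruption:
  consolidationPolicy: WhenUnderutilized   # or WhenEmpty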
Thanks @thomaspeitz for the reply! Seems like the same case for me. Karpenter version 0.37 supports the v1 API, so I updated the nodepool config with v1 as the apiVersion and applied the above disruption budgets, but it's not working as expected. Maybe it's a Karpenter version issue. I will try that now, hoping I will not run into issues during the upgrade, as I have already updated the nodepool and ec2nodeclass apiVersion to v1.
Description
Observed Behavior:
After Karpenter was upgraded to version 0.36.2, we observed too many node replacements, i.e. the consolidation rate is very high. Due to this, new nodes are spinning up very fast in our production environment. Please refer to the command below: `kubectl get nodes --output=json | jq '[.items[] | select(.metadata.creationTimestamp | fromdateiso8601 > (now - 86400))] | length'
30 ` This means that within the last 24 hours 30 nodes were newly spun up, and as a result all the workloads were shifted to these new nodes. We are seeing a very high cost spike due to this. As new nodes are spun up, every application pulls its ECR images again on these nodes, resulting in $3-4K of extra cost.
Expected Behavior: The consolidation rate should be normal; up to version 0.32.5 we had not observed this issue.
Reproduction Steps (Please include YAML): Attaching the Karpenter deployment YAML and one of the nodepool configs. loki-nodepool.txt karpenter-deployment.txt
Versions:
Kubernetes Version (`kubectl version`): 1.28