aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Karpenter workload consolidation/defragmentation #1091

Closed · felix-zhe-huang closed this 2 years ago

felix-zhe-huang commented 2 years ago

Tell us about your request As a cluster admin, I want Karpenter to consolidate application workloads by moving pods onto fewer worker nodes and scaling the cluster down, so that I can improve the cluster's resource utilization.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? In an under-utilized cluster, application pods are spread across worker nodes that have excess resources. This waste can be reduced by carefully packing the pods onto a smaller number of right-sized worker nodes. The current version of Karpenter does not support rearranging pods to continuously improve cluster utilization. Workload consolidation is the important missing component needed to complete the cluster scaling lifecycle management loop.

This workload consolidation feature is nontrivial because of the following coupling problems.

The problems above are deeply coupled, so a solution to one affects the others. Together they form a variant of the bin-packing problem, which is NP-complete. A practical solution will implement a fast heuristic that exploits the special structure of the problem for specific use cases and user preferences. Thorough discussion with customers is therefore important.
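To make the bin-packing framing concrete, here is a toy instance worked through with the classic first-fit-decreasing heuristic (illustrative numbers only; this is not a Kubernetes manifest and not Karpenter's actual algorithm):

```yaml
# Toy bin-packing instance: pods are items sized by CPU request,
# nodes are 4-vCPU bins. Illustrative only.
pods: [1, 2, 3, 1, 2, 3]   # vCPU requests, in arrival order
nodeSize: 4

# First-fit in arrival order needs 4 nodes:
#   node1: [1, 2, 1]   node2: [3]   node3: [2]   node4: [3]
# First-fit *decreasing* (sort requests descending first) needs only 3:
#   node1: [3, 1]      node2: [3, 1]   node3: [2, 2]
```

Even with six pods, the packing order changes the node count; the real problem adds memory, affinity, disruption budgets, and instance-type pricing on top.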

Are you currently working around this issue? Karpenter currently scales down empty nodes automatically. However, it does not actively move pods around to create empty nodes.
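For reference, the empty-node scale-down mentioned above was configured per Provisioner; a minimal sketch using the v1alpha5-era API (field names may differ in newer releases):

```yaml
# Sketch: scale down only *empty* nodes (v1alpha5-era API).
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Delete a node this many seconds after its last non-daemonset pod exits.
  ttlSecondsAfterEmpty: 30
```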

Additional context The workload consolidation feature is currently in the design phase. We should gather input from customers about their objectives and preferences.


kahirokunn commented 1 year ago

@tzneal I am using Karpenter v0.18.1. The spot instance does not scale down when I enable consolidation. So I found this statement.

https://github.com/aws/karpenter/issues/1091#issuecomment-1208483405

In which version will this feature be released? Thx.
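For context, the consolidation the commenter has enabled was configured per Provisioner in the v1alpha5-era API; a minimal sketch (mutually exclusive with ttlSecondsAfterEmpty on the same Provisioner; verify field names against your release):

```yaml
# Sketch: opt a Provisioner in to consolidation (v1alpha5-era API).
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  consolidation:
    enabled: true
```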

universam1 commented 1 year ago

The constraint to on-demand also comes as a disappointment for us, as we have clusters running solely spot instances. Would @tzneal be willing to make it available via a feature flag? Thanks

FernandoMiguel commented 1 year ago

> @tzneal I am using Karpenter v0.18.1. The spot instance does not scale down when I enable consolidation. So I found this statement.
>
> #1091 (comment)
>
> In which version will this feature be released? Thx.

@kahirokunn you wish to replace spot nodes with cheaper ones that could actually be reclaimed sooner than the ones your workloads are currently on?

dekelev commented 1 year ago

@FernandoMiguel I'm not sure how complicated this is to implement, but there's an AWS page showing the "Frequency of interruption" per instance type, and for some types (e.g. m6i.large & m6i.xlarge) it is lower than 5%, which should count as a very low risk when consolidating a very large, mostly idle instance into cheaper instances like m6i.xlarge. BTW, I run Karpenter with a small list of instance types.
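A sketch of what such an allow-list looks like on a v1alpha5-era Provisioner (the instance types are the examples from the comment above, not a recommendation):

```yaml
# Sketch: restrict Karpenter to a hand-picked set of low-interruption
# spot instance types via node requirements (v1alpha5-era API).
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spot
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m6i.large", "m6i.xlarge"]
```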

FernandoMiguel commented 1 year ago

Not sure what workloads you run, but I tend to prefer my hosts not to change often; hence why deeper pools are preferred.

dekelev commented 1 year ago

I'm using Karpenter for non-sensitive workloads with a lot of peaks during the day, which sometimes creates huge servers. That suits me better than having many small ones, but an hour later, once the peak has passed, the cluster is left with huge servers that are mostly idle and usually need to be consolidated into smaller instances.

matti commented 1 year ago

@dekelev I don't run Karpenter, but I have a similar problem with "plain" EKS and (managed) node groups. I've solved it like this:

https://github.com/matti/k8s-nodeflow/blob/main/k8s-nodeflow.sh

This runs in the cluster and ensures that all machines are continuously drained within N seconds.

This requires proper configuration of PodDisruptionBudgets, and https://github.com/kubernetes-sigs/descheduler is also recommended to "consolidate" low-utilization nodes; a sketch of both follows.
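As a sketch of those two pieces (app name, labels, and thresholds are hypothetical placeholders): a PodDisruptionBudget so drains can never take down every replica, plus a descheduler HighNodeUtilization policy that evicts pods from under-utilized nodes so the scheduler can pack them onto fewer nodes (it pairs with the scheduler's MostAllocated scoring):

```yaml
# Sketch: keep at least one replica up while nodes are drained.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app          # hypothetical app
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: my-app
---
# Sketch: descheduler v1alpha1 policy; evict pods from nodes whose usage
# is below ALL thresholds so they can be rescheduled more compactly.
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "HighNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        thresholds:
          cpu: 20       # percent
          memory: 20
          pods: 20
```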

I think this OR something similar could work with Karpenter.

Btw my email is in my github profile if you want to have a call or something about this - my plan is to develop these things further at some point and having valid use cases would be helpful for me.

Vlaaaaaaad commented 1 year ago

+1, I'd love a flag to enable consolidation for Spot instances too.

Workloads vary wildly and I agree that by default this flag should be off, but it should be an option for the folks that need it. For example, I often see clusters with a bunch of 24xlarge nodes that have <1% utilization. We use node-problem-detector/descheduler to work around this, but Karpenter consolidating Spot nodes natively would be a much better solution.

universam1 commented 1 year ago

@FernandoMiguel Spot instances naturally go with workloads that tolerate interruptions; presumably no one runs spot when the workload is allergic to them!? We also have a mix of use cases: some where we dislike interruptions, and others with high spikes. Consider deployments with node anti-affinity: they are grouped by Karpenter at provisioning but never reconsidered later, which leaves those huge instances running.

However, I could picture a case where Karpenter tries to schedule a smaller node that turns out to be unavailable, falls back to a bigger instance, and thus runs into a loop!? Not sure if Karpenter does a preflight check in such a case?

tzneal commented 1 year ago

The original problem that forced us not to replace spot nodes with smaller spot nodes is that the straightforward approach eliminates the utility of the capacity-optimized-prioritized strategy we use for launching spot nodes. That strategy considers pool depth, and it leads to launching nodes that may be slightly bigger than necessary but are less likely to be reclaimed.

If consolidation did replace spot nodes, we would often launch a spot node, get one that's slightly too big due to pool depth and interruption rate, and then immediately replace it with a cheaper instance type that has a higher interruption rate. This turns our capacity-optimized-prioritized strategy into a Karpenter-operated lowest-price strategy, which isn't recommended for use with spot.
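For background, the tradeoff described here surfaces as the EC2 Fleet allocation strategy. A sketch of the distinction as a raw CreateFleet input (usable with `aws ec2 create-fleet --cli-input-yaml`; the launch template ID and instance types are placeholders):

```yaml
# Sketch: capacity-optimized-prioritized asks EC2 to weigh spot pool depth
# while honoring the Priority order below; "lowest-price" would instead
# chase the cheapest (often shallower, more interrupted) pool.
Type: instant
SpotOptions:
  AllocationStrategy: capacity-optimized-prioritized
LaunchTemplateConfigs:
  - LaunchTemplateSpecification:
      LaunchTemplateId: lt-0123456789abcdef0   # placeholder
      Version: "1"
    Overrides:
      - { InstanceType: m6i.large,  Priority: 1.0 }   # lower number = preferred
      - { InstanceType: m6i.xlarge, Priority: 2.0 }
TargetCapacitySpecification:
  TotalTargetCapacity: 1
  DefaultTargetCapacityType: spot
```

Replacing that strategy with a pure price sort is effectively what naive spot-to-spot consolidation would do, which is the concern raised above.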

I'll be locking this issue as it was for the original consolidation feature and it's difficult to track new feature requests on closed issues. Feel free to open a new one specifically for spot replacement for that discussion.