aws / karpenter-provider-aws

Karpenter is a Kubernetes Node Autoscaler built for flexibility, performance, and simplicity.
https://karpenter.sh
Apache License 2.0

Support concurrent drain for non-empty nodes #2510

Closed · mtcode closed this issue 1 year ago

mtcode commented 2 years ago

Tell us about your request

Currently, consolidation only drains empty nodes concurrently; non-empty nodes are drained sequentially. A significant improvement would be to support concurrent draining of non-empty nodes as well.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

Game servers are stateful and have persistent player connections. Terminating hosts is impactful and requires advance notice so that connections can be drained and players can reconnect to game servers hosted on other instances. This isn't just about game servers, though: it generalizes to any workload that cannot terminate immediately and therefore blocks scale-down of the node it runs on.

Connection draining is controlled using graceful termination periods, which are typically set somewhere between 15 minutes and multiple hours. When hosts are drained sequentially, the rate of scale down is artificially gated, even if many nodes in a cluster are mostly empty.
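For reference, that drain window lives on the workload itself as the pod-level terminationGracePeriodSeconds. A minimal sketch using the client-go types (the names and the 30-minute value are purely illustrative, not a recommendation):

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// gameServerPod builds a pod whose graceful termination window is long
// enough to drain player connections before the kubelet force-kills it.
func gameServerPod() *corev1.Pod {
	grace := int64(30 * 60) // 30 minutes, in seconds
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "game-server"},
		Spec: corev1.PodSpec{
			TerminationGracePeriodSeconds: &grace,
			Containers: []corev1.Container{
				{Name: "server", Image: "example/game-server:latest"},
			},
		},
	}
}
```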

In a cluster of 100 nodes whose pods could actually fit on 50 nodes, if each node takes up to 30 minutes to drain (gated by its slowest pod), sequential draining would take 25 hours to scale down. Consider production traffic shaped like a sine wave, with 12 hours of scaling up and 12 hours of scaling down each day. The cluster would never reach 50 nodes, because at one node per 30 minutes only 24 nodes can be removed during the 12-hour scale-down window; the cluster would theoretically bottom out at 76 nodes, making it only about half as efficient as it could be.
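Back-of-the-envelope, the same numbers work out as follows (this just restates the arithmetic above; the 12-hour window is a simplification of the sine-wave model):

```go
package main

import "fmt"

func main() {
	const (
		totalNodes   = 100
		targetNodes  = 50
		drainMinutes = 30.0 // worst-case drain time per node
		windowHours  = 12.0 // hours of falling traffic per day
	)

	// Sequential drain: one node at a time, 30 minutes each.
	sequentialHours := (totalNodes - targetNodes) * drainMinutes / 60.0
	fmt.Printf("sequential scale-down: %.0f hours\n", sequentialHours) // 25

	// Within the 12-hour window only 12h / 0.5h = 24 nodes can be removed,
	// so the fleet bottoms out at 100 - 24 = 76 nodes instead of 50.
	removable := windowHours * 60.0 / drainMinutes
	fmt.Printf("fleet floor with sequential drain: %.0f nodes\n", totalNodes-removable) // 76
}
```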

Draining multiple nodes in parallel allows clusters to be scaled down much faster, avoiding significant unnecessary costs to users.

Are you currently working around this issue?

We implemented a custom node autoscaling solution in 2017 to support this and other scenarios specific to game server workloads. Our current solution compacts cluster nodes as much as possible, removing candidate nodes whenever all of their workloads can fit onto other existing, non-candidate capacity.

Cluster autoscaler plans to support this feature in its 1.26 release. See https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/parallel_drain.md

Additional Context

No response

Attachments

No response

jonathan-innis commented 2 years ago

Linking CA issue around the same topic: https://github.com/kubernetes/autoscaler/issues/5079#event-7155704016

tzneal commented 2 years ago

Hey @mtcode, do you have some ideas for how you would want this to work specifically regarding any configuration or limitations?

mtcode commented 2 years ago

Hey @tzneal, the main requirement is that multiple nodes can be scaled down concurrently, but there are some interesting details in that process.

When there are multiple nodes that can be consolidated at once, they need to be ranked. When downscaling several at once, that ranking should be respected: keep removing the next lowest-impact node by relocating its pods to other candidate nodes (those farther down the list) in the cluster, and continue down the list until pods no longer fit on the remaining nodes. Implementation-wise, the interesting part is building that future state of pods on nodes in successive iterations, since beyond the first node you're not looking at the current cluster state but at a potential future state. PDBs must continue to be respected when consolidating multiple nodes at once.
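To illustrate what I mean by that iterative future-state simulation, here is a rough sketch; it is not a proposal for Karpenter's actual internals, and `fitsOnRemaining` / `violatesPDB` are hypothetical stand-ins for the real scheduling simulation and PDB checks:

```go
// Illustrative sketch only: walk the ranked consolidation candidates,
// simulating where their pods would land as earlier candidates are removed.
package main

import "fmt"

type Node struct {
	Name string
	Pods []string
}

func selectDrainable(ranked []Node, maxConcurrent int) []Node {
	var selected []Node
	// Start from the current cluster state; each accepted candidate updates
	// the *simulated* future state that the next candidate is checked against.
	remaining := append([]Node(nil), ranked...)

	for _, candidate := range ranked {
		if len(selected) >= maxConcurrent {
			break
		}
		future := without(remaining, candidate.Name)
		if !fitsOnRemaining(candidate.Pods, future) || violatesPDB(candidate.Pods) {
			// Stop once pods no longer fit on what would remain (or a PDB
			// would be violated), per the ranking order.
			break
		}
		selected = append(selected, candidate)
		remaining = future
	}
	return selected
}

func without(nodes []Node, name string) []Node {
	out := make([]Node, 0, len(nodes))
	for _, n := range nodes {
		if n.Name != name {
			out = append(out, n)
		}
	}
	return out
}

// Stubs: a real implementation would run a scheduling simulation and evaluate
// PodDisruptionBudgets here.
func fitsOnRemaining(pods []string, nodes []Node) bool { return len(nodes) > 0 }
func violatesPDB(pods []string) bool                   { return false }

func main() {
	ranked := []Node{{Name: "n1"}, {Name: "n2"}, {Name: "n3"}}
	fmt.Println(len(selectDrainable(ranked, 10)), "nodes selected to drain concurrently")
}
```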

My guess is that there may also be cases where users want to impose a limit on the maximum number of nodes that can be draining simultaneously, so being able to configure a cap may be useful. There may also be users who want to consolidate as much as possible, as quickly as possible (that's my use case). In that case, the cap could be set to a large number, but some testing would be needed to determine what a reasonable value is and what the performance characteristics are as that value increases.
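As a sketch of how such a cap could bound the work, a simple semaphore around the drain step would express the limit; `maxConcurrentDrains` is a hypothetical knob, not an existing Karpenter setting:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// drainAll caps in-flight node drains with a buffered-channel semaphore.
func drainAll(nodes []string, maxConcurrentDrains int, drain func(node string)) {
	sem := make(chan struct{}, maxConcurrentDrains)
	var wg sync.WaitGroup
	for _, node := range nodes {
		wg.Add(1)
		sem <- struct{}{} // blocks once the cap is reached
		go func(n string) {
			defer wg.Done()
			defer func() { <-sem }()
			drain(n) // cordon, evict, wait for graceful termination
		}(node)
	}
	wg.Wait()
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c", "node-d"}
	drainAll(nodes, 2, func(n string) {
		fmt.Println("draining", n)
		time.Sleep(100 * time.Millisecond) // stand-in for the real drain
	})
}
```

Setting the cap very high would approximate the "as fast as possible" behavior, while smaller values match the testing concern above.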

Does that help?