Autoscaler does not split scale-ups to match available IP addresses

MarekStrejczek-TomTom commented 3 years ago

What happened: Many pods were created by HPA, cluster-autoscaler kicked in but didn't add as many nodes as expected.

Autoscaler logs (from Log Analytics) say that it wants to add 30 new nodes: Final scale-up plan: [{aks-hugememory-35177555-vmss 5->35 (max: 39)}] We use kubenet networking and have 24 IP addresses in the subnet. Autoscaler didn't add any new nodes due to: Failed to increase capacity for scale set "aks-hugememory-35177555-vmss" to 30: Subnet sn-[autoscaler.log](https://github.com/Azure/AKS/files/7092704/autoscaler.log)-dev with address prefix 10.133.123.0/26 does not have enough capacity for 30 IP addresses.

What you expected to happen: Autoscaler adds 24 nodes (out of desired 30), then complains only about the outstanding 6.

How to reproduce it (as minimally and precisely as possible):

Make cluster-autoscaler want to add more nodes than available IP addresses in subnet (kubenet networking).

Anything else we need to know?: When I changed max nodepool size from 39 down to 29 then cluster-autoscaler created extra 24 nodes, as expected. So it seems cluster-autoscaler is able to split its scale-ups to match certain limitations (max nodepool size) but not other limitations (available IP address count).

I also think that this functionality worked better back in July, as I started experiencing this issue last week after I deleted my clusters and created new ones from scratch. Maybe a regression in one of recent AKS releases?

Environment:

Kubernetes version (use kubectl version): 1.21.2
Size of cluster (how many worker nodes are in the cluster?) 35 across 4 nodepools at the time I experienced this issue last time
General description of workloads in the cluster (e.g. HTTP microservices, Java app, Ruby on Rails, machine learning, etc.) Data crunching Go and C++ applications
Others: A snippet of cluster-autoscaler log attached autoscaler.log

ghost commented 3 years ago

Hi MarekStrejczek-TomTom, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such: 1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster. 2) Please abide by the AKS repo Guidelines and Code of Conduct. 3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics? 4) Make sure your subscribed to the AKS Release Notes to keep up to date with all that's new on AKS. 5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue. 6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

justindavies commented 3 years ago

Hi there, is it a /24, or you have only 24 IP addresses in the subnet available? Is this one pod per VM? The scale operation as you said is failing as the request in its entirety will fail to be desired state.

MarekStrejczek-TomTom commented 3 years ago

It's a /26 subnet (64 IP addresses out of which 5 are reserved by Azure) with 24 available addresses left. There are several pods per node, but we need only one IP address per node (kubenet networking, not CNI). I think a smart choice by autoscaler in this case would be to split the scale-up operation. My impression is that AKS used to split scale-ups as I don't recall having this problem before summer this year - but maybe I was simply lucky.

joaguas commented 3 years ago

Hi @MarekStrejczek-TomTom can't you just set CAS max nodes to stay within the the subnet limits? To make things more predictable/manageable plan ahead the numbers of ips necessary for internal services or have the LB take IPs from a different subnet for services.

I believe the scale up behavior has been the same - the scale up plan validates the target node count overall, doesn't scale/plan/check node per node or in batches

MarekStrejczek-TomTom commented 3 years ago

We use a few nodepools that host different workloads. They are all in the same subnet, but typically not all of them are equally busy. Therefore sum of max-sizes for all our nodepools is intentionally greater than subnet capacity. That would be ok as usually not all nodepools are fully busy at the same time. However with cluster-autoscaler not willing to partially execute scale-ups based on available IP addresses this becomes an issue. Of course allocating a larger subnet is a workaround (this is what we did to address the problem in short term - switched from /26 to /25 subnet) but as our workloads will get bigger (it's pilot phase currently) it feels like kicking the can down the road. At some point our corporate IT will refuse to assign larger address spaces and we will be bitten by this autoscaler limitation again.

Ability to partially execute a scale-up based on available resources instead of all-or-nothing approach sounds like a sensible thing to have. And it's not a total fantasy since cluster-autoscaler can already partially execute a scale-up when full scale-up would exceed max pool size.

Example 1: Situation: current nodepool size is 5, max nodepool size is 20. Cluster-autoscaler calculates it should add 30 nodes to fit all pods (for a total 35). What happens: since additional 30 nodes would exceed max nodepool size, cluster-autoscaler splits the scale-up and adds only 15 nodes. Sounds like a reasonable thing to do.

Example 2: Situation: current nodepool size is 5, number of IP addresses still available in the subnet is 15. Cluster-autoscaler calculates it should add 30 nodes to fit all pods (for a total 35). What happens: since additional 30 nodes would exceed subnet capacity, cluster-autoscaler does nothing. Feels inconsistent to me.

ghost commented 3 years ago

Action required from @Azure/aks-pm

marwanad commented 3 years ago

You got lucky with the previous cases because you likely had different number of pending pods in one bulk so CA managed to scale at least once for you.

And it's not a total fantasy since cluster-autoscaler can already partially execute a scale-up when full scale-up would exceed max pool size.

Max pool size is a different case because max pool size is a core CA construct while subnets are a cloudprovider issue. CA doesn't do partial scale ups today (apart from the max count case). The cloud provider interfaces don't have a way today of telling the CA core code that "I have executed a partial scale up, try and satisfy the remaining node count from another nodepool or just dont try all together because we are at subnet capacity". I am experimenting with a proposal for that that will start with splitting to respect SKU quotas.

MarekStrejczek-TomTom commented 3 years ago

You got lucky with the previous cases because you likely had different number of pending pods in one bulk so CA managed to scale at least once for you.

Agreed, I could've just be lucky.

Max pool size is a different case because max pool size is a core CA construct while subnets are a cloudprovider issue. CA doesn't do partial scale ups today (apart from the max count case). The cloud provider interfaces don't have a way today of telling the CA core code that "I have executed a partial scale up, try and satisfy the remaining node count from another nodepool or just dont try all together because we are at subnet capacity". I am experimenting with a proposal for that that will start with splitting to respect SKU quotas.

Understood - it's CA limitation, so it's a feature request that will affect:

K8S cluster-autoscaler (upstream) - to extend interface.
AKS - to utilize the extended interface, when it is available.

Do I understand it correctly that you are going to raise a request with upstream CA?

MarekStrejczek-TomTom commented 2 years ago

New finding - cluster-autoscaler also doesn't split scale-ups to match available per-subscription vCPU quota.

ghost commented 2 years ago

Action required from @Azure/aks-pm

ghost commented 2 years ago

Issue needing attention of @Azure/aks-leads

Azure / AKS

Autoscaler does not split scale-ups to match available IP addresses #2531