Azure / aks-engine

AKS Engine: legacy tool for Kubernetes on Azure (see status)
https://github.com/Azure/aks-engine
MIT License

Be able to configure either the number of IPs or the allocatedOutboundPorts #2377

Closed palmerabollo closed 4 years ago

palmerabollo commented 4 years ago

Describe the request

Be able to configure either the number of IPs or the allocatedOutboundPorts.

Explain why AKS Engine needs it

In a cluster with >63 nodes, different services started to fail, including the cluster-autoscaler (we’re using Kubernetes v1.13.10 and cluster-autoscaler:v1.13.6 in this environment), because of this limit:

W1128 14:49:18.684311 1 utils.go:457] Failed to remove node azure:///subscriptions/REDACTED/resourcegroups/REDACTED/providers/microsoft.compute/virtualmachinescalesets/REDACTED-vmss/virtualmachines/1298: Code="SpecifiedAllocatedOutboundPortsForOutboundRuleExceedsTotalNumberOfAvailablePorts" Message="Specified Allocated Outbound Ports 1008 for Outbound Rule /subscriptions/REDACTED/resourceGroups/REDACTED/providers/Microsoft.Network/loadBalancers/REDACTED/outboundRules/LBOutboundRule exceeds total number of available ports 920. Reduce allocated ports or increase number of IP addresses for outbound rule."

On a fresh cluster created with aks-engine, several elements affect outbound connectivity. If we're not mistaken, on the default LoadBalancer:

Describe the solution you'd like

We would like to be able to automate the creation of K8S clusters with aks-engine that support hundreds of nodes. We don't see any option in aks-engine to configure additional outbound IP addresses (which seems to be the best option) or to configure allocatedOutboundPorts.

Describe alternatives you've considered

As a workaround, we can reduce the allocatedOutboundPorts to 512 using az cli after the cluster is created.
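
For illustration, that workaround might look roughly like this with the az CLI (the rule name LBOutboundRule matches the one in the error message above; the resource group and load balancer names are placeholders):

```bash
# Hypothetical sketch: cap the per-VM SNAT allocation on the default outbound rule.
az network lb outbound-rule update \
  --resource-group "$resource_group" \
  --lb-name "$lb_name" \
  --name LBOutboundRule \
  --outbound-ports 512
```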

Additional context

CecileRobertMichon commented 4 years ago

@palmerabollo is this a feature request for AKS or AKS Engine?

What's your recommendation for deploying large clusters (e.g. 300 nodes) with private instances using AKS?

AKS questions should be opened at https://github.com/Azure/AKS/issues

palmerabollo commented 4 years ago

@CecileRobertMichon it's a request for aks-engine (sorry, I've edited the issue to make it more clear). Are you able to create clusters with more than 64 nodes in a VMSS with aks-engine? Thanks.

ritazh commented 4 years ago

To enable configuration of allocatedOutboundPorts, it can be added here: https://github.com/Azure/aks-engine/blob/907ecd9c7f6ba9653bd9b93c1bb86477bd42117a/pkg/engine/loadbalancers.go#L245-L256

serbrech commented 4 years ago

@gtracer for awareness

vijaygos commented 4 years ago

Seeing the exact same problem on an AKS Engine cluster with the following configuration:

- AKS Engine version: 0.42.0
- Kubernetes version: 1.15.3
- Current node count: 70

Symptoms: the exact same error in the autoscaler. The autoscaler is no longer functional and there are a lot of "Pending" workloads.

ericsuhong commented 4 years ago

We were testing with a high number of agent nodes today (75 agent nodes), and we were randomly getting this error when deploying the ARM template to create a cluster (the VMSS was failing to provision).

We were using aks-engine forked from this commit: 509a4db4a6d1db51115fcb22aea8df5474cd13af

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

popsikle commented 4 years ago

this is still an issue.

dmeytin commented 4 years ago

This is still an issue, and mostly affects deployments with more than 100 nodes. Since there is a recommendation to use multiple node pools each with up to 100 nodes, we can add multiple outbound IPs for each VMSS using the following command:

```
az vmss create --resource-group "$resource_group" --name $agent_pool_name \
  --image "$image_id" \
  --admin-user "$admin_username" \
  --ssh-key-value "$public_key" \
  --instance-count $agent_instance_count \
  --nsg "$nsg_name" \
  --public-ip-address "$ip_name" \
  --public-ip-address-dns-name "$dns_name" \
  --vnet-name "$vnet_name" \
  --subnet "$subnet_name" \
  --lb "$lb_name" \
  --backend-pool-name $agent_pool_name_be
```

As a result, we'll have a single LB with multiple outbound rules, one for each VMSS, each with a dedicated IP.

```
az network lb outbound-rule list --resource-group "$resource_group" --lb-name "$lb_name" -o table
```

```
AllocatedOutboundPorts    EnableTcpReset    IdleTimeoutInMinutes    Name                                  Protocol    ProvisioningState    ResourceGroup
0                         True              30                      LBOutboundRule_$agent_pool_name[0]    All         Succeeded            $resource_group
0                         True              30                      LBOutboundRule_$agent_pool_name[1]    All         Succeeded            $resource_group
```
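
For completeness, a per-pool outbound rule like the ones listed above could also be created or tuned explicitly. A minimal sketch with the az CLI, reusing the placeholder variables from the command above (the frontend IP configuration name `$ip_config_name` is an assumption, not something defined earlier):

```bash
# Hypothetical sketch: give this pool its own outbound rule with a dedicated frontend IP.
az network lb outbound-rule create \
  --resource-group "$resource_group" \
  --lb-name "$lb_name" \
  --name "LBOutboundRule_$agent_pool_name" \
  --address-pool "$agent_pool_name_be" \
  --frontend-ip-configs "$ip_config_name" \
  --protocol All \
  --idle-timeout 30 \
  --outbound-ports 0   # 0 lets the load balancer allocate SNAT ports dynamically
```
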
jackfrancis commented 4 years ago

@Michael-Sinz can you share with @dmeytin any gotchas w/ respect to ditching the 100 node per-VMSS limit? You are running clusters w/ many more than 100 nodes on a single VMSS, correct?

jackfrancis commented 4 years ago

Also @Michael-Sinz, in a general sense, have you seen the errors posted by @palmerabollo in your large clusters?

Michael-Sinz commented 4 years ago

@palmerabollo The outbound ports thing depends heavily on the number of public IP addresses you have in the cluster and the amount of outbound you do.

The standard load balancer NAT/SNAT design allocates the number of streams (and thus ports) in a "fixed" way per VM. In older aks-engine releases, the Azure API it used would default to "0", which means dynamic; in a middle period, the API version was revised so that the default became a fixed number of ports per VM, 1,024, which means you quickly get limited to 63 total VMs for each IP address (there are 64K ports in the TCP and UDP protocols, so dividing gets you to that number). A somewhat later change in aks-engine set it back to 0, which lets the load balancer dynamically adjust the port allocation per VM.
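
As a rough sketch of that arithmetic (illustrative only; the exact number of reserved ports depends on the load balancer configuration):

```bash
# ~64K TCP/UDP ports are available per outbound frontend IP, divided by the
# fixed default of 1,024 pre-allocated SNAT ports per VM.
echo $(( 65536 / 1024 ))   # => 64; with some ports reserved, roughly 63 usable VMs per IP
```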

See Azure documentation about SNAT port allocation

I am surprised that they pre-allocate the ports per VM, which makes it hard to have some nodes that do a lot of outbound traffic and some that do not. But that is how Azure's systems are designed (and likely for performance reasons).

Note that there is also a difference between TCP and UDP port allocation requirements.

The documentation explains this rather well.

PS - we use aks-engine to produce the ARM template but we adjust things within it, so when the change hit that made the default a fixed 1,024 ports per VM, we added some additional mutation of the ARM template to force the "0" setting (dynamic). Note that this does have some issues when, in a live cluster, you scale through the different port allocation tiers; we are looking into switching to a fixed value of 64 ports per node, which lets us have ~800 nodes. The reason is that scaling from 100 nodes to 520 nodes every day in a cluster crosses three port allocation tiers and can cause stream disruption, as described in the linked document.
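
Purely as an illustration (not the actual mutation tooling described above), forcing allocatedOutboundPorts back to 0 in a generated ARM template could look something like this with jq, assuming the load balancer resource sits at the top level of azuredeploy.json and follows the usual Microsoft.Network/loadBalancers schema:

```bash
# Hypothetical sketch: set every outbound rule on every load balancer in the
# template back to dynamic SNAT allocation (allocatedOutboundPorts = 0).
jq '(.resources[]
     | select(.type == "Microsoft.Network/loadBalancers")
     | .properties.outboundRules[]?.properties.allocatedOutboundPorts) = 0' \
  azuredeploy.json > azuredeploy.patched.json
```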

Michael-Sinz commented 4 years ago

PS - we run clusters that run hundreds of nodes in our speech and language AI products. Multiple clusters per region, across multiple regions. We scale the busy regions by hundreds of nodes every day (up and down as load goes up and down).

Michael-Sinz commented 4 years ago

@dmeytin - We generally run a small number of node pools due to scaling behaviors and the fact that we generally run separate pools for separate resource-specific needs (GPU-specific pools vs CPU-compute pools vs memory-intensive pools)

The total number of nodes in our clusters rarely goes over 600, and currently we limit VMSS scaling to 600. These clusters are generally reliable for our use, but we have had problems at times with underlying Azure resources (VMSS nodes that failed to delete due to some internal RNM/NRP issue deep under VMSS, for example, which then caused "not dead yet" nodes that were not fully alive either).

Across all of our clusters, we have been scaling up to a massive number of hours of speech recognition per day (over 100 years of speech audio per day!).

We do have a pain point with the load balancer registration process for VMSS nodes - that has been vastly improved in 1.15.11 (we are currently on 1.15 in production).

Another problem we have found is DNS - the default DNS search list that Kubernetes adds turns out to cost us significantly in latency (and we are a real-time service, so latency is critical). We are now deploying updates that make sure we always use fully qualified names and no search path, as the impact was massive (to the point of hitting the upstream DNS RPS limit and getting throttled from there too).
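
For illustration (generic hostnames, not our actual endpoints): inside a pod, an unqualified name is first expanded through the Kubernetes DNS search list (with the default ndots:5), generating several extra lookups, while a trailing dot makes the name fully qualified and bypasses the search path entirely:

```bash
# Walks the pod's search list first (e.g. storage.example.com.<ns>.svc.cluster.local,
# storage.example.com.svc.cluster.local, ...) before hitting the upstream resolver.
nslookup storage.example.com

# Fully qualified (trailing dot): goes straight to the upstream resolver.
nslookup storage.example.com.
```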

Michael-Sinz commented 4 years ago

PS - I really like @dmeytin's idea of an outbound IP/LB just for increasing SNAT scale, but that brings up issues for us, since we are now being asked to follow service-tag and bring-your-own-vnet solutions, which depend on knowing which IP addresses outbound communications happen on; that would require them to be outbound within a service tag allocation.

In general, we have worked hard not to have as much outbound traffic anyway, since there is a (performance) cost to going off-cluster, and we try to do things like in-cluster peer-caching of blob store elements, etc. So the fact that we regularly get to 32 SNAT ports per VM means that we are relatively solid in our unique stream management.

Michael-Sinz commented 4 years ago

I just realized I did not say how big a single VMSS is - they get into the 300-600 node range in a number of clusters/regions and sometimes actually hit our 600-node limit for a single VMSS. Not all of our many clusters have that high a load, since some regions are not that hot/busy. As an example of that swing - I just looked at one of our US clusters, and today it swung from an overnight low of around 100 nodes to a peak of 600 nodes during the peak usage time. (Dynamically scaled via the cluster autoscaler, using our configuration tricks, buffer pods, HPA, and our custom cluster-internal load balancer to detect load increases early.)