Closed: @palmerabollo closed this issue 4 years ago.
@palmerabollo is this a feature request for AKS or AKS Engine?
What's your recommendation for deploying large clusters (e.g. 300 nodes) with private instances using AKS?
AKS questions should be opened at https://github.com/Azure/AKS/issues
@CecileRobertMichon it's a request for aks-engine (sorry, I've edited the issue to make it more clear). Are you able to create clusters with more than 64 nodes in a VMSS with aks-engine? Thanks.
To enable configuration of allocatedOutboundPorts, it can be added here: https://github.com/Azure/aks-engine/blob/907ecd9c7f6ba9653bd9b93c1bb86477bd42117a/pkg/engine/loadbalancers.go#L245-L256
@gtracer for awareness
Seeing the exact same problem on an aks-engine cluster with the following configuration:
- aks-engine version: 0.42.0
- Kubernetes version: 1.15.3
- Current node count: 70
- Symptoms: exact same error in the autoscaler; the autoscaler is no longer functional and there are lots of "Pending" workloads.
We were testing with a high number of agent nodes today (75 agent nodes), and we were randomly getting this error when deploying the ARM template to create a cluster (the VMSS was failing to be provisioned).
We were using aks-engine forked from this commit: 509a4db4a6d1db51115fcb22aea8df5474cd13af
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
this is still an issue.
This is still an issue, and it mostly affects deployments with more than 100 nodes. Since the recommendation is to use multiple node pools, each with up to 100 nodes, we can add multiple outbound IPs, one per VMSS, using the following command:
```bash
az vmss create --resource-group "$resource_group" --name $agent_pool_name \
  --image "$image_id" \
  --admin-user "$admin_username" \
  --ssh-key-value "$public_key" \
  --instance-count $agent_instance_count \
  --nsg "$nsg_name" \
  --public-ip-address "$ip_name" \
  --public-ip-address-dns-name "$dns_name" \
  --vnet-name "$vnet_name" \
  --subnet "$subnet_name" \
  --lb "$lb_name" \
  --backend-pool-name $agent_pool_name_be
```
As a result, we'll have a single LB with multiple outbound rules: one per VMSS, each with a dedicated IP.
```bash
az network lb outbound-rule list --resource-group "$resource_group" --lb-name "$lb_name" -o table
```

| AllocatedOutboundPorts | EnableTcpReset | IdleTimeoutInMinutes | Name | Protocol | ProvisioningState | ResourceGroup |
|---|---|---|---|---|---|---|
| 0 | True | 30 | LBOutboundRule_$agent_pool_name[0] | All | Succeeded | $resource_group |
| 0 | True | 30 | LBOutboundRule_$agent_pool_name[1] | All | Succeeded | $resource_group |
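For reference, if you only need more SNAT capacity on an existing LB rather than a new VMSS, an extra outbound IP can also be attached by hand. This is a rough sketch only: the public IP, frontend, and rule names below are made up, and the exact flag names may differ slightly between az CLI versions.

```bash
# Create one more Standard public IP and attach it as an extra outbound frontend
# (names below are examples, not what aks-engine generates).
az network public-ip create \
  --resource-group "$resource_group" \
  --name "${lb_name}-outbound-ip-2" \
  --sku Standard

az network lb frontend-ip create \
  --resource-group "$resource_group" \
  --lb-name "$lb_name" \
  --name outboundFrontend2 \
  --public-ip-address "${lb_name}-outbound-ip-2"

# Route a backend pool's outbound traffic through the new frontend.
az network lb outbound-rule create \
  --resource-group "$resource_group" \
  --lb-name "$lb_name" \
  --name "LBOutboundRule_extra" \
  --frontend-ip-configs outboundFrontend2 \
  --address-pool "$agent_pool_name_be" \
  --protocol All \
  --outbound-ports 1024
```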
@Michael-Sinz can you share with @dmeytin any gotchas w/ respect to ditching the 100 node per-VMSS limit? You are running clusters w/ many more than 100 nodes on a single VMSS, correct?
Also @Michael-Sinz, in a general sense, have you seen the errors posted by @palmerabollo in your large clusters?
@palmerabollo The outbound ports thing depends heavily on the number of public IP addresses you have in the cluster and the amount of outbound you do.
The standard load balancer NAT/SNAT design allocates a "fixed" number of streams (and thus ports) per VM. In older aks-engine, the Azure API version it used defaulted to "0", which means dynamic allocation; but for a stretch in the middle, the API version was revised such that the default became a fixed number of ports per VM, 1,024, which means you quickly get limited to about 63 total VMs for each IP address (there are 64K ports in the TCP and UDP protocols, so dividing by 1,024 gets you to that number). A later change in aks-engine set it back to 0, which lets the load balancer dynamically adjust the port allocation per VM.
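Back-of-the-envelope, the per-VM allocation directly caps how many VMs a single outbound IP can serve (the 64,000 figure below is an approximation of the usable SNAT port range, not an exact Azure guarantee):

```bash
# Rough SNAT capacity per outbound frontend IP for various fixed allocations.
for ports_per_vm in 1024 512 256 128 64 32; do
  echo "allocatedOutboundPorts=${ports_per_vm} -> roughly $((64000 / ports_per_vm)) VMs per outbound IP"
done
```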
See Azure documentation about SNAT port allocation
I am surprised that they pre-allocate the ports per VM, which makes it hard to have some nodes that do a lot of outbound traffic and some that do not. But that is how Azure's systems are designed (likely for performance reasons).
Note that there is also a difference between TCP and UDP port allocation requirements.
The documentation explains this rather well.
PS - we use aks-engine to produce the ARM template, but we adjust things within it, so when the change hit that made the default a fixed 1,024 ports per VM, we added some additional mutation of the ARM template to force the "0" setting (dynamic). Note that this does have some issues when, in a live cluster, you scale through the different port allocation tiers, and we are looking into switching to a fixed value of 64 ports per node, which lets us have ~800 nodes. The reason is that scaling from 100 nodes to 520 nodes every day in a cluster crosses three port allocation tiers and can cause stream disruption, as described in the linked document.
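For anyone wanting to do something similar, here is a minimal sketch of that kind of template mutation, assuming the template was written by `aks-engine generate` to `_output/<dnsPrefix>/azuredeploy.json`; the path and the `jq` filter are illustrations of the idea, not the exact tooling we use:

```bash
# Force dynamic SNAT allocation (allocatedOutboundPorts = 0) on every outbound rule
# of every load balancer resource in the generated ARM template (path is an assumption).
template="_output/${dns_prefix}/azuredeploy.json"
jq '(.resources[]
     | select(.type == "Microsoft.Network/loadBalancers" and .properties.outboundRules != null)
     | .properties.outboundRules[].properties.allocatedOutboundPorts) = 0' \
  "$template" > "${template}.tmp" && mv "${template}.tmp" "$template"
```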
PS - we run clusters that run hundreds of nodes in our speech and language AI products. Multiple clusters per region, across multiple regions. We scale the busy regions by hundreds of nodes every day (up and down as load goes up and down).
@dmeytin - We generally run a small number of node pools due to scaling behaviors and the fact that we run separate pools for resource-specific needs (GPU-specific pools vs CPU-compute pools vs memory-intensive pools).
The total number of nodes in our clusters rarely goes over 600, and currently we limit VMSS scaling to 600. These clusters are generally reliable for our use, but we have had problems at times with underlying Azure resources (VMSS nodes that failed to delete due to some internal RNM/NRP issue deep down under VMSS, for example, which then caused "not dead yet" nodes that were "not fully alive" either).
Across all of our clusters, we have been scaling to process a massive number of hours of speech recognition per day (over 100 years of speech audio per day!).
We do have pain with the load balancer registration process for VMSS nodes - that has been vastly improved in 1.15.11 (we are currently on 1.15 in production).
Another problem we have found is DNS - the default DNS search list that Kubernetes adds turns out to cost us significantly in latency (and we are a real-time service, so latency is critical). We are now deploying updates that make sure we always use fully qualified names and no search path, as the impact was massive (to the point of hitting the upstream DNS RPS limit and getting throttled there too).
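To illustrate the search-list cost, a quick check from inside any pod shows the difference a trailing dot makes; the pod name and hostname below are placeholders, and the container image needs nslookup available:

```bash
kubectl exec some-pod -- cat /etc/resolv.conf       # note the search list and ndots:5
kubectl exec some-pod -- nslookup api.example.com   # unqualified: the resolver may walk every search domain first
kubectl exec some-pod -- nslookup api.example.com.  # trailing dot: treated as fully qualified, resolved directly
```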
PS - I really like @dmeytin's idea of an outbound IP/LB just for increasing SNAT scale, but that brings up issues for us since we are now being asked to follow service-tag and bring-your-own-VNet solutions, which depend on knowing the IP addresses that outbound communications happen on, so those outbound IPs would have to fall within a service tag allocation.
In general, we have worked hard to reduce outbound traffic anyway, since there is a performance cost to going off-cluster, and we try to do things like in-cluster peer-caching of blob store elements, etc. So the fact that we regularly end up with only 32 SNAT ports per VM means that we are relatively solid in our unique stream management.
I just realized I did not say how big a single VMSS is - they get into the 300-600 node range in a number of clusters/regions and sometimes actually hit our 600 node limit for a single VMSS. Not all of our many clusters have that high a load, since some regions are not that hot/busy. As an example of that swing, I just looked at one of our US clusters, and today it swung from an overnight low of around 100 nodes to a peak of 600 nodes during peak usage time. (Dynamically scaled via cluster autoscaler using our configuration tricks, buffer pods, HPA, and our custom cluster-internal load balancer to detect load increases early.)
Describe the request
Be able to configure the number of outbound IPs and/or the allocatedOutboundPorts.
Explain why AKS Engine needs it
In a cluster with >63 nodes, different services started to fail, including the cluster-autoscaler (we’re using Kubernetes v1.13.10 and cluster-autoscaler:v1.13.6 in this environment), because of this limit:
On a fresh cluster created with aks-engine, there are some elements that affect outbound connectivity. If we're not wrong, at the default LoadBalancer:
Describe the solution you'd like
We would like to be able to automate the creation of K8S clusters with aks-engine that support hundreds of nodes. We don’t see any options in aks-engine to configure more IP addresses (it seems to be the best option) or to configure the allocatedOutboundPorts.
Describe alternatives you've considered
As a workaround, we can reduce the allocatedOutboundPorts to 512 using az cli after the cluster is created.
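For example, something along these lines; the rule name variable is a placeholder (check `az network lb outbound-rule list` for the real name), and the generic `--set` update targets the same allocatedOutboundPorts property shown in that listing:

```bash
# Pin the SNAT allocation of the existing outbound rule to 512 ports per VM
# (placeholder names; verify the rule name with `az network lb outbound-rule list` first).
az network lb outbound-rule update \
  --resource-group "$resource_group" \
  --lb-name "$lb_name" \
  --name "$outbound_rule_name" \
  --set allocatedOutboundPorts=512
```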
Additional context