Azure / AKS

Azure Kubernetes Service

[BUG] Cluster Autoscaler Backoff State due to NSG conflicts #4224

Open · auswells opened this issue 1 month ago

auswells commented 1 month ago

Describe the bug
Cluster Autoscaler occasionally fails to scale up nodes within the expected amount of time due to suspected NSG conflicts. The VMSS activity log shows:

"{\"status\":\"Failed\",\"error\":{\"code\":\"ResourceOperationFailure\",\"message\":\"The resource operation completed with terminal provisioning state 'Failed'.\",\"details\":[{\"code\":\"RetryableError\",\"target\":\"15\",\"message\":\"A retryable error occurred.\"}]}}"

These errors have occurred a handful of times across multiple AKS clusters running in different regions (Kubernetes 1.27.9 at the time of writing).

We have logged a support case and were asked to open an issue here with the support team's findings, included below:

When multiple VMSS scale up at the same time and each submits a PutNetworkSecurityGroupOperation request, the first request succeeds, but the others submitted while the first is in progress are marked as failures, causing the autoscaler to back off.

Operation 1: SUCCEEDED
Start Time: 3/31/2024 9:42:45 PM
Complete Time: 3/31/2024 9:42:52 PM
Resource: nodepool1-vmss

Operation 2: FAILED
Start Time: 3/31/2024 9:42:47 PM
Complete Time: 3/31/2024 9:42:52 PM
Resource: nodepool2-vmss

To Reproduce
Once I get a repeatable example working, I will circle back here with updates.

Expected behavior
The PutNetworkSecurityGroupOperation requests should be retried automatically rather than marked as failures, or the backoff period in the autoscaler profile should be made configurable so that scaling does not have to wait 30 minutes to resume (https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler-overview#node-pool-in-backoff).


Additional context
In our clusters we deploy multiple instances of an application whose pods are scheduled across different node pools. We rely on autoscaling to bring nodes up and down based on whether our application is in a started, stopped, or scaling state. Each instance of the application also has its own LoadBalancer Service with different loadBalancerSourceRanges settings for accessing the application. When the load balancers are provisioned or the loadBalancerSourceRanges are updated, there are additional requests to the NSG that may also be interfering with the VMSS PutNetworkSecurityGroupOperation requests.
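
For illustration, here is a minimal sketch of the kind of per-instance Service described above; the names, selector, ports, and CIDR ranges are hypothetical placeholders, not values from our environment. On AKS, creating such a Service or changing its loadBalancerSourceRanges causes the cloud provider to update NSG rules, which is the activity we suspect is colliding with the VMSS operations.

```yaml
# Hypothetical per-instance Service; all names and CIDRs are placeholders.
apiVersion: v1
kind: Service
metadata:
  name: app-instance-1        # one LoadBalancer Service per application instance
  namespace: apps
spec:
  type: LoadBalancer
  selector:
    app: example-app
    instance: "1"
  ports:
    - port: 443
      targetPort: 8443
  # Changing this list triggers NSG rule updates on AKS, which can overlap in
  # time with the VMSS PutNetworkSecurityGroupOperation requests during scale-up.
  loadBalancerSourceRanges:
    - 203.0.113.0/24
    - 198.51.100.0/24
```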

jackfrancis commented 3 weeks ago

@tallaxes @gandhipr I'm not sure there's anything we could do in the cluster-autoscaler runtime itself to improve this outcome. Is the https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler-overview#node-pool-in-backoff behavior able to be disabled on a per-cluster basis?

tallaxes commented 3 weeks ago

The autoscaler uses exponential backoff, with an initial-node-group-backoff-duration of 5 minutes and a max-node-group-backoff-duration of 30 minutes. AKS currently does not expose these for customization, and the backoff cannot be disabled per cluster. Exposing these settings for customization would be one option. However, it is not clear to me why one would observe the 30-minute (maximum) backoff in this case; it seems one of the earlier retries should succeed.
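
For context, those are upstream cluster-autoscaler flags. On a self-managed cluster-autoscaler Deployment they would be tuned roughly as sketched below; this is illustrative only, since the AKS-managed autoscaler does not expose these flags, and the image tag and other arguments shown are placeholders.

```yaml
# Illustrative sketch only: the AKS-managed cluster-autoscaler does not expose these flags.
# This is how the backoff durations would be set on a self-managed deployment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.27.3  # placeholder tag
          command:
            - ./cluster-autoscaler
            - --cloud-provider=azure
            - --initial-node-group-backoff-duration=5m   # upstream default (start of exponential backoff)
            - --max-node-group-backoff-duration=30m      # upstream default (the 30-minute ceiling observed here)
```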

microsoft-github-policy-service[bot] commented 1 week ago

Hi there :wave: AKS bot here. This issue has been tagged as needing a support request so that the AKS support and engineering teams can look into this particular cluster/issue.

Follow the steps here to create a support ticket for Azure Kubernetes Service and the cluster discussed in this issue.

Please do mention this issue in the case description so our teams can coordinate to help you.

Thank you!

palma21 commented 1 week ago

This is not something we're able to reproduce, and it is likely caused by some specific infrastructure issue. Please open a support ticket so we can follow up, and feel free to share the case number here.