Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

AKS ARM/VMSS Throttling/429 errors #1413

Closed: palma21 closed this 4 years ago

palma21 commented 4 years ago

What is occurring?

AKS engineering is tracking a series of failures and issues related to subscription-level throttling (error 429). This issue can cause disk attachment failures, scale up/down failures, and failures of any other API call to the underlying IaaS services.

When a Kubernetes cluster on Azure (AKS or not) scales up/down frequently or uses the cluster autoscaler (CA), those operations result in a large number of HTTP calls that in turn exceed the assigned subscription quota, leading to failures. These failures are triggered by the service-side quota limits put in place by the individual services and will impact load balancer operations, the controller manager, etc.

When a customer or cluster begins seeing throttling at the API level, the cluster may be unable to recover. This is because of the cloud provider / controller manager for Azure: while that actor is used to restore the cluster to a known working state, doing so requires API calls, and those API calls can and will be throttled as well, so manual repair by the AKS team is needed.

From the Azure Compute Resource Provider side, when the call rate limit overage is severe, it will stop accepting any requests and will block all client calls for the particular operation group until the call rate goes down significantly for some minimum defined “penalty window”. This penalty window is what’s returned in the Retry-After header. If a high number of concurrent callers rush to make the same calls again at a later time, they can cause the hard throttling to restart.
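To illustrate the point about the penalty window, here is a minimal sketch (not the AKS or cloud-provider implementation) of a client that honors the Retry-After header on a 429 instead of immediately retrying, with jitter added so that many throttled callers don't all rush back at the same moment:

```go
// Minimal sketch (not the AKS/cloud-provider implementation) of honoring
// Retry-After on HTTP 429 instead of hammering the API again immediately.
package main

import (
	"fmt"
	"math/rand"
	"net/http"
	"strconv"
	"time"
)

// doWithRetryAfter issues GETs against url, sleeping for the server-declared
// penalty window (plus jitter, so concurrent callers don't stampede back in).
func doWithRetryAfter(client *http.Client, url string, maxAttempts int) (*http.Response, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		resp, err := client.Get(url)
		if err != nil {
			return nil, err
		}
		if resp.StatusCode != http.StatusTooManyRequests {
			return resp, nil // success or a non-throttling error: let the caller decide
		}
		resp.Body.Close()

		wait := 30 * time.Second // fallback if no Retry-After header is present
		if ra := resp.Header.Get("Retry-After"); ra != "" {
			if secs, err := strconv.Atoi(ra); err == nil {
				wait = time.Duration(secs) * time.Second
			}
		}
		// Add jitter so many throttled callers don't all return at once.
		wait += time.Duration(rand.Intn(5000)) * time.Millisecond
		fmt.Printf("429 received, backing off for %s (attempt %d/%d)\n", wait, attempt, maxAttempts)
		time.Sleep(wait)
	}
	return nil, fmt.Errorf("still throttled after %d attempts", maxAttempts)
}

func main() {
	// Placeholder URL; a real ARM call would also need an Authorization header.
	resp, err := doWithRetryAfter(http.DefaultClient, "https://management.azure.com/", 5)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```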

This issue can occur with any aggressive call pattern against the Azure APIs, so it is possible customers have experienced this in the past with separate, discrete root causes unrelated to the current issue. Problems of this nature are a moving target: with call increases in particular Kubernetes patches, paired with the growth of more aggressive scale patterns, this error can reappear with different causes and become more pervasive.

What will be done?

AKS has dedicated a team to deal with this pervasive issue, both in tactical steps for mitigation and in accelerating the long-term permanent fixes and improvements. Several fixes have already been implemented upstream.

AKS-specific fixes have also been released to cherry-pick those changes, and additional mitigations have been applied to proxy the calls so that the deadlock situations can no longer occur and are controlled by clients.

We expect the full rollout of these fixes by the end of the week.

What can users do?

There is no customer action required to consume any of the above fixes. To mitigate in the meantime, please open a support ticket so the AKS team can manually repair your cluster.

BrainSlugs83 commented 4 years ago

Hi @palma21, that's great news! But I have a quick question: I'm seeing these kinds of issues when I provision clusters for the first time. Is it possible for the AKS team to migrate the fix into my subscription for any cluster I create? Or does it have to be manually repaired individually for each K8s cluster in my subscription that I wish to create?

djsly commented 4 years ago

We are also affected by this error when creating new clusters.

Is it expected that this patch is already in place?

jnoller commented 4 years ago

@djsly @BrainSlugs83 the reason you are seeing this on new cluster creation is that the quotas in question (HTTP throttling/limits) are tracked on a per-subscription basis. This means that if you have clusters A and B on the same subscription, a single cluster may starve the others.
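For anyone who wants to see where a subscription stands, one way is to make any ARM call and inspect the rate-limit headers on the response. This sketch assumes the standard ARM headers (x-ms-ratelimit-remaining-subscription-reads/-writes); the environment variables and api-version are placeholders:

```go
// Sketch: inspect ARM's per-subscription throttling headers to see how much
// quota the subscription has left. Header names assume standard ARM behavior;
// the token and subscription ID are supplied via placeholder env vars.
package main

import (
	"fmt"
	"net/http"
	"os"
)

func main() {
	sub := os.Getenv("AZURE_SUBSCRIPTION_ID") // placeholder
	token := os.Getenv("ARM_ACCESS_TOKEN")    // e.g. from `az account get-access-token`

	url := fmt.Sprintf("https://management.azure.com/subscriptions/%s?api-version=2016-06-01", sub)
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set("Authorization", "Bearer "+token)

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	fmt.Println("status:", resp.Status)
	for _, h := range []string{
		"x-ms-ratelimit-remaining-subscription-reads",
		"x-ms-ratelimit-remaining-subscription-writes",
		"Retry-After",
	} {
		fmt.Printf("%s: %s\n", h, resp.Header.Get(h))
	}
}
```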

coreywagehoft commented 4 years ago

We ran into multiple clusters in one subscription causing throttling a few months ago. We ended up moving to 1 subscription per cluster, and we hit throttling limits again. Glad to see this issue is recognized and Azure is working to mitigate it.

@jnoller Is Azure support able to reset our "penalty window" or reset our request counters completely? We have been dancing in and out of this penalty window. We left our auto-scaling off for 12 hours and we were instantly back in it when we turned it on.

It seems like it could take days or even a week for our call rate to drop low enough to not risk going back into this penalty window.

jnoller commented 4 years ago

@coreywagehoft No, support cannot reset the window ☹️. This is a widespread issue and our number one product priority, so the long-term fixes involve rationalizing the limits and call usage involved.

tombuildsstuff commented 4 years ago

👋 we're an awkward use case, so I think it's worth calling out here.

We have a resource in Terraform for provisioning AKS clusters, and to confirm that both the Terraform AKS resource and the Azure API behave in the manner we're expecting, we run acceptance tests for this every night (and whenever it's worked on).

This involves spinning up ~100 AKS clusters, of which we used to be able to provision ~30 concurrently; however, since switching to VMSS rather than Availability Sets, we've noticed that we're unable to provision more than 3 at a time.

Unfortunately, due to the nature of these tests, we're unable to use long-running clusters, Availability Sets, or a subscription per AKS cluster, meaning that the test time has gone up by around 10x, which is causing issues across the test suite: either we let all the AKS tests fail due to the parallelisation issues (not good), or we run them sequentially (which drags out the test run for all of our Azure tests).

Hope that helps give some additional insight into an awkward use-case 🙃
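As an aside, one middle ground between fully parallel and fully sequential runs is to cap how many clusters are provisioned at once. A hypothetical Go sketch with a buffered-channel semaphore (not the Terraform provider's actual test harness; the function names are placeholders):

```go
// Hypothetical sketch: cap the number of clusters being provisioned at once
// with a buffered-channel semaphore, as a middle ground between
// "fully parallel" and "fully sequential".
package main

import (
	"fmt"
	"sync"
	"time"
)

// provisionCluster stands in for whatever actually creates and tears down a
// test AKS cluster; here it just sleeps.
func provisionCluster(name string) {
	fmt.Println("provisioning", name)
	time.Sleep(2 * time.Second)
	fmt.Println("done", name)
}

func main() {
	const maxConcurrent = 3 // tune to stay under the subscription's call budget
	sem := make(chan struct{}, maxConcurrent)
	var wg sync.WaitGroup

	for i := 0; i < 10; i++ {
		name := fmt.Sprintf("test-cluster-%d", i)
		wg.Add(1)
		sem <- struct{}{} // blocks once maxConcurrent provisions are in flight
		go func(n string) {
			defer wg.Done()
			defer func() { <-sem }()
			provisionCluster(n)
		}(name)
	}
	wg.Wait()
}
```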

palma21 commented 4 years ago

@BrainSlugs83 @djsly that is correct, the throttling is subscription-wide when making calls to VMSS, so you'd experience this when creating new clusters as well. The patch will fix your whole affected subscriptions.

@tombuildsstuff while these patches would definitely help your case, I think I'd have to review your model more closely to understand whether they would solve it, because in the end, if you're doing a lot of GETs from Terraform to check the cluster status, these translate to VMSS GETs and you could be throttled there (we've seen cases where it was the clients causing this). Do you have a ticket open for this where we can discuss?

palma21 commented 4 years ago

We are currently rolling out the fixes to all impacted subscriptions with existing support tickets. As planned, the fixes will be rolled out worldwide by the end of the week.

tombuildsstuff commented 4 years ago

@palma21 the support ticket we've got open is 120010124000024 (cc @katbyte)

nninja94 commented 4 years ago

Hi, we are affected as well. Support ticket number is: 120012824002666 Thank you!!

palma21 commented 4 years ago

This is rolled out to all regions, and the above cases all have it (thanks for sharing the ticket numbers!)

UnixBoy1 commented 4 years ago

We are hitting this with a 4-VMSS cluster with only 3-5 workers on the subscription. I am not sure the fix actually works; it started around the time of the rollout. Is there a specific Kubernetes setting that is required? We are using our own Kubernetes install.

I was looking at the cloudProviderBackoff settings and rate limiting of API calls.
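For reference, these knobs live in the Kubernetes Azure cloud provider config (azure.json). A minimal sketch of the backoff and rate-limit fields, with illustrative values rather than recommended ones (field availability depends on the Kubernetes / cloud provider version):

```json
{
  "cloudProviderBackoff": true,
  "cloudProviderBackoffRetries": 6,
  "cloudProviderBackoffExponent": 1.5,
  "cloudProviderBackoffDuration": 5,
  "cloudProviderBackoffJitter": 1,
  "cloudProviderRateLimit": true,
  "cloudProviderRateLimitQPS": 3,
  "cloudProviderRateLimitBucket": 10,
  "cloudProviderRateLimitQPSWrite": 1,
  "cloudProviderRateLimitBucketWrite": 10
}
```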

tshafeev commented 4 years ago

@palma21 we still have the same issue with our AKS clusters that use persistent managed disks. For example, on an AKS cluster with 15 nodes and 30 disks, scaling down to 3-5 nodes can kill the cluster. West US 2 region, AKS versions 1.15.5 and 1.15.7.

UnixBoy1 commented 4 years ago

It seems to be a VMSS issue; the rate limiting has changed somehow. I managed to get our workers working, but I'm still seeing issues with disk claims and with the refresh on VMSS nodes that should mark a disk as not mounted on a VMSS node. Deleting PVCs gets stuck forever because the disk is still marked as used/mounted on a specific VMSS node. When I try to look at the VMSS node via the PVC link it shows an error, and I also get throttling errors on new claims and disk mounts (the VMSS refresh of mount status).

tshafeev commented 4 years ago

Maybe it could be related to the https://github.com/kubernetes/kubernetes/pull/85115 patch, which introduced a new disk lock mechanism?

UnixBoy1 commented 4 years ago

Not sure. The attached node link in the Azure portal also gives an error when I click it on the Azure disk page, something about a missing instance ID.

UnixBoy1 commented 4 years ago

I eventually managed to get the disks deleted by manually detaching them via the az CLI, but I still see issues from time to time with creating and deleting PVCs. I think the VMSS disk call limit is set very low... too low.
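For anyone else stuck in the same state, the manual detach can be done with something along the lines of `az vmss disk detach --resource-group <rg> --vmss-name <vmss> --instance-id <instance-id> --lun <lun>`; the values here are placeholders, and `az vmss disk detach --help` shows the exact parameters for your CLI version.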

MartaD commented 4 years ago

Any update on this issue?

palma21 commented 4 years ago

This issue was specific to the Cluster Autoscaler / AKS resource provider. Since the fixes above have all been merged, I'm closing it.

I see some unanswered cases above that appear related to disks. Those would not be related to this issue, but we're happy to look into them with the VMSS team. Do you have any support ticket numbers we can track?

CC @andyzhangx