Azure Kubernetes Service (AKS)
https://azure.github.io/AKS/

Capacity errors returned on AKS create in westeurope #4185

Open Sennar19 opened 3 months ago

Sennar19 commented 3 months ago

Hi AKS Team,

This morning we started 2 AKS clusters, but one of them has this error:

(screenshot of the error)

and trying the update command:

az aks update -g <resource-group> -n <cluster-name> --verbose -y

we have this error:

(AKSCapacityError) Creating a new cluster or start cluster is unavailable at this time in region westeurope. To create a new cluster, we recommend using an alternate region. For a list of all the Azure regions, visit https://aka.ms/aks/regions. Code: AKSCapacityError Message: Creating a new cluster or start cluster is unavailable at this time in region westeurope. To create a new cluster, we recommend using an alternate region. For a list of all the Azure regions, visit https://aka.ms/aks/regions.

Is there some issue in the "westeurope" region? Can we apply some workaround? Note that we can't run AKS clusters in any other region.
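
For anyone triaging the same error: the message points at https://aka.ms/aks/regions for alternate regions. A rough CLI way to list the locations the Microsoft.ContainerService provider reports for managed clusters is sketched below; the JMESPath query is an assumption about the provider's output shape, not something from this thread.

# List regions reported for managed clusters by the resource provider (sketch)
az provider show --namespace Microsoft.ContainerService --query "resourceTypes[?resourceType=='managedClusters'].locations | [0]" -o json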

mperzov commented 3 months ago

Same issue for me as well. I spoke to the Azure AKS support team; no ETA for resolution at the moment.

ldecuba commented 3 months ago

Same issue here. I have a cluster in a failed state, and trying the az aks update -g <resource-group> -n <cluster-name> --verbose -y command returns the same error: (AKSCapacityError) Creating a new cluster or start cluster is unavailable at this time in region westeurope. To create a new cluster, we recommend using an alternate region. For a list of all the Azure regions, visit https://aka.ms/aks/regions. @Sennar19 @mperzov are you still having issues as well?

mperzov commented 3 months ago

@ldecuba yes, I'm still facing the issue and all I'm hearing is crickets

ldecuba commented 3 months ago

Looks like it's been resolved, @mperzov, as it's been running for a few minutes now and the nodes are ready but still scaling up!

mperzov commented 3 months ago

Not for me; my cluster is still in a Failed state and I'm unable to reconcile due to the (AKSCapacityError) "Creating or start a free tier cluster is unavailable at this time in region westeurope." error.

ldecuba commented 3 months ago

Oh well, I guess I am lucky then. The cluster is now back in a ready state! Good luck, mate.

Sennar19 commented 3 months ago

Also for me it looks like it's been resolved, but we stop our non-prod AKS clusters every evening and start them every morning, so I'll check tomorrow whether the problem has really been solved or not.
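
For context, the nightly stop / morning start described here is just the managed stop/start operation, roughly the following (resource group and cluster name are placeholders):

# Deallocate the cluster in the evening
az aks stop --resource-group <rg> --name <cluster>

# Bring it back in the morning
az aks start --resource-group <rg> --name <cluster>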

ldecuba commented 3 months ago

Great!

mperzov commented 3 months ago

FYI, my cluster is back after running reconcile.
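
For anyone landing here later, "running reconcile" means issuing an update with no settings changed, roughly (placeholders for the names):

# Reconcile the managed cluster to its current settings
az aks update --resource-group <rg> --name <cluster>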

Sennar19 commented 3 months ago

Also for me, it looks like it's been resolved

dpuertamartos commented 3 months ago

Looks like the problems are back...

mperzov commented 3 months ago

Well, that figures. In my case I stopped using the automation that stops the stage environment AKS cluster, to prevent this issue from happening again. As I can't rely on Azure Support and no acceptable solution has been provided by Azure engineers yet: if you have an AKS cluster which is currently running successfully, consider backing up the cluster and restoring it in another region. I know https://velero.io/ might provide a good open-source backup solution.
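
If you go the Velero route, the flow is roughly the sketch below. The plugin version, blob container, resource group, storage account, and credentials file are placeholders/assumptions you would set up per the Velero Azure plugin docs, not values from this thread.

# Install Velero into the source cluster with the Azure object-store plugin
velero install --provider azure --plugins velero/velero-plugin-for-microsoft-azure:v1.9.0 --bucket <blob-container> --secret-file ./credentials-velero --backup-location-config resourceGroup=<backup-rg>,storageAccount=<storage-account>

# Back up all namespaces from the westeurope cluster
velero backup create full-backup

# After installing Velero against the same storage location in the new-region cluster, restore
velero restore create --from-backup full-backup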

ldecuba commented 3 months ago

Thanks for the update. My AKS cluster is currently running without any problems, so I will look into a backup/restore to a different region. Greetings.

dpuertamartos commented 3 months ago

Update: this morning I could reconcile the cluster and it is back up and running. Hopefully it won't happen again soon.

TomkhaLoL commented 3 months ago

We're currently having the same problem with 2 clusters. The first one got into this state after starting the AKS manually last week. The second one has the same problem after manually stopping it. We can't do anything to reconcile our cluster. We tried changing the AKS tier, but we're locked out at the moment.

We tried reconciling our cluster:

az aks update --resource-group abc --name abc-aks
az resource update -n abc-aks -g abc --resource-type Microsoft.ContainerService/ManagedClusters

Both commands result in the following error message:

(InvalidParameter) agentPoolProfile.count was 0. It must be greater or equal to minCount:1 and less than or equal to maxCount:1000. If allowedMinCount was expected to be 0 but is 1: 1) The nodepool is a VMAS pool. 2) api Version is less than 2020-03-01. 3) The node is a system pool.
Code: InvalidParameter
Message: agentPoolProfile.count was 0. It must be greater or equal to minCount:1 and less than or equal to maxCount:1000. If allowedMinCount was expected to be 0 but is 1: 1) The nodepool is a VMAS pool. 2) api Version is less than 2020-03-01. 3) The node is a system pool.
Target: agentPoolProfile.count

Trying to scale the system node pool results in the following error message:

az aks nodepool scale --cluster-name abc-aks --name system --resource-group abc --node-count 1
(ControlPlaneNotReady) Control plane is not ready. Please reconcile your managed cluster abc-aks by cmd 'az aks update' and try again.
Code: ControlPlaneNotReady
Message: Control plane is not ready. Please reconcile your managed cluster abc-aks by cmd 'az aks update' and try again.

We already tried to reach out to Azure support, but the response is still pending. Is there anything we can do?

ldecuba commented 3 months ago

Yes, today I am also facing the same problem. We did a cluster upgrade on Friday and afterwards all was running fine; today it is back. Will propose to have this backed up and restored to a different region, as this is getting pretty annoying!

> On 11 Apr 2024, TomkhaLoL wrote: Thankfully I can share an update: 'az aks scale' seemed to fix this problem, but the trick was to do it outside of peak office hours in Europe. The command was executed around 10 pm and worked just fine. I think it's still very weird behaviour and would much prefer it if the AKS wouldn't go into a failed state when trying to start a free cluster during low-capacity times. Our clusters are usually started and stopped automatically using an Azure Automation Account and for some reason that never happened with those. It probably really is just that we automatically start/stop at very early and late hours of the day. Maybe it's a naive assumption or there are other technical difficulties that wouldn't allow this, but wouldn't it be much better if the start/stop AKS action would just be aborted without locking down the cluster and instead give you the opportunity to upgrade the AKS tier from free to standard?

JoeyC-Dev commented 3 months ago

Since this issue is essentially due to host capacity, I suppose Capacity Reservations will help mitigate the issue.

As mentioned in the documentation:

> As your workload demands change, you can associate existing capacity reservation groups to node pools to guarantee allocated capacity for your node pools.
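
For completeness, associating a node pool with a capacity reservation group looks roughly like the sketch below; the group/reservation names, VM size, and counts are placeholders, and the node pool must use the same VM size (and zones) as the reservation.

# Create a capacity reservation group plus a reservation for the node pool's VM size
az capacity reservation group create -g <rg> -n <crg-name>
az capacity reservation create -g <rg> -c <crg-name> -n <reservation-name> --sku Standard_D4s_v5 --capacity 3

# Add a node pool that draws from the reservation group
az aks nodepool add -g <rg> --cluster-name <cluster> -n reservedpool --node-vm-size Standard_D4s_v5 --node-count 3 --crg-id "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Compute/capacityReservationGroups/<crg-name>"

Note that this reserves node (data-plane) capacity only; it would not help if the managed control plane itself is the constrained part, as discussed below.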

ldecuba commented 3 months ago

Yes, good point!

KenSpur commented 3 months ago

Same issue at the moment while trying to provision a Standard-tier cluster:

{
   "code": "AKSCapacityError",
   "details": null,
   "message": "Creating a new cluster or start cluster is unavailable at this time in region westeurope. To create a new cluster, we recommend using an alternate region. For a list of all the Azure regions, visit [https://aka.ms/aks/regions.",](https://aka.ms/aks/regions.%22,)
   "subcode": ""
}

Changing the VM size from, for example, B4s_v2 to DS2_v2 did not have any effect. Don't Capacity Reservations only affect the VM scale sets used for the default or extra node pools? I suspect the AKSCapacityError might be about the Azure-managed part of AKS.
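
If you want to rule out a SKU-level restriction on the node side (as opposed to the managed control plane), the region's report for a given size can be checked with something like the sketch below; the size is just an example.

az vm list-skus --location westeurope --size Standard_DS2_v2 --all --output table

A non-empty Restrictions column would point at a subscription or zone restriction rather than the transient AKSCapacityError.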

JoeyC-Dev commented 3 months ago

> Same issue at the moment while trying to provision a Standard-tier cluster:
>
> {
>    "code": "AKSCapacityError",
>    "details": null,
>    "message": "Creating a new cluster or start cluster is unavailable at this time in region westeurope. To create a new cluster, we recommend using an alternate region. For a list of all the Azure regions, visit https://aka.ms/aks/regions.",
>    "subcode": ""
> }
>
> Changing the VM size from, for example, B4s_v2 to DS2_v2 did not have any effect. Don't Capacity Reservations only affect the VM scale sets used for the default or extra node pools? I suspect the AKSCapacityError might be about the Azure-managed part of AKS.

@KenSpur I tried B4s_v2; it simply does not allow creating a cluster with the B series. So I guess you have to forget about that series.
(screenshots)

Used configuration: (screenshots)

KenSpur commented 3 months ago

Thanks @JoeyC-Dev. B-series VMs might indeed not be the best to use because of how they work, though at this moment I am again able to spin up an AKS cluster using them.

It still seems to me that the issue lies in the provisioning of the Azure-managed control plane part of AKS, over which we as customers have no control.

piercsi commented 3 months ago

Ran into this issue today. I didn't actually realize there were different AKS tiers (I've been on AKS a long time). From the failed state, I fixed it with:

az aks update --resource-group <rg> --name <name> --tier standard

Cluster came back up after a few minutes wait.

Note this will incur a charge of about €70/month.
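
If you later want to confirm the tier, or drop back to the free tier once the region recovers, something along these lines should work (placeholders for the names):

# Show the cluster's current SKU/tier
az aks show --resource-group <rg> --name <name> --query sku

# Switch back to the free tier later, if desired
az aks update --resource-group <rg> --name <name> --tier free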

piercsi commented 2 months ago

OK so I managed to seriously bork my AKS cluster. I mucked about and found out. Thankfully this is not a production cluster.

All my node pools are in a failed state and my cluster is in a failed state. It seems there's nothing I can do so have raised a support ticket.

az aks nodepool show -g <rg> --cluster-name <name> -n <nodepool> -o table

Name    OsType    KubernetesVersion    VmSize            Count    MaxPods    ProvisioningState    Mode
------  --------  -------------------  ----------------  -------  ---------  -------------------  ------
<nodep> Linux     1.28.3               Standard_D4as_v5  3        110        Failed               System

Can't add a new node pool:

az aks nodepool add --resource-group <rg> --cluster-name <name> --name newnewpool --node-count 3 --node-vm-size Standard_D4as_v5

(ControlPlaneNotReady) Control plane is not ready. Please reconcile your managed cluster <name> by cmd 'az aks update' and try again.
Code: ControlPlaneNotReady
Message: Control plane is not ready. Please reconcile your managed cluster <name> by cmd 'az aks update' and try again.

I can't reconcile the current settings:

az aks update --resource-group <rg> --name <name>
no argument specified to update would you like to reconcile to current settings? (y/N): y

(InsufficientAgentPoolMaxPodsPerAgentPool) The AgentPoolProfile '<nodepool>' has an invalid total maxPods(maxPods per node * node count), the total maxPods(110 * 0) should be larger than 30. Please refer to aka.ms/aks-min-max-pod for more detail.
Code: InsufficientAgentPoolMaxPodsPerAgentPool
Message: The AgentPoolProfile '<nodepool>' has an invalid total maxPods(maxPods per node * node count), the total maxPods(110 * 0) should be larger than 30. Please refer to aka.ms/aks-min-max-pod for more detail.
Target: agentPoolProfile.kubernetesConfig.kubeletConfig.maxPods

aerott commented 1 month ago

Hello fellows,

I haven't been able to run my AKS cluster for over a week now. In my cluster, I use the AKS free tier for the control plane and several B-series-based nodes. Thankfully, it's not a production cluster, but I am using it for development workloads.

Is there any chance that the 'AKSCapacityError' will be resolved soon, or am I forced to switch to a paid tier?

Best regards. AErot

tmsvl commented 1 month ago

I am having the same issue, and since the cluster is in failed state, I cannot even upgrade to paid tier, so I'm in a catch-22 here.

I'm in touch with MS support on this; they confirmed there is a capacity issue and advised to "try and start the cluster outside office hours". It's 0:30 AM in West Europe right now, but I'm still facing the same issue.

So for now, I don't have a solution.

pepihub commented 1 month ago

> I am having the same issue, and since the cluster is in failed state, I cannot even upgrade to paid tier, so I'm in a catch-22 here.
>
> I'm in touch with MS support on this, they confirmed there is a capacity issue and they advised to "try and start the cluster outside office hours". It's 0:30 AM in West Europe right now but still facing the same issue.
>
> So for now, I don't have a solution.

Scale the virtual machine scale set of the AKS cluster to one on the VMSS scale menu and wait until it's running (it'll show as running but failed); from there it lets you change the tier to Standard.
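
A rough CLI equivalent of that portal workaround, for anyone who prefers the command line; the node resource group and VMSS name are placeholders you would look up first:

# Find the node resource group that holds the cluster's scale sets
az aks show -g <rg> -n <cluster> --query nodeResourceGroup -o tsv

# List the scale sets in that node resource group to get the VMSS name
az vmss list -g <node-rg> -o table

# Scale the system pool's VMSS back up to one instance
az vmss scale -g <node-rg> -n <vmss-name> --new-capacity 1

# Once the cluster is running again (even if still marked Failed), switch tiers
az aks update -g <rg> -n <cluster> --tier standard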

gerwim commented 1 month ago

We've opened a ticket with Azure support too (as we couldn't change to a paid tier). Your answer, @mjms-j, fixed it, so thank you very much!

While Azure support is a paid service too, this is their answer:

> We suggest you keep retrying the operation, and it might succeed when the region goes from Red/Orange back to Green. Especially retry at the end of the day, when other customers are stopping/deleting their clusters, as capacity will be reclaimed.
>
> You can try deploying in a different region (e.g. NorthEurope instead of WestEurope). Here, you should leverage the ASC Capacity Allocation Recommender tool that shows how likely a VM/VMSS is to succeed in a specific region. This should help determine a good region with enough capacity.
>
> Please refer to this link for the alternative options that are currently available: [Troubleshoot the AKSCapacityError error code - Azure | Microsoft Learn](https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/akscapacityerror)
>
> We deeply regret the inconvenience caused at the moment and we really appreciate your patience and trust in us.
>
> Ensuring capacity for our customers is a top priority for Microsoft and we are working around the clock to deliver on this. The increasing demand for Azure services is evidence of the popularity of Azure and emphasizes the need to scale up our infrastructure even more rapidly. With that in mind, we are expediting expansions and improving our resource deployment process to respond to this strong customer demand. In fact, we are adding a significant amount of compute infrastructure monthly. We have identified several improvements in how we load-balance under high resource usage and how to trigger the timely deployment of needed resources. Furthermore, we are increasing our capacity significantly and will continue to plan for strong customer demand across all of our regions. [This September 2021 blog post covers improvements towards delivering a resilient cloud supply chain](https://azure.microsoft.com/en-us/blog/advancing-reliability-through-a-resilient-cloud-supply-chain/).

aerott commented 1 month ago

I managed to allocate my AKS resources. For now, I won't deallocate my cluster out of fear that I won't be able to turn it back on again 😅 Or maybe Microsoft has already solved the resource allocation problem on their end? 😔

JoeyC-Dev commented 1 month ago

Latest TSG: https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/create-upgrade-delete/akscapacityerror

kipusoep commented 6 days ago

I ran into this today after deleting a cluster and trying to recreate it... #fml 🤬