Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[Question] Upgrading AKS Cluster to 1.31.1 (Preview) broke the cluster #4607

Open sanket-t-shah opened 1 week ago

sanket-t-shah commented 1 week ago

Describe scenario: I had a cluster running on 1.30.5 and initiated a cluster upgrade to 1.31.1 (Preview). After this, the cluster moved to a failed state. I tried adding more node pools; they are added successfully, but the Ready Nodes count is zero. I also tried scaling the existing node pools and stopping and starting them. That left the node pools visible in the Azure Portal but with zero active nodes.

Question: How can I recover the cluster and get my services back online? I'd prefer not to re-create the AKS cluster, as it has a lot of important deployments running.

PixelRobots commented 1 week ago

Hello,

Have you tried to reconcile the cluster? You can use the following command to do that if you have not yet:

az aks update -g MyResourceGroup -n MyManagedCluster
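To tell whether the reconcile changed anything, the cluster's provisioning state can also be read from the CLI. This is a minimal sketch, not part of the original reply; the function name `cluster_state` is hypothetical, and the resource group and cluster names are the same placeholders used in the command above.

```shell
# Sketch: print the provisioning state of an AKS cluster (e.g. Succeeded, Failed).
# cluster_state is a hypothetical helper name; it only wraps `az aks show`.
cluster_state() {
  rg="$1"
  cluster="$2"
  az aks show \
    --resource-group "$rg" \
    --name "$cluster" \
    --query provisioningState \
    --output tsv
}

# Example call (substitute real names):
# cluster_state MyResourceGroup MyManagedCluster
```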

If that has not resolved the issue, you can check the diagnose and solve section under the AKS blade in the portal.

If it is still down I would suggest you get a support ticket opened for this one.

Thanks, Richard

sanket-t-shah commented 1 week ago

I tried all of that, but nothing helped. I'll open a support ticket in that case.

Thanks for the quick reply though.

lareeth commented 1 week ago

You should be able to add nodes using 1.30.5, which will be compatible with the 1.31 API, until this issue is resolved.
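A node pool pinned to the older version could be added roughly as follows. This is a sketch under assumptions: the function name `add_pinned_nodepool`, the pool name `sys1305`, and the resource group and cluster names are placeholders, and `--mode System` is chosen only because the affected pools in this thread were system pools.

```shell
# Sketch: add a node pool that stays on the previous Kubernetes version
# while the control plane is already on 1.31.1.
# add_pinned_nodepool is a hypothetical helper name around `az aks nodepool add`.
add_pinned_nodepool() {
  rg="$1"
  cluster="$2"
  pool="$3"      # lowercase alphanumeric, e.g. sys1305
  version="$4"   # e.g. 1.30.5
  az aks nodepool add \
    --resource-group "$rg" \
    --cluster-name "$cluster" \
    --name "$pool" \
    --kubernetes-version "$version" \
    --mode System
}

# Example call (substitute real names):
# add_pinned_nodepool MyResourceGroup MyManagedCluster sys1305 1.30.5
```

Use `--mode User` instead if the new pool should only run application workloads.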

sanket-t-shah commented 1 week ago

@lareeth - This worked for me. Thanks for the solution. 👍

So this seems to be a bug on Microsoft's side, as AKS is not able to bring up node pools on the newer version.

gevraud commented 1 week ago

Same here.

I upgraded from 1.30.5 to 1.31.1, and the system node pool nodes don't reach the "Ready" state.

I got this error in the logs: IMDS query failed, exit code: 28...

I created another system node pool with version 1.30.5, and its nodes do reach the Ready state.
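If the exit code in that log line comes from curl (an assumption; the log is truncated), 28 is curl's code for an operation timeout, which would mean the node could not reach the Azure Instance Metadata Service in time. A rough way to test this by hand from a shell on an affected node is sketched below; the function name `check_imds` and the 5-second limit are illustrative choices, not taken from the AKS node scripts.

```shell
# Sketch: query the Azure Instance Metadata Service (IMDS) from a node
# and report a timeout explicitly. check_imds is a hypothetical helper name.
check_imds() {
  curl --silent --show-error --max-time 5 --noproxy "*" \
    --header "Metadata: true" \
    "http://169.254.169.254/metadata/instance?api-version=2021-02-01"
  rc=$?
  if [ "$rc" -eq 28 ]; then
    # curl exit code 28 means the operation timed out
    echo "IMDS request timed out (curl exit code 28)"
  fi
  return "$rc"
}

# Example call, run on the node itself:
# check_imds
```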

alexku7 commented 3 days ago

Same issue here. I had to open a ticket with Azure support.

So embarrassing, time after time with Azure.

k-koleda commented 3 days ago

We faced the same issue when we updated our development cluster from 1.30.5 to 1.31.1. As described here, I created new pools; they are in a ready state and the cluster is back to normal operation, but many operations now end in a failed state. As I understand it, I now have the control plane on 1.31.1 and the nodes on 1.30.5. Does Azure have any mechanism to roll back the update?
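The version split described here (control plane on 1.31.1, nodes on 1.30.5) can be confirmed from the CLI. The following is a minimal sketch; the function name `show_versions` is hypothetical and the resource group and cluster names are placeholders.

```shell
# Sketch: print the control plane version and the version of each node pool.
# show_versions is a hypothetical helper name around `az aks show` and
# `az aks nodepool list`.
show_versions() {
  rg="$1"
  cluster="$2"
  echo "control plane:"
  az aks show \
    --resource-group "$rg" \
    --name "$cluster" \
    --query kubernetesVersion \
    --output tsv
  echo "node pools:"
  az aks nodepool list \
    --resource-group "$rg" \
    --cluster-name "$cluster" \
    --query "[].{name:name, version:orchestratorVersion, state:provisioningState}" \
    --output table
}

# Example call (substitute real names):
# show_versions MyResourceGroup MyManagedCluster
```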