Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[Question] How to recover from provision state failed at Node Pool #4674

Open jkroepke opened 6 days ago

jkroepke commented 6 days ago

Describe scenario

We are running multiple AKS clusters with weekly automatic patching enabled.

Some of our critical pods have a PodDisruptionBudget (PDB) configured.

Under some race conditions, the automatic update fails with a timeout, because a configured PDB prevents a node from being drained.
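To see which PDBs could block a drain at a given moment, something like the following may help. It is only a sketch: it assumes `kubectl` is already pointed at the affected cluster, and the function name `blocking_pdbs` is made up for illustration. It lists every PDB whose `status.disruptionsAllowed` is currently 0.

```shell
#!/usr/bin/env bash
# Sketch: print "namespace/name" for every PodDisruptionBudget that currently
# allows zero disruptions, i.e. one that would make a node drain wait.
# Assumes kubectl is configured for the affected cluster.
blocking_pdbs() {
  kubectl get pdb --all-namespaces --no-headers \
    -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed' \
  | awk '$3 == 0 { print $1 "/" $2 }'
}

# Usage (uncomment to run against the current kubectl context):
# blocking_pdbs
```

An empty result means no PDB is blocking evictions right now. It does not rule out that one was blocking at the time the upgrade timed out.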

Question

To recover from that situation, I have to manually restart the pod. That's fine.

However, the node pool remains in the Failed state, and the extra (surge) nodes remain as well.

How can I recover from that state? How can I re-trigger the automatic update?

One solution is to manually delete the old VMs from the VMSS, but that's tricky on node pools with a large number of nodes.

JoeyC-Dev commented 1 day ago

az aks update -n $aks -g $rG

No other arguments/parameters.
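As a rough sketch of how this could be scripted for several clusters: the snippet below first checks whether any node pool reports `provisioningState` as `Failed`, and only then runs the argument-less `az aks update` from above. The function names `failed_pools` and `reconcile_if_failed` are hypothetical, and the script assumes you are already logged in with `az login` and have the right subscription selected.

```shell
#!/usr/bin/env bash
# Sketch: reconcile an AKS cluster only if one of its node pools is in the
# Failed provisioning state. Function names are illustrative, not part of az.

# failed_pools RESOURCE_GROUP CLUSTER_NAME
# Prints the name of each node pool whose provisioningState is "Failed".
failed_pools() {
  az aks nodepool list \
    --resource-group "$1" \
    --cluster-name "$2" \
    --query "[?provisioningState=='Failed'].name" \
    --output tsv
}

# reconcile_if_failed RESOURCE_GROUP CLUSTER_NAME
# Runs `az aks update` with no extra parameters when a failed pool is found.
reconcile_if_failed() {
  local rg="$1" aks="$2" pools
  pools="$(failed_pools "$rg" "$aks")"
  if [ -z "$pools" ]; then
    echo "no failed node pools"
    return 0
  fi
  printf 'failed node pools: %s\n' "$pools"
  az aks update --resource-group "$rg" --name "$aks"
}

# Usage (uncomment and fill in your own values):
# reconcile_if_failed "$rG" "$aks"
```

If a PDB is still blocking evictions, the update may time out again, so it is probably worth clearing the blocking pod first.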

jkroepke commented 1 day ago

So there is no Portal experience for this, right? I can't trigger an update in the Portal if the cluster is already on the latest version.

JoeyC-Dev commented 1 day ago

> So there is no Portal experience for this, right? I can't trigger an update in the Portal if the cluster is already on the latest version.

According to the documentation, yes. Maybe someone else knows whether it is hidden somewhere.

https://learn.microsoft.com/troubleshoot/azure/azure-kubernetes/availability-performance/cluster-node-virtual-machine-failed-state

jkroepke commented 1 day ago

Thanks, I will try that at the next incident!