Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[BUG] AKS cluster v1.25 node drainage times out due to PDBs #3384

Open · prikesh-patel opened this issue 1 year ago

prikesh-patel commented 1 year ago

Describe the bug

Carrying out a cluster upgrade to v1.25, or a cluster destroy at v1.25, fails. This is because the node drain operation times out due to the pod disruption budgets (PDBs) owned by Azure (i.e. calico-typha, coredns-pdb, konnectivity-agent, metrics-server-pdb). To work around this, I have had to either manually delete the PDBs or manually delete the cluster nodes before running a cluster upgrade or destroy.

These steps have been carried out through Terraform.
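For reference, a minimal sketch of the manual PDB deletion step mentioned above, assuming the PDB names and namespaces reported in the drain errors below (note that, as discussed later in this thread, AKS recreates these PDBs shortly after they are deleted):

# Delete the Azure-owned PDBs immediately before the upgrade/destroy (they are recreated automatically).
kubectl delete pdb -n calico-system calico-typha
kubectl delete pdb -n kube-system coredns-pdb konnectivity-agent metrics-server-pdb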

To Reproduce

  1. Build cluster at AKS v1.24
  2. Upgrade cluster to v1.25
  3. Destroy v1.25 cluster

Expected behavior

The cluster upgrade/destroy should go through without any additional manual interventions.

Logs

Cluster Upgrades:

The following timeout occurred in Terraform after 1 hour:

╷
│ Error: waiting for update of Agent Pool (Subscription: "<subscription-id>"
│ Resource Group Name: "<resource-group-name>"
│ Resource Name: "<cluster-name>"
│ Agent Pool Name: "system1"): Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded
│ 
│   with module.aks.module.node_groups.module.system_node_groups["system1"].azurerm_kubernetes_cluster_node_pool.default,
│   on .terraform/modules/aks/modules/node-groups/modules/node-group/main.tf line 1, in resource "azurerm_kubernetes_cluster_node_pool" "default":
│    1: resource "azurerm_kubernetes_cluster_node_pool" "default" {
│ 
╵
╷
│ Error: waiting for update of Agent Pool (Subscription: "<subscription-id>"
│ Resource Group Name: "<resource-group-name>"
│ Resource Name: "<cluster-name>"
│ Agent Pool Name: "system3"): Future#WaitForCompletion: context has been cancelled: StatusCode=200 -- Original Error: context deadline exceeded
│ 
│   with module.aks.module.node_groups.module.system_node_groups["system3"].azurerm_kubernetes_cluster_node_pool.default,
│   on .terraform/modules/aks/modules/node-groups/modules/node-group/main.tf line 1, in resource "azurerm_kubernetes_cluster_node_pool" "default":
│    1: resource "azurerm_kubernetes_cluster_node_pool" "default" {
│ 
╵

The following status message was obtained from the activity log in Azure. It appears to be related to the calico-typha PDB.

"{\"status\":\"Failed\",\"error\":{\"code\":\"ResourceOperationFailure\",\"message\":\"The resource operation completed with terminal provisioning state 'Failed'.\",\"details\":[{\"code\":\"ReconcileVMSSAgentPoolFailed\",\"message\":\"Drain of aks-system1-...-vmss000000 did not complete: Too many req pod calico-typha-... on node aks-system1-...-vmss000000: calico-system/calico-typha-... blocked by pdb calico-typha with unready pods [calico-system/calico-typha-...]. See http://aka.ms/aks/debugdrainfailures\"}]}}"

Cluster Destroys:

The following status messages were obtained from the activity log in Azure. They appear to be related to the calico-typha and metrics-server-pdb PDBs.

╷
│ Error: waiting for update of Agent Pool (Subscription: "<subscription-id>"
│ Resource Group Name: "<resource-group-name>"
│ Resource Name: "<cluster-name>"
│ Agent Pool Name: "workers2"): Code="DeleteVMSSAgentPoolFailed" Message="Drain of aks-workers2-...-vmss000000 did not complete: Too many req pod calico-typha-... on node aks-workers2-...-vmss000000: calico-system/calico-typha-... blocked by pdb calico-typha with unready pods [calico-system/calico-typha-...calico-system/calico-typha-...]. See http://aka.ms/aks/debugdrainfailures"
│ 
│ 
╵
╷
│ Error: waiting for update of Agent Pool (Subscription: "<subscription-id>"
│ Resource Group Name: "<resource-group-name>"
│ Resource Name: "<cluster-name>"
│ Agent Pool Name: "workers1"): Code="DeleteVMSSAgentPoolFailed" Message="Drain of aks-workers1-...-vmss000000 did not complete: Too many req pod metrics-server-... on node aks-workers1-...-vmss000000: kube-system/metrics-server-... blocked by pdb metrics-server-pdb with unready pods [kube-system/metrics-server-...-system/metrics-server-...]. See http://aka.ms/aks/debugdrainfailures"
│ 
│ 
╵

However, on other attempts, I have received errors caused by the other Azure-owned PDBs listed above.

Additional context

This has been carried out through Terraform.

ghost commented 1 year ago

Action required from @Azure/aks-pm

ghost commented 1 year ago

Issue needing attention of @Azure/aks-leads

justindavies commented 1 year ago

@paulgmiller any ideas why this would be happening on our pods

mloskot commented 1 year ago

I'm not sure if this is related, but I have hit a very similar issue while trying this command (I have Windows nodes):

az aks update -g MyRG -n MyAKS --windows-admin-password *****

which failed with

(CreateVMSSAgentPoolFailed) Drain of akswin1000002 did not complete:
  Too many req pod mloskot-myapp2-776dd6cf49-hqfvb on node akswin1000002: applications/mloskot-myapp2-776dd6cf49-hqfvb blocked by pdb mloskot-myapp2 with unready pods [].
See http://aka.ms/aks/debugdrainfailures
Code: CreateVMSSAgentPoolFailed
Message: Drain of akswin1000002 did not complete: 
  Too many req pod mloskot-myapp2-776dd6cf49-hqfvb on node akswin1000002: applications/mloskot-myapp2-776dd6cf49-hqfvb blocked by pdb mloskot-myapp2 with unready pods []. 
See http://aka.ms/aks/debugdrainfailures
kubectl get events --all-namespaces -o wide | grep akswin
default  21m  Warning   Drain                   node/akswin1000002    upgrader                Eviction blocked by Too many Requests (usually a pdb): [mloskot-myapp1-97fb5fd59-mngxr]
default  26m  Warning   Drain                   node/akswin1000002    upgrader                Eviction blocked by Too many Requests (usually a pdb): [mloskot-myapp3-59d6c6748f-mfkrz]
default  16m  Warning   Drain                   node/akswin1000002    upgrader                Eviction blocked by Too many Requests (usually a pdb): [ mloskot-myapp2-776dd6cf49-hqfvb]
default  41m  Normal    NodeSchedulable         node/akswin1000002    kubelet, akswin1000002  Node akswin1000002 status is now: NodeSchedulable
default  12m  Normal    NodeHasSufficientMemory node/akswin1000003    kubelet, akswin1000003  Node akswin1000003 status is now: NodeHasSufficientPID
default  12m  Normal    Drain                   node/akswin1000003    upgrader                Draining node: [akswin1000003]
default  12m  Normal    RemovingNode            node/akswin1000003    node-controller         Node akswin1000003 event: Removing Node akswin1000003 from Controller

Then I found and learned that a lot of customers are not aware that resetting the password reimages and drains all the nodes. That is why I now suspect it may be related to this open issue.

mloskot commented 1 year ago

In addition to my report in https://github.com/Azure/AKS/issues/3384#issuecomment-1422318075, I have also just hit the same issue when applying a Terraform plan to downsize the Windows VM nodes of my AKS cluster (based on Kubernetes 1.22):

Error: waiting for the deletion of Agent Pool (Subscription: "********-****-****-****-***********")
Resource Group Name: "********"
Managed Cluster Name: "********"
Agent Pool Name: "win1"): Code="DeleteVMSSAgentPoolFailed" Message="Drain of akswin1000002 did not complete:
  Too many req pod mloskot-myapp1-5b4f75588d-5mfxc on node akswin1000002: 
    applications/mloskot-myapp1-5b4f75588d-5mfxc blocked by pdb mloskot-myapp1 with unready pods [].
    See http://aka.ms/aks/debugdrainfailures"

UPDATE 2023-03-10: On my AKS cluster, the draining of nodes was blocked by a PodDisruptionBudget with minimum pod availability set to 1, while I had not expected a PDB to be created for my pods at all. Details of the actual issue are here: https://github.com/gruntwork-io/helm-kubernetes-services/issues/156
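For anyone debugging a similar case, a quick way to check which PDB selects your pods and whether it leaves any allowed disruptions (the namespace and names below are just the ones from my error output above, used as an example):

# Compare the PDB's selector and allowed disruptions against the blocked pod's labels.
kubectl get pdb -n applications
kubectl describe pdb mloskot-myapp1 -n applications
kubectl get pods -n applications --show-labels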

ghost commented 1 year ago

Action required from @Azure/aks-pm

spkane commented 1 year ago

@miwithro @abubinski - Has there been any confirmation of this issue, and/or a plan to fix it?

We are running into the same issue trying to tear down an AKS 1.25.5 cluster via Terraform. The node groups time out, and it seems very likely that this is due to the 3 PDBs, which get re-created every single time you delete them:

NAMESPACE     NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
kube-system   coredns-pdb          1               N/A               0                     26s
kube-system   konnectivity-agent   1               N/A               0                     26s
kube-system   metrics-server-pdb   1               N/A               0                     26s

spkane commented 1 year ago

I also noticed that something was recreating these PDBs. What process/operator is doing that?

$ k get pdb -A
NAMESPACE     NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
kube-system   coredns-pdb          1               N/A               0                     26s
kube-system   konnectivity-agent   1               N/A               0                     26s
kube-system   metrics-server-pdb   1               N/A               0                     26s

$ k delete pdb -n kube-system coredns-pdb konnectivity-agent metrics-server-pdb
poddisruptionbudget.policy "coredns-pdb" deleted
poddisruptionbudget.policy "konnectivity-agent" deleted
poddisruptionbudget.policy "metrics-server-pdb" deleted

$ k delete pdb -n kube-system coredns-pdb konnectivity-agent metrics-server-pdb
Error from server (NotFound): poddisruptionbudgets.policy "coredns-pdb" not found
Error from server (NotFound): poddisruptionbudgets.policy "konnectivity-agent" not found
Error from server (NotFound): poddisruptionbudgets.policy "metrics-server-pdb" not found

$ k delete pdb -n kube-system coredns-pdb konnectivity-agent metrics-server-pdb
poddisruptionbudget.policy "coredns-pdb" deleted
poddisruptionbudget.policy "konnectivity-agent" deleted
poddisruptionbudget.policy "metrics-server-pdb" deleted

klibr007 commented 1 year ago

@spkane Any news? I'm running into the same problem: terraform destroy fails because the PDBs keep getting recreated after I delete them.

YichenTFlexciton commented 1 year ago

@spkane Hi! Is there any progress on this? We are about to upgrade our cluster to 1.25 and would very much appreciate any guidance on this!

spkane commented 1 year ago

@klibr007 @YichenTFlexciton I am just another person experiencing this issue. I know as much as you do about whether Microsoft has a fix in the works.

mkocaks commented 10 months ago

Hit this issue a couple of times upgrading from 1.24.10 to 1.25.6 - it is intermittent and a headache to put right... MSFT, any comments?

akishorekumar commented 9 months ago

Any workaround for this ?

mloskot commented 9 months ago

@akishorekumar

Any workaround for this ?

Yes, read the thread above and follow https://github.com/Azure/AKS/issues/3384#issuecomment-1560037743 to delete the PDBs manually.

nb-git commented 8 months ago

We have the same issue here upgrading an AKS cluster from v1.25.6 to v1.26.6 using Terraform. Deleting the PDBs manually made the upgrade work.

On the first try the upgrade failed and left the node in a failed state. To fix the node we wanted to use the scale command, but this also failed because the node could not be drained. It only succeeded after deleting the PDBs manually. That's not a convenient solution at all.

spkane commented 7 months ago

@miwithro @abubinski @justindavies @paulgmiller - Any word on this? From what I can tell, there has not been any real follow-up response from Microsoft.

ajoskowski commented 1 week ago

What about the 1-hour timeout? Does it mean that if I have many nodes running applications that take a long time to shut down, I will always hit the Terraform timeout? Imagine I have 10 nodes with 1 app replica per node, each replica taking 10 minutes to be recreated, and a PDB that allows only 1 pod to be recreated at a time. That means I need at least 100 minutes to upgrade the Kubernetes node pool, which is more than 60 minutes. Does this timeout refer to a specific node or to the whole node pool?