prikesh-patel opened this issue 1 year ago
Action required from @Azure/aks-pm
Issue needing attention of @Azure/aks-leads
@paulgmiller any ideas why this would be happening on our pods?
I'm not sure if this is related, but I have hit a very similar issue while trying this command (I have Windows nodes):
az aks update -g MyRG -n MyAKS --windows-admin-password *****
which failed with
(CreateVMSSAgentPoolFailed) Drain of akswin1000002 did not complete:
Too many req pod mloskot-myapp2-776dd6cf49-hqfvb on node akswin1000002: applications/mloskot-myapp2-776dd6cf49-hqfvb blocked by pdb mloskot-myapp2 with unready pods [].
See http://aka.ms/aks/debugdrainfailures
Code: CreateVMSSAgentPoolFailed
Message: Drain of akswin1000002 did not complete:
Too many req pod mloskot-myapp2-776dd6cf49-hqfvb on node akswin1000002: applications/mloskot-myapp2-776dd6cf49-hqfvb blocked by pdb mloskot-myapp2 with unready pods [].
See http://aka.ms/aks/debugdrainfailures
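The error names the blocking PDB and its namespace, so it can be inspected directly to see how many disruptions it currently allows (names below are the ones from my error above, adjust for your workload):
$ kubectl get pdb -n applications mloskot-myapp2
$ kubectl describe pdb -n applications mloskot-myapp2
The node events also show the blocked evictions: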
kubectl get events --all-namespaces -o wide | grep akswin
default 21m Warning Drain node/akswin1000002 upgrader Eviction blocked by Too many Requests (usually a pdb): [mloskot-myapp1-97fb5fd59-mngxr]
default 26m Warning Drain node/akswin1000002 upgrader Eviction blocked by Too many Requests (usually a pdb): [mloskot-myapp3-59d6c6748f-mfkrz]
default 16m Warning Drain node/akswin1000002 upgrader Eviction blocked by Too many Requests (usually a pdb): [ mloskot-myapp2-776dd6cf49-hqfvb]
default 41m Normal NodeSchedulable node/akswin1000002 kubelet, akswin1000002 Node akswin1000002 status is now: NodeSchedulable
default 12m Normal NodeHasSufficientMemory node/akswin1000003 kubelet, akswin1000003 Node akswin1000003 status is now: NodeHasSufficientPID
default 12m Normal Drain node/akswin1000003 upgrader Draining node: [akswin1000003]
default 12m Normal RemovingNode node/akswin1000003 node-controller Node akswin1000003 event: Removing Node akswin1000003 from Controller
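If the grep is too noisy, filtering by event reason and type should also work (a sketch; I believe the events API supports field selectors on reason and type):
$ kubectl get events -A --field-selector reason=Drain,type=Warning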
Then I found out that a lot of customers are not aware that resetting the password reimages and drains all the nodes. That is why I now suspect it may be related to this open issue.
In addition to my report in https://github.com/Azure/AKS/issues/3384#issuecomment-1422318075, I have also just hit the same issue when applying a Terraform plan to downsize the Windows VM nodes of my AKS cluster (based on Kubernetes 1.22):
Error: waiting for the deletion of Agent Pool (Subscription: "********-****-****-****-***********")
Resource Group Name: "********"
Managed Cluster Name: "********"
Agent Pool Name: "win1"): Code="DeleteVMSSAgentPoolFailed" Message="Drain of akswin1000002 did not complete:
Too many req pod mloskot-myapp1-5b4f75588d-5mfxc on node akswin1000002:
applications/mloskot-myapp1-5b4f75588d-5mfxc blocked by pdb mloskot-myapp1 with unready pods [].
See http://aka.ms/aks/debugdrainfailures"
UPDATE 2023-03-10: On my AKS, the draining of nodes was blocked by a PodDisruptionBudget with minimum pod availability set to 1, although I had not expected a PDB to be created for my pods. Details of the actual issue are here: https://github.com/gruntwork-io/helm-kubernetes-services/issues/156
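If, like me, you did not expect a PDB to exist for your pods, checking its labels, annotations and owner references usually reveals which chart or controller created it (a sketch; the namespace and PDB name are the ones from my error above, and Helm-managed objects typically carry a meta.helm.sh/release-name annotation):
$ kubectl get pdb -n applications mloskot-myapp2 -o jsonpath='{.metadata.labels}{"\n"}{.metadata.annotations}{"\n"}{.metadata.ownerReferences}{"\n"}'
In my case it came from the chart in the linked issue.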
Action required from @Azure/aks-pm
@miwithro @abubinski - Has this issue been confirmed? And/or is there a plan to fix it?
We are running into the same issue trying to tear down an AKS 1.25.5 cluster via Terraform. The node groups time out, and it seems very likely that this is due to the 3 PDBs, which get re-created every single time you delete them.
NAMESPACE NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
kube-system coredns-pdb 1 N/A 0 26s
kube-system konnectivity-agent 1 N/A 0 26s
kube-system metrics-server-pdb 1 N/A 0 26s
I also noticed that something was recreating these PDBs. What process/operator is doing that?
$ k get pdb -A
NAMESPACE NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
kube-system coredns-pdb 1 N/A 0 26s
kube-system konnectivity-agent 1 N/A 0 26s
kube-system metrics-server-pdb 1 N/A 0 26s
$ k delete pdb -n kube-system coredns-pdb konnectivity-agent metrics-server-pdb
poddisruptionbudget.policy "coredns-pdb" deleted
poddisruptionbudget.policy "konnectivity-agent" deleted
poddisruptionbudget.policy "metrics-server-pdb" deleted
$ k delete pdb -n kube-system coredns-pdb konnectivity-agent metrics-server-pdb
Error from server (NotFound): poddisruptionbudgets.policy "coredns-pdb" not found
Error from server (NotFound): poddisruptionbudgets.policy "konnectivity-agent" not found
Error from server (NotFound): poddisruptionbudgets.policy "metrics-server-pdb" not found
$ k delete pdb -n kube-system coredns-pdb konnectivity-agent metrics-server-pdb
poddisruptionbudget.policy "coredns-pdb" deleted
poddisruptionbudget.policy "konnectivity-agent" deleted
poddisruptionbudget.policy "metrics-server-pdb" deleted
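One way to see which controller keeps recreating these is to look at the manager recorded in managedFields on a freshly recreated PDB (a sketch; --show-managed-fields is needed because newer kubectl versions hide these fields by default):
$ kubectl get pdb -n kube-system coredns-pdb -o yaml --show-managed-fields | grep -B 1 -A 3 'manager:'
If the labels also include addonmanager.kubernetes.io/mode: Reconcile, the control-plane addon manager will keep reconciling them, which would explain the recreation.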
@spkane any news? I'm running into the same problem: can't terraform destroy, PDBs keep getting recreated after delete.
@spkane Hi! Is there any progress on this? We are about to upgrade our cluster to 1.25 and would very much appreciate any guidance on this!
@klibr007 @YichenTFlexciton I am just another person experiencing this issue. I know as much as you do about whether Microsoft has a fix in the works.
Hit this issue a couple of times upgrading from 1.24.10 to 1.25.6 - intermittent and a headache to put right... MSFT, any comments?
Any workaround for this?
@akishorekumar
Any workaround for this?
Yes, read the thread above, then find and follow https://github.com/Azure/AKS/issues/3384#issuecomment-1560037743 to delete the PDBs manually.
We have the same issue here upgrading an AKS cluster from v1.25.6 to v1.26.6 using Terraform. Deleting the PDBs manually made the upgrade work.
On the first try, the upgrade failed and left the node in a failed state. To fix the node we wanted to use the scale command, but this also failed because the node could not be drained. It only succeeded after deleting the PDBs manually. That is not a convenient solution at all.
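For now, the only thing that has worked for us is deleting the kube-system PDBs immediately before kicking off the operation, so the drain gets a chance to finish before they are recreated. A rough sketch (resource group, cluster name and version are placeholders):
$ kubectl delete pdb -n kube-system coredns-pdb konnectivity-agent metrics-server-pdb --ignore-not-found
$ az aks upgrade -g MyRG -n MyAKS --kubernetes-version 1.26.6
This is a race against whatever recreates the PDBs, so it may need to be repeated if the drain is still blocked on later nodes.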
@miwithro @abubinski @justindavies @paulgmiller - Any word on this? From what I can tell, there has not been any real follow-up response from Microsoft.
What about the 1-hour timeout? Does it mean that if I have many nodes running applications that take a long time to shut down, I will always hit the Terraform timeout? Imagine I have 10 nodes with 1 app replica per node, each replica takes 10 minutes to be recreated, and the PDB only allows 1 pod to be recreated at a time. That means I need at least 100 minutes to upgrade the Kubernetes node pool, which is more than 60 minutes. Does this timeout apply to a specific node or to the whole node pool?
Describe the bug
Carrying out a cluster upgrade to v1.25, or a cluster destroy at v1.25, fails. This is because the node drain operation times out due to the pod disruption budgets (PDBs) owned by Azure (i.e. calico-typha, coredns-pdb, konnectivity-agent, metrics-server-pdb). To get around this, I have had to either manually delete the PDBs or manually delete the cluster nodes before running a cluster upgrade or destroy. These steps have been carried out through Terraform.
To Reproduce
Upgrade a cluster from v1.24 to v1.25, or destroy a cluster at v1.25.
Expected behavior
The cluster upgrade/destroy should go through without any additional manual interventions.
Logs
Cluster Upgrades:
The following timeout occurred in Terraform after 1 hour
The following status message was obtained from the activity log in Azure. It seems PDB (calico-typha) related:
"{\"status\":\"Failed\",\"error\":{\"code\":\"ResourceOperationFailure\",\"message\":\"The resource operation completed with terminal provisioning state 'Failed'.\",\"details\":[{\"code\":\"ReconcileVMSSAgentPoolFailed\",\"message\":\"Drain of aks-system1-...-vmss000000 did not complete: Too many req pod calico-typha-... on node aks-system1-...-vmss000000: calico-system/calico-typha-... blocked by pdb calico-typha with unready pods [calico-system/calico-typha-...]. See http://aka.ms/aks/debugdrainfailures\"}]}}"
Cluster Destroys:
The following status message was obtained from the activity log in Azure. It seems PDB (calico-typha & metrics-server-pdb) related. However, on other attempts, I have received errors based on the following PDBs:
calico-typha
coredns-pdb
konnectivity-agent
metrics-server-pdb
Environment (please complete the following information):
Additional context
This has been carried out through Terraform.
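As a stop-gap, before retrying an upgrade or destroy I now check for PDBs that currently allow zero disruptions, since those are the ones that will block a node drain (a sketch using a jsonpath filter; the field name assumes the policy/v1 API):
$ kubectl get pdb -A -o jsonpath='{range .items[?(@.status.disruptionsAllowed==0)]}{.metadata.namespace}{"/"}{.metadata.name}{"\n"}{end}'
Any kube-system entries in that list are the ones that have to be deleted manually right before the operation.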