I have an older cluster with an nginx deployment whose deletion blocks in the same state. @eti-tme-tim thanks for finding the workaround.
@TiberiuGC Any traction on this issue?
Hi @eti-tme-tim. I'm afraid I have no updates for now. We have reduced capacity in this period and needed to prioritise other work.
Hi @eti-tme-tim - it seems that this addon comes with a default Pod Disruption Budget policy, configured as follows:
eksctl % kubectl get pdb -A
NAMESPACE     NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
kube-system   ebs-csi-controller   N/A             1                 1                     11m
And since there are two aws-ebs-csi-driver pods and this rule allows only 1 to be unavailable at a time, the other ends up with the error we're seeing (i.e. `1 pods are unevictable from node …`). The solution is to use the --disable-nodegroup-eviction flag while deleting the cluster. This will bypass checking Pod Disruption Budget policies.
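For reference, a delete invocation using that flag might look like the sketch below (the `cluster1.yaml` file name is borrowed from the reproduction section; adjust for your setup):

```shell
# Delete the cluster described by the config file, skipping the PDB-aware
# eviction step while draining nodegroups (pods are deleted outright instead).
eksctl delete cluster -f cluster1.yaml --disable-nodegroup-eviction
```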
I've opened a PR to document the behaviour.
What were you trying to accomplish?
Delete the cluster created by a YAML file that includes the AWS EBS CSI driver addon.
What happened?
Cluster creation works without error. PVC and PV creation via the EBS CSI driver on this cluster works without error. The following pods are running when it is time to delete the cluster:
When you issue the deletion command, the workflow hangs because the compute nodes cannot be drained successfully:
The pods still present while the node drain hangs (and eventually times out) are:
I've tracked the issue down to the fact that the ebs-csi-controller deployment is not getting deleted properly. As soon as I manually delete it via:
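(The exact command isn't shown here; a minimal sketch of that manual deletion, assuming the addon's default kube-system namespace and the deployment name from the PDB output above, would be:)

```shell
# Assumed cleanup step: remove the controller Deployment so its
# PodDisruptionBudget no longer blocks the node drain.
kubectl -n kube-system delete deployment ebs-csi-controller
```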
The workflow then proceeds just fine and the cluster deletion completes successfully. I've also found that a timed-out, failed eksctl delete command can essentially be re-run without issue once the ebs-csi-controller deployment is deleted.
How to reproduce it?
Commands:
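(The exact commands aren't preserved; a representative create/delete sequence against the config file below would be:)

```shell
# Create the cluster from the config file, exercise EBS-backed volumes,
# then delete it the same way.
eksctl create cluster -f cluster1.yaml
eksctl delete cluster -f cluster1.yaml
```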
cluster1.yaml:
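(Likewise not preserved; a minimal ClusterConfig declaring the aws-ebs-csi-driver addon with its IAM role could look like this hedged reconstruction, where the name, region, and nodegroup sizing are placeholders:)

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: cluster1      # placeholder cluster name
  region: us-east-1   # placeholder region
  version: "1.23"

iam:
  withOIDC: true      # needed so the addon can use an IAM role for its service account

addons:
  - name: aws-ebs-csi-driver
    wellKnownPolicies:
      ebsCSIController: true   # attaches the EBS CSI controller IAM policy

managedNodeGroups:
  - name: ng-1
    instanceType: m5.large
    desiredCapacity: 2
```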
Anything else we need to know?
All client-side binaries and software were installed via Homebrew. And, I'm really hoping I'm doing something stupidly and simply wrong. :)
Note: with the EKS 1.23 changes related to EBS CSI (namely, needing the service role), this is the first time we've had to use an addon. We've not had cluster deletion issues with EKS 1.22 (where we did not explicitly declare the addon and IAM role). It kinda seems like the addon should be deleted from the node pool before the nodes are drained (although I could envision other problems with that approach if those nodes weren't evacuated before starting the drain, as in my case).
Versions