[Bug] Cluster Deletion fails with "Error: deadline surpassed waiting for AWS load balancers to be deleted"

fbuchmeier-abi commented 7 months ago

What were you trying to accomplish?

I'm trying to delete an eksctl-managed cluster that contains AWS Application Load Balancers managed by the aws-lb-controller (https://kubernetes-sigs.github.io/aws-load-balancer-controller).

What happened?

Cluster deletion times out with the error below:

"cmd": [
        "eksctl",
        "delete",
        "cluster",
        "--region",
        "eu-central-1",
        "--name",
        "sandbox",
        "--wait"
    ],
}

STDOUT:

2024-02-09 20:02:45 [ℹ]  deleting EKS cluster "sandbox"
2024-02-09 20:02:46 [ℹ]  will drain 0 unmanaged nodegroup(s) in cluster "sandbox"
2024-02-09 20:02:46 [ℹ]  starting parallel draining, max in-flight of 1
2024-02-09 20:02:46 [ℹ]  deleted 0 Fargate profile(s)
2024-02-09 20:02:47 [✔]  kubeconfig has been updated
2024-02-09 20:02:47 [ℹ]  cleaning up AWS load balancers created by Kubernetes objects of Kind Service or Ingress

STDERR:

Error: deadline surpassed waiting for AWS load balancers to be deleted: k8s-sharedtools-5732128751

How to reproduce it?

Deploy a new EKS cluster (I used 1.28) with eksctl >= 0.144.0 and the vpc-cni addon
Provision the aws-lb-controller as described in the docs: https://kubernetes-sigs.github.io/aws-load-balancer-controller/v2.7/deploy/installation/
Set up an Ingress referencing an Application Load Balancer. In my case, I am using annotations on the Ingress object:
```
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: alb
```
Wait until the load balancer has been successfully created.

Delete the EKS cluster

eksctl delete cluster --region eu-central-1 --name sandbox --wait

Anything else we need to know?

According to my research, the problem occurs because the AWS VPC CNI (aws-node daemonset) is deleted before the associated Kubernetes Service and Ingress objects. Once the CNI daemonset is gone, the aws-lb-controller pods fail to process the finalizers for these objects. The objects then get stuck and cannot be deleted in Kubernetes.
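
A minimal sketch of how such stuck objects could be spotted, assuming the input is the `items` list from `kubectl get ingress,service --all-namespaces -o json`. The function name `stuck_on_finalizers` is hypothetical; the check relies only on the standard `metadata.deletionTimestamp` and `metadata.finalizers` fields.

```python
from typing import Any


def stuck_on_finalizers(items: list[dict[str, Any]]) -> list[str]:
    """Return 'Kind/namespace/name' for objects that are marked for deletion
    (deletionTimestamp is set) but still carry at least one finalizer."""
    stuck = []
    for obj in items:
        meta = obj.get("metadata") or {}
        # An object is only removed once its finalizers list is empty.
        if meta.get("deletionTimestamp") and meta.get("finalizers"):
            kind = obj.get("kind", "?")
            namespace = meta.get("namespace", "default")
            name = meta.get("name", "?")
            stuck.append(f"{kind}/{namespace}/{name}")
    return stuck
```

If this returns a non-empty list while the aws-lb-controller pods are not running, that would be consistent with the explanation above.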

For me, the cluster deletion process runs as follows:

VPC CNI gets deleted: https://github.com/aaroniscode/eksctl/blob/main/pkg/actions/cluster/owned.go#L95
Shared resources get deleted: https://github.com/aaroniscode/eksctl/blob/main/pkg/actions/cluster/owned.go#L105
Shared resources include AWS LB: https://github.com/aaroniscode/eksctl/blob/main/pkg/actions/cluster/delete.go#L63
AWS LB cleanup now (since PR: https://github.com/eksctl-io/eksctl/pull/6389) includes deletion of AWS LB Controller managed resources: https://github.com/aaroniscode/eksctl/blob/08bd92c91037ca21ec18c04277d9d6ba4d21d704/pkg/elb/cleanup.go#L96C2-L96C18

This issue has been happening for me since the upgrade to >= 0.144: https://github.com/eksctl-io/eksctl/releases/tag/v0.144.0 and was probably introduced with: https://github.com/eksctl-io/eksctl/pull/6389

Versions

eksctl info
eksctl version: 0.169.0
kubectl version: v1.24.10
OS: linux

Best regards, Florian.

github-actions[bot] commented 6 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

sohrabjs commented 1 month ago

Try the instructions here: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-delete.html

These are the steps to delete an Application Load Balancer:

If you have a CNAME record for your domain that points to your load balancer, point it to a new location and wait for the DNS change to take effect before deleting your load balancer.
Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/
On the navigation pane, under LOAD BALANCING, choose Load Balancers.
Select the load balancer, and then choose Actions, Delete.
When prompted for confirmation, choose Yes, Delete.

I got the same error. I then deleted the load balancer directly from the Amazon console and ran the command again:

$ eksctl delete cluster --name --region us-east-1

It deleted successfully.

fbuchmeier-abi commented 1 month ago

Thanks, that sounds reasonable. For the time being, I've implemented a similar routine: I get all Ingress and Service resources from the cluster, filter them for any that are related to the aws-loadbalancer-controller (*), and then delete the associated Kubernetes resources. Only once they have been successfully deleted do I continue to delete the cluster.

(*)

Services can be of spec.loadBalancerClass == 'service.k8s.aws/nlb'
Ingresses can be of spec.ingressClassName == 'alb'
or they can have the following annotation metadata.annotations."kubernetes.io/ingress.class" == 'alb'
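
The filter described above could be sketched roughly as follows. This is not the actual routine, only an illustration of the three criteria listed; it assumes the input is the `items` list from `kubectl get ingress,service --all-namespaces -o json`, and the function names are hypothetical. Other ways of attaching a resource to the controller are not covered.

```python
from typing import Any

NLB_CLASS = "service.k8s.aws/nlb"
ALB_CLASS = "alb"
INGRESS_CLASS_ANNOTATION = "kubernetes.io/ingress.class"


def is_lb_controller_managed(obj: dict[str, Any]) -> bool:
    """Apply the three criteria from the comment above to a single object."""
    kind = obj.get("kind")
    spec = obj.get("spec") or {}
    annotations = (obj.get("metadata") or {}).get("annotations") or {}
    if kind == "Service":
        return spec.get("loadBalancerClass") == NLB_CLASS
    if kind == "Ingress":
        return (
            spec.get("ingressClassName") == ALB_CLASS
            or annotations.get(INGRESS_CLASS_ANNOTATION) == ALB_CLASS
        )
    return False


def select_for_deletion(items: list[dict[str, Any]]) -> list[tuple[str, str, str]]:
    """Return (kind, namespace, name) for every object matching the filter."""
    selected = []
    for obj in items:
        if is_lb_controller_managed(obj):
            meta = obj.get("metadata") or {}
            selected.append(
                (obj["kind"], meta.get("namespace", "default"), meta.get("name", ""))
            )
    return selected
```

Each resulting tuple could then be passed to `kubectl delete <kind> <name> -n <namespace>` while the controller is still running, and `eksctl delete cluster` started only after those deletions have completed.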
