aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

[aws-eks] Stacks with EKS clusters fail tear down due to dangling ELB/ALB holding networking resources #9970

Open ten-lac opened 3 years ago

ten-lac commented 3 years ago

:question: General Issue

The teardown process of a stack runs into race conditions when Kubernetes operators and controllers that manage external deployment of resources are involved. As an example, say we have an ALB Ingress Controller deployed through a Helm chart. Following that, we deploy a couple of Ingress resources, which the ALB Ingress Controller will create ALBs for. When the stack is removed, the removal of the k8s resources often orphans their cloud resource equivalents, which causes the stack to fail cleanup because these orphaned resources leave breadcrumbs that block things like SG/VPC/ENI removal. Is there a way to clean up resources properly when using CDK with operators and controllers? I've tried separating the K8s resources (helm chart/manifest) into a separate stack so I could manually invoke a sleep between the cdk destroy commands, but I ran into trouble even separating the stacks out due to the circular dependencies between the cluster and these K8s overlay resources.

A side note: I'm aware that the ALB Ingress Controller adds finalizers to the Ingress resources it manages. This means the resource isn't deleted from the K8s control plane until the Ingress Controller has removed all the AWS resources, which is good. Maybe the aws-eks resource delete mechanism doesn't wait?

These are some scenarios I've found which cause race conditions during stack teardown.
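
To make the scenario concrete, the wiring looks roughly like this (a sketch only; the chart values and Ingress spec are illustrative). The explicit dependency is what I would expect to drive the deletion order, i.e. the Ingress, and the ALB its finalizer removes, should go away while the controller is still installed:

import { Stack } from '@aws-cdk/core';
import { Cluster, KubernetesVersion } from '@aws-cdk/aws-eks';

const stack = new Stack();
const cluster = new Cluster(stack, 'EKS', { version: KubernetesVersion.V1_21 });

// The controller that owns the Ingress finalizer and creates the external ALBs.
const albController = cluster.addHelmChart('AlbIngressController', {
  chart: 'aws-load-balancer-controller',
  repository: 'https://aws.github.io/eks-charts',
  namespace: 'kube-system',
  values: { clusterName: cluster.clusterName },
});

// An Ingress the controller will create an ALB for.
const ingress = cluster.addManifest('AppIngress', {
  apiVersion: 'networking.k8s.io/v1',
  kind: 'Ingress',
  metadata: { name: 'app' },
  spec: { /* rules omitted */ },
});

// CloudFormation deletes dependents first, so on teardown the Ingress (and the
// ALB its finalizer removes) is deleted while the controller is still installed.
ingress.node.addDependency(albController);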

iliapolo commented 3 years ago

Hi @ten-lac - Thanks for reporting this! We have seen this happen a few times now and are considering how to address this.

I've tried separating the K8s resources (helm chart/manifest) into a separate stack so I could manually invoke a sleep between the cdk destroy commands, but I ran into trouble even separating the stacks out due to the circular dependencies between the cluster and these K8s overlay resources.

As far as this workaround goes, we know that circular dependencies can happen quite often when working with multiple stacks. In fact, there is a PR that addresses some of these scenarios. I'm interested specifically in the cases where you encounter these circular dependencies; could you share a few snippets you've tried that suffer from this?

Regardless, having to split out to a different stack and manually orchestrate the destruction with sleep is definitely not a solution we want to land on.

Maybe the aws-eks resource delete mechanism doesn't wait?

Indeed, when a resource is deleted, we call kubectl delete, which is an async operation. The solution is probably going to be using kubectl wait.

Stay tuned :)

ten-lac commented 3 years ago

@iliapolo. Here is the code snippet that would cause a circular dependency.

const { App, Stack } = require('@aws-cdk/core');
const { Cluster, KubernetesManifest, KubernetesVersion } = require('@aws-cdk/aws-eks');

class MyApp extends App {
    constructor(scope, id, props) {
        super(scope, id, props);

        const { cluster } = new Underlay(this, 'MyUnderlay');
        new Overlay(this, "MyOverlay", { cluster });
    }
}

class Underlay extends Stack {
    constructor(scope, id, props) {
        super(scope, id, props);

        const cluster = new Cluster(this, 'EKS', {
            clusterName: 'fake-cluster',
            version: KubernetesVersion.V1_17
        });

        Object.assign(this, { cluster });
    }
}

class Overlay extends Stack {
    constructor(scope, id, props) {
        super(scope, id, props);

        const { cluster } = props;

        const manifest1 = new KubernetesManifest(this, 'namespace-manifest', {
            cluster,
            manifest: [
                // some namespace yaml
            ]
        });

        const manifest2 = new KubernetesManifest(this, 'deployment-manifest', {
            cluster,
            manifest: [
                // some deployment yaml
            ]
        });

        manifest2.node.addDependency(manifest1);
    }
}
eladb commented 3 years ago

I ran into this today. Here are some details:

Repro

  1. Deploy EKS cluster integ test
  2. Delete the stack

Result

The deletion of the resource WebServiceSecurityGroupA556AEB5 failed with the following error:

resource sg-0bd8dbb175db76b34 has a dependent object (Service: AmazonEC2; Status Code: 400; Error Code: DependencyViolation; Request ID: fb267a6d-df25-49cd-9374-e206fc0f4e8c; Proxy: null)

This cascaded to additional VPC resources.

From initial investigation, the security group cannot be deleted because there are existing network interfaces that use it. The description of these interfaces is "ELB a17226b063c3c41bcb79928dc423bb2f", which implies that they are used by the ELB created by EKS.

I manually deleted that ELB and these interfaces were deleted with it, releasing the dependency.

Screen Shot 2020-12-17 at 11 46 10
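
If it helps anyone else track this down, the blocking ENIs can be listed like this (a sketch with the AWS SDK for JavaScript v3; the group id is the one from the error above):

import { EC2Client, DescribeNetworkInterfacesCommand } from '@aws-sdk/client-ec2';

async function listBlockingEnis(securityGroupId: string) {
  const ec2 = new EC2Client({});
  // ENIs still attached to the security group that refuses to delete; their
  // Description usually names the owner, e.g. "ELB a17226b063c3c41bcb79928dc423bb2f".
  const { NetworkInterfaces } = await ec2.send(
    new DescribeNetworkInterfacesCommand({
      Filters: [{ Name: 'group-id', Values: [securityGroupId] }],
    }),
  );
  for (const eni of NetworkInterfaces ?? []) {
    console.log(eni.NetworkInterfaceId, eni.Status, eni.Description);
  }
}

listBlockingEnis('sg-0bd8dbb175db76b34').catch(console.error);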

zxkane commented 3 years ago

I hit the same issue.

It's caused by the AWS Load Balancer Controller asynchronously removing the ALB after the ingress is deleted via helm uninstall. It would be nice to add a watch feature on addHelmChart to wait for the resources to be purged.

I did something similar when purging an EKS deployment, see below:

https://github.com/aws-samples/nexus-oss-on-aws/pull/25/files#diff-ff7374c772179f1952a1847bc9d4b490e3660b6e81e75099b97fa998814f763fR111
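
The idea is essentially to block until the load balancers created by the controller are gone before the networking teardown continues. A simplified sketch of that wait (not the actual code from the PR; it assumes the cluster's VPC id is known):

import {
  ElasticLoadBalancingV2Client,
  DescribeLoadBalancersCommand,
} from '@aws-sdk/client-elastic-load-balancing-v2';

// Poll until no ALBs/NLBs remain in the given VPC, or give up after a timeout.
async function waitForLoadBalancersGone(vpcId: string, timeoutMs = 15 * 60 * 1000): Promise<void> {
  const elbv2 = new ElasticLoadBalancingV2Client({});
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    const { LoadBalancers } = await elbv2.send(new DescribeLoadBalancersCommand({}));
    const remaining = (LoadBalancers ?? []).filter((lb) => lb.VpcId === vpcId);
    if (remaining.length === 0) {
      return;
    }
    await new Promise((resolve) => setTimeout(resolve, 30_000));
  }
  throw new Error(`Load balancers in ${vpcId} were not purged within the timeout`);
}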

iliapolo commented 3 years ago

Apparently kubectl delete defaults to waiting for resource finalizers. As seen in the KubectlHandler logs:

[INFO]  2020-12-22T22:34:42.189Z    52ccb5d0-c691-4699-9887-417a3e7217cd    Running command: ['kubectl', 'delete', '--kubeconfig', '/tmp/kubeconfig', '-f', '/tmp/manifest.yaml']

[INFO]  2020-12-22T22:35:05.283Z    52ccb5d0-c691-4699-9887-417a3e7217cd    b'service "webservice" deleted\n'

Command execution takes 23 seconds, so it's obviously not returning immediately as I initially suspected. This means that CDK is doing the right thing, and it's now up to individual resource providers to handle deletion properly.

I don't see a systematic way for CDK to ensure those resources are deleted. We should handle this on a case by case basis.

Specifically for the ELB scenario described by @eladb - we will continue to investigate, I suspect there might be an issue with the EnsureLoadBalancerDeleted method in the legacy aws provider, where a failure to delete the ELB isn't propagated as a failure of kubectl delete.

@zxkane I'm wondering about the solution you mentioned, given that kubectl delete waits for resource purging, is this perhaps a helm specific issue you encountered?

zxkane commented 3 years ago

@zxkane I'm wondering about the solution you mentioned, given that kubectl delete waits for resource purging, is this perhaps a helm specific issue you encountered?

Yes. I observed that the ingress was deleted a few minutes after the helm chart was uninstalled successfully.

❯ helm uninstall --timeout 15m nexus3
release "nexus3" uninstalled

~/git/Nexus-oss-on-aws-package mainline 1m 8s
❯ kubectl get ingress -A -w
NAMESPACE   NAME                    HOSTS                           ADDRESS                                                                          PORTS   AGE
default     nexus3-sonatype-nexus   zhy-nexus.xxx.com   k8s-default-nexus3so-9f6d2cab00-1394910236.cn-northwest-1.elb.amazonaws.com.cn   80      22m
default     nexus3-sonatype-nexus   zhy-nexus.xxx.com   k8s-default-nexus3so-9f6d2cab00-1394910236.cn-northwest-1.elb.amazonaws.com.cn   80      23m
^C
~/git/Nexus-oss-on-aws-package mainline 4m 45s
❯ kubectl get ingress -A
No resources found
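
So the helm release is gone well before the Ingress object (and its ALB) actually disappears. A workaround is to block on the Ingress itself before continuing, e.g. (a sketch; the name and namespace are taken from the output above):

import { execFileSync } from 'child_process';

// Block until the controller's finalizer has removed the ALB and the Ingress
// object itself is gone from the cluster.
execFileSync('kubectl', [
  'wait',
  '--for=delete',
  'ingress/nexus3-sonatype-nexus',
  '--namespace', 'default',
  '--timeout=15m',
]);
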
robertd commented 3 years ago

FWIW, this guide helped me track down the dependency: https://aws.amazon.com/premiumsupport/knowledge-center/troubleshoot-dependency-error-delete-vpc/

rafaelpereyra commented 3 years ago

I just found an issue that might or might not be related to this one. My Stack deploys EKS with AWS Load Balancer controller and the Target Group CRDs.

During teardown, the deletion of the CRD resource times out the lambda function (it doesn't fail; it times out at 15 minutes x 3 retries).

Attached the relevant CW Logs in md format.

Cloudformation reports the error as:

"Custom Resource failed to stabilize in expected time. If you are using the Python cfn-response module, you may need to update your Lambda function code so that CloudFormation can attach the updated version."

Log.zip

avallete commented 3 years ago

Hi there,

We have an issue on our side that seems related. On our side, cdk destroy sometimes fails to remove our CloudFormation stacks because it fails to remove a dependent ENI, and the ENI fails to be removed because it fails to remove a SecurityGroup related to it. Sometimes the same error appears on S3 bucket removal.

The funny thing is that the error appears randomly, without any changes to the CDK code. Looks like some kind of race condition.

revmischa commented 2 years ago

I see this as well. My Lambdas in a VPC with Aurora access consistently fail to delete because the security group depends on active ENIs. When CDK deletes the Lambda, it fails because the security group can't be deleted while the ENIs are still active.

otaviomacedo commented 2 years ago

@avallete and @revmischa, this is good information. From your descriptions, it seems you're not even using EKS. Is that correct? Could any of you provide me with a small example I can use to reproduce the issue?

revmischa commented 2 years ago

I'm not using EKS, just Lambda + Aurora Serverless in a VPC, using func.connections.allowToDefaultPort(db).
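
The setup is roughly this shape (a minimal sketch; the handler body and construct names are placeholders):

import { App, Stack } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as rds from 'aws-cdk-lib/aws-rds';

const app = new App();
const stack = new Stack(app, 'LambdaAuroraRepro');

const vpc = new ec2.Vpc(stack, 'Vpc');

const db = new rds.ServerlessCluster(stack, 'Db', {
  engine: rds.DatabaseClusterEngine.AURORA_MYSQL,
  vpc,
});

const func = new lambda.Function(stack, 'Fn', {
  runtime: lambda.Runtime.NODEJS_18_X,
  handler: 'index.handler',
  code: lambda.Code.fromInline('exports.handler = async () => "ok";'),
  vpc, // gives the function ENIs in the VPC
});

// The connection described above; on destroy, the security groups can only be
// deleted once the Lambda-managed ENIs are released.
func.connections.allowToDefaultPort(db);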

namgk commented 1 year ago

This is so annoying, we got the same issue.

Even though the ALB is deleted, the ENIs won't be released right away, and there's no way to tell when they will be.

We tried eksctl as well, but it, too, miserably fails to remove the cluster.

Nothing works!

Bonus: you can't find the ALB based on tags :|

JCBSLMN commented 1 year ago

Same issue: the ALB controller causes cdk destroy to hang. I have to go and manually delete resources from the console.

adriantaut commented 12 months ago

@rafaelpereyra we hit the same 3 x 15 min timeouts during the cdk destroy procedure.

The snippet below showcases our implementation. During cdk destroy we found that the SG rules are deleted almost immediately after the deletion is requested, so there is no communication between the provider and onEvent handler lambdas that are supposed to delete the Helm charts/manifests.


    const cluster = new Cluster(this, 'Cluster', {
      clusterName: this.clusterName,
      version: KubernetesVersion.of(this.clusterVersion),
      defaultCapacity: AVOID_INITIAL_CAPACITY_ALLOCATION,
      vpcSubnets: [{ subnetType: SubnetType.PRIVATE_WITH_EGRESS }], // this will look for subnets with tag `aws-cdk:subnet-type=Private`
      endpointAccess: EndpointAccess.PRIVATE, // no access outside of your VPC,
      vpc,
      mastersRole,
      securityGroup: controlPlaneSG,
      albController: props.albController ?? DEFAULT_ALB_CONTROLLER,
      kubectlLayer: lambdaLayer,
    });

..................................

    // Security group applied to control plane, nodes and all pods that do not match explicit SecurityGroupPolicy
    this.clusterSG = SecurityGroup.fromSecurityGroupId(this, 'ClusterSecurityGroup', cluster.clusterSecurityGroupId, {
      allowAllOutbound: false,
    });

..................................

    // Allow nodes accessing "EKS Lambdas"
    this.clusterSG.addEgressRule(Peer.anyIpv4(), Port.tcp(443), 'Needed by EKS Lambdas');

..................................

    const launchTemplate = new CfnLaunchTemplate(...)

    const nodeGroup = this.cluster.addNodegroupCapacity(...)

..................................

    this.cluster.addHelmChart('ExternalDNS', {
      chart: 'external-dns',
      release: 'external-dns',
      version: props?.externalDNSChartVersion ?? DEFAULT_EXTERNAL_DNS_CHARTS.externalDNSChartVersion,
      repository: 'https://charts.bitnami.com/bitnami',
      namespace: 'kube-system',
      values: {...}
    });

Did somebody find a way to use node.addDependency() or something similar to prevent this from happening?
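
What I have in mind is something along these lines (a sketch continuing the snippet above, so this.cluster and nodeGroup are reused and only illustrative; whether this ordering is enough to keep the SG rules around long enough is exactly the question):

    // Keep a handle on the chart so its deletion can be ordered explicitly.
    const externalDns = this.cluster.addHelmChart('ExternalDNS', {
      chart: 'external-dns',
      release: 'external-dns',
      repository: 'https://charts.bitnami.com/bitnami',
      namespace: 'kube-system',
    });

    // Delete the chart's custom resource before the node group (and the rest of
    // the networking it relies on) is torn down.
    externalDns.node.addDependency(nodeGroup);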

keeganmccallum commented 11 months ago

This issue is wild; it seems like such a concrete, common use case. I'm currently considering wiring up a NodePort to an LB manually so CDK can manage it at least, vs. k8s... but c'mon.

bmiller-pm commented 8 months ago

Bah, first time I've tried cdk. Used a simple eks sample and destroy failed on me. Is this worth it?

adriantaut commented 8 months ago

In theory, once you have a running EKS cluster you wouldn't destroy it on a regular basis :)

We have been running our EKS clusters with CDK for more than 2 years and it's definitely worth it.

bmiller-pm commented 8 months ago

Theory: create a development cluster, do stuff, destroy it. I'm going to use Terraform.

lottotto commented 4 months ago

I was able to successfully destroy the stack after using this setting: https://docs.aws.amazon.com/cdk/api/v2/docs/aws-cdk-lib.aws_eks-readme.html#:~:text=Every%20Kubernetes%20manifest,be%20done%20explicitly.

I use EKS in conjunction with Argo CD. By adding finalizers appropriately and ordering the manifests and Helm charts with their deletion order in mind, the stack can now be destroyed correctly, even when the ingress is deployed with the app-of-apps pattern in Argo CD.

Of course, the ALB ingress for Argo CD is also applied together.

I'm sorry if my solution didn't fit this issue.