aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

(aws-eks): EKS Cluster destruction hanging as the cluster is destroyed before Custom::AWSCDK-EKS-KubernetesResource #11802

Closed: chillitom closed this issue 2 years ago

chillitom commented 3 years ago

When destroying a stack that contains an EKS cluster, the destroy can hang for hours because the cluster is destroyed before its manifest resources are disposed of.

In the case below the destroy hangs on the 'server-api' manifest resource. At the point of the hang the cluster and its instance have already been destroyed.

The manifest is likely to be slow to remove because Kubernetes has to perform several actions to deprovision the load balancers defined in it.

All other items in the stack are destroyed, and the stack is left hanging, waiting on the deletion of a resource of type Custom::AWSCDK-EKS-KubernetesResource.
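
To confirm which resource the delete is stuck on, the stack's events can be listed and filtered for the custom resource type above. A rough sketch using the AWS SDK for JavaScript v3 (the stack name and region here are placeholders):

```ts
import { CloudFormationClient, DescribeStackEventsCommand } from '@aws-sdk/client-cloudformation';

// Print any EKS Kubernetes-resource custom resources that are still mid-delete.
// The region and stack name are placeholders for this example.
const cfn = new CloudFormationClient({ region: 'eu-west-2' });

async function findStuckManifests(stackName: string): Promise<void> {
  const { StackEvents = [] } = await cfn.send(
    new DescribeStackEventsCommand({ StackName: stackName }),
  );
  for (const event of StackEvents) {
    if (event.ResourceType === 'Custom::AWSCDK-EKS-KubernetesResource'
        && event.ResourceStatus === 'DELETE_IN_PROGRESS') {
      console.log(`${event.Timestamp} ${event.LogicalResourceId} ${event.ResourceStatus}`);
    }
  }
}

findStuckManifests('MyEksStack').catch(console.error);
```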

Reproduction Steps


        // Fragment from inside a Stack constructor. Assumes the @aws-cdk eks, ec2 and
        // iam modules are imported, request() comes from the 'sync-request' package,
        // and props.vpc is the VPC passed into the stack.
        const clusterName = 'Cluster'

        const mastersRole = new iam.Role(this, 'MastersRole', { assumedBy: new iam.AccountRootPrincipal() })

        const cluster = new eks.Cluster(this,
            'KubeCluster',
            {
                clusterName,
                version: eks.KubernetesVersion.V1_18,
                coreDnsComputeType: eks.CoreDnsComputeType.EC2,
                defaultCapacity: 1,
                defaultCapacityInstance: ec2.InstanceType.of(ec2.InstanceClass.T2, ec2.InstanceSize.MEDIUM),
                vpc: props.vpc,
                vpcSubnets: [
                    { subnetType: ec2.SubnetType.PRIVATE }
                ],
                mastersRole: mastersRole,
            })

        const serviceAccount = cluster.addServiceAccount('aws-load-balancer-controller', {
            name: 'aws-load-balancer-controller',
            namespace: 'default'
        })
        const policyJson = request('GET', 'https://raw.githubusercontent.com/kubernetes-sigs/aws-load-balancer-controller/main/docs/install/iam_policy.json').getBody('utf8')

        const statements = JSON.parse(policyJson).Statement as Array<any>

        statements.forEach(statement => serviceAccount.addToPrincipalPolicy(iam.PolicyStatement.fromJson(statement)))

        // this can't be used with Fargate as it requires cert-manager to be installed
        // cert-manager fails to sign certs as the host name is wrong on Fargate instances
        // probably can use this if we are using an EC2 hosted cluster
        // helm install cert-manager jetstack/cert-manager --namespace cert-manager --version v1.1.0 --set installCRDs=true --create-namespace
        const certManagerChart = new eks.HelmChart(this,
                'cert-manager',
                {
                    cluster,
                    createNamespace: true,
                    namespace: 'cert-manager',
                    repository: 'https://charts.jetstack.io',
                    chart: 'cert-manager',
                    release: 'cert-manager',
                    values: {
                        // https://github.com/jetstack/cert-manager/blob/master/deploy/charts/cert-manager/values.yaml
                        installCRDs: true,
                    },
                    version: 'v1.1.0'
                })

        // helm install aws-load-balancer-controller eks/aws-load-balancer-controller --set serviceAccount.create=false --set serviceAccount.name=aws-load-balancer-controller --set image.tag=v2.0.1 --set clusterName=TradingCluster --set region=eu-west-2 --set vpcId=vpc-0b5ac955f7f62cca5
        const albChart = cluster.addHelmChart('ApplicationLoadBalancer', {
            repository: 'https://aws.github.io/eks-charts',
            chart: 'aws-load-balancer-controller',
            release: 'aws-load-balancer-controller',
            values:
            { // https://github.com/aws/eks-charts/blob/master/stable/aws-load-balancer-controller/values.yaml
                clusterName: cluster.clusterName,
                image: {
                    tag: 'v2.0.1',
                },
                serviceAccount: {
                    create: false,
                    name: 'aws-load-balancer-controller'
                },
                region: 'eu-west-2',
                vpcId: props.vpc.vpcId
            }
        })

        albChart.node.addDependency(serviceAccount)

        // add API load balancer
        cluster.addManifest('server-api',
            {
                "apiVersion": 'v1',
                "kind": 'Service',
                "metadata": {
                    "name": 'api',
                    "annotations": {
                        "service.beta.kubernetes.io/aws-load-balancer-type": 'nlb-ip',
                        "service.beta.kubernetes.io/aws-load-balancer-internal": 'true',
                        "service.beta.kubernetes.io/aws-load-balancer-additional-resource-tags": 'ServerApi=true'
                    }
                },
                "spec": {
                    "ports": [
                        {
                            "port": 80,
                            "targetPort": 80,
                            "protocol": 'TCP'
                        }
                    ],
                    "type": 'LoadBalancer',
                    "selector": {
                        "orleans/serviceId": 'server'
                    }
                }
            })
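
As a possible interim workaround (a sketch only, based on the assumption that the NLB created by the 'server-api' Service has to be deprovisioned by the aws-load-balancer-controller before the controller and cluster are removed), the construct returned by addManifest() can be given an explicit dependency on the controller chart. CloudFormation reverses dependency order on delete, so the Service would be removed while the controller is still running:

```ts
// Capture the construct returned by addManifest() so the ordering can be made explicit.
const serverApiManifest = cluster.addManifest('server-api', {
  /* ...same Service definition as above... */
});

// Created after the controller chart, therefore deleted before it, so the
// controller is still around to tear down the NLB when the Service is removed.
serverApiManifest.node.addDependency(albChart);
```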

What did you expect to happen?

Either the manifest deletion should complete before the cluster is destroyed, or it should be skipped altogether once the cluster no longer exists.

What actually happened?

The stack deletion hung trying to remove the manifest from a cluster that no longer existed.

Environment

Other


This is a :bug: Bug Report

iliapolo commented 3 years ago

@chillitom Thanks for reporting this.

The cluster is actually destroyed only after the manifests are deleted, because of the natural dependency between them. The problem here is that the deletion of manifests is asynchronous, and we currently do not wait for it to complete before signaling to CloudFormation that the resource has been deleted.

We already have an issue tracking this that we plan to address soon.

https://github.com/aws/aws-cdk/issues/9970
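
For context on what waiting could look like: the CDK custom resource provider framework supports an isComplete handler that is polled until it reports completion. A very rough, hypothetical sketch of a delete-completion check (manifestStillExists() is a made-up stand-in for an actual Kubernetes lookup, and this is not the real EKS provider code):

```ts
// Hypothetical stand-in for a real lookup (e.g. a kubectl get issued by the
// provider); purely illustrative, so it simply reports the object as gone.
async function manifestStillExists(_physicalResourceId?: string): Promise<boolean> {
  return false;
}

// Shape of an isComplete-style check: the framework keeps polling until
// { IsComplete: true } is returned, so a delete would not be reported finished
// while the Kubernetes object (and the load balancer behind it) still exists.
export async function isComplete(event: { RequestType: string; PhysicalResourceId?: string }) {
  if (event.RequestType !== 'Delete') {
    return { IsComplete: true };
  }
  const stillThere = await manifestStillExists(event.PhysicalResourceId);
  return { IsComplete: !stillThere };
}
```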

github-actions[bot] commented 2 years ago

This issue has not received any attention in 1 year. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.