aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

(aws eks): EKS stack deletes resources in the wrong order, causing DELETE_FAILED #18650

Open mvs5465 opened 2 years ago

mvs5465 commented 2 years ago

What is the problem?

Running cdk destroy on an EKS cluster stack always results in DELETE_FAILED.

It appears that CloudFormation tries to delete the security group before deleting the cluster, so the deletion fails because the security group is still in use.

This error is returned when CloudFormation deletes the control plane security group:

resource <security group id> has a dependent object (Service: AmazonEC2; Status Code: 400; Error Code: DependencyViolation; Request ID: <request id>; Proxy: null)

The cloudformation stack itself then fails to delete with an error like this:

The following resource(s) failed to delete: [<control plane security group name>, <eks fargate profile name>].

Reproduction Steps

Define a new cluster:

import { aws_eks } from 'aws-cdk-lib';
import { EndpointAccess } from 'aws-cdk-lib/aws-eks';
import { SubnetType } from 'aws-cdk-lib/aws-ec2';

new aws_eks.FargateCluster(this, id, {
  version: this.props.version,
  vpc: this.props.vpc,
  endpointAccess: EndpointAccess.PRIVATE,
  placeClusterHandlerInVpc: true,
  vpcSubnets: [{
    subnetType: SubnetType.PRIVATE_WITH_NAT,
  }],
});

Then run cdk deploy. After it succeeds, run cdk destroy and the error will occur.

What did you expect to happen?

The handler should delete the EKS cluster first, and only then delete the security group.

What actually happened?

The handler deletes the security group first, which fails because the resource is still in use. The stack then ends up in ROLLBACK_FAILED and/or DELETE_FAILED.

CDK CLI Version

2.8.0 (build 8a5eb49)

Framework Version

No response

Node.js Version

v17.3.1

OS

MacOS Catalina 10.15.7

Language

Typescript

Language Version

No response

Other information

No response

mvs5465 commented 2 years ago

This is for a Fargate cluster; it seems like the Fargate profile needs to be deleted before the cluster, and maybe that is what is causing this.
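
If that's the case, one possible (untested) mitigation is an explicit dependency, since CloudFormation deletes resources in the reverse order of their dependencies. A minimal sketch, assuming the control plane security group is reachable via cluster.connections.securityGroups (this may vary by CDK version):

declare const cluster: aws_eks.FargateCluster;

// CloudFormation deletes resources in reverse dependency order, so making
// the default Fargate profile depend on the security group should force
// CloudFormation to delete the profile before the security group.
const controlPlaneSecurityGroup = cluster.connections.securityGroups[0];
cluster.defaultProfile.node.addDependency(controlPlaneSecurityGroup);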

peterwoodworth commented 2 years ago

Thanks for reporting this @mvs5465,

I'm pretty sure I've run into this before as well, but I'm not sure whether we're able to do anything about this or whether it's in CloudFormation's control to fix. I don't know of any way to customize the deletion of a CloudFormation stack, and I'm not sure how stack destruction works under the hood. @otaviomacedo do you know anything about this issue?

mvs5465 commented 2 years ago

@peterwoodworth Thanks. For what it's worth, this stopped happening and I haven't quite figured out how to replicate it. We've since added a lot more configuration to our Fargate cluster creation; I think it may have stopped when we set the mastersRole property.
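
For reference, this is roughly what that change looks like; a minimal sketch, with an illustrative role name and principal rather than our exact config:

import { aws_eks, aws_iam } from 'aws-cdk-lib';

// Illustrative masters role; the principal should be whoever needs
// system:masters access to the cluster.
const mastersRole = new aws_iam.Role(this, 'ClusterMastersRole', {
  assumedBy: new aws_iam.AccountRootPrincipal(),
});

new aws_eks.FargateCluster(this, id, {
  version: this.props.version,
  vpc: this.props.vpc,
  mastersRole,
  // ...remaining props as in the reproduction above
});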

adriantaut commented 11 months ago

Unfortunately we are hitting the same issue: all the Manifests/HelmCharts fail to be removed. The kubectl custom resources go through 3x15-minute timeouts between the provider and the onEvent handler because the SG rules were destroyed first.

fsellecchia commented 3 months ago

Hi, any updates on this? Is there any workaround? I'm facing a similar issue with EKS manifests: all of them fail to delete, I think because the Lambda handler gets deleted before the manifests.
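
One workaround sketch that might help (untested; the context key and manifest object here are assumptions): do a two-phase teardown by gating the manifests behind a context flag, deploying once with the flag set so the manifests are deleted while the kubectl handler still exists, and only then running cdk destroy:

declare const cluster: aws_eks.Cluster;
declare const manifestObject: Record<string, any>;

// Phase 1: `cdk deploy -c removeManifests=true` deletes the manifests while
// the kubectl handler and its networking still exist.
// Phase 2: `cdk destroy` then has no manifests left to delete.
if (this.node.tryGetContext('removeManifests') !== 'true') {
  cluster.addManifest('AppManifest', manifestObject);
}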