aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

(aws-ecs): hanging on deleting a stack with ASG capacity provider #18179

Open metametadata opened 2 years ago

metametadata commented 2 years ago

What is the problem?

The deletion of a stack with an AsgCapacityProvider hangs unexpectedly.

This is surprising, as we didn't have this issue with the now-deprecated addCapacity, and there are no ECS tasks running in the ASG when we delete the stack.

The behaviour seems to be caused by the default enableManagedTerminationProtection = true.

See the discussion in the original closed issue and my unaddressed comment: https://github.com/aws/aws-cdk/issues/14732#issuecomment-991402770.

Reproduction Steps

Please see https://github.com/aws/aws-cdk/issues/14732.

In short, try to delete a stack with an ECS cluster that uses the AsgCapacityProvider defaults.
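
For illustration, a minimal CDK (TypeScript) sketch of such a setup might look like the following (the original report used the Java CDK; the construct IDs, VPC, and instance type here are placeholders):

    import { Stack, StackProps } from 'aws-cdk-lib';
    import { Construct } from 'constructs';
    import * as ec2 from 'aws-cdk-lib/aws-ec2';
    import * as ecs from 'aws-cdk-lib/aws-ecs';
    import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';

    export class ReproStack extends Stack {
      constructor(scope: Construct, id: string, props?: StackProps) {
        super(scope, id, props);

        const vpc = new ec2.Vpc(this, 'Vpc');
        const cluster = new ecs.Cluster(this, 'Cluster', { vpc });

        const asg = new autoscaling.AutoScalingGroup(this, 'Asg', {
          vpc,
          instanceType: new ec2.InstanceType('t3.micro'),
          machineImage: ecs.EcsOptimizedImage.amazonLinux2(),
        });

        // Defaults apply: enableManagedScaling and
        // enableManagedTerminationProtection are both true.
        const capacityProvider = new ecs.AsgCapacityProvider(this, 'AsgCapacityProvider', {
          autoScalingGroup: asg,
        });
        cluster.addAsgCapacityProvider(capacityProvider);
      }
    }

Deploying this and then running cdk destroy should reproduce the hang described above.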

What did you expect to happen?

Either:

What actually happened?

The CF stack got stuck in DELETE_IN_PROGRESS.

CDK CLI Version

2.3.0

Framework Version

2.3.0

Node.js Version

v16.8.0

OS

macOS

Language

Java

Language Version

11.0.8

Other information

Workaround

My current workaround: set AsgCapacityProvider enableManagedTerminationProtection = false.
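
In CDK (TypeScript) terms, a sketch of this workaround might look like the following, assuming an existing cluster and asg:

    const capacityProvider = new ecs.AsgCapacityProvider(this, 'AsgCapacityProvider', {
      autoScalingGroup: asg,
      // Workaround: with managed termination protection off, CloudFormation can
      // scale the ASG in and delete it during stack deletion.
      enableManagedTerminationProtection: false,
    });
    cluster.addAsgCapacityProvider(capacityProvider);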

Documentation questions/enhancement requests

From https://docs.aws.amazon.com/cdk/api/latest/docs/aws-ecs-readme.html (emphasis mine):

By default, an Auto Scaling Group Capacity Provider will manage the Auto Scaling Group's size for you. It will also enable managed termination protection, in order to prevent EC2 Auto Scaling from terminating EC2 instances that have tasks running on them. If you want to disable this behavior, set both enableManagedScaling to and enableManagedTerminationProtection to false.

1) It's not fully clear from the description that the flag simply disables deletion of the ASG. I got the incorrect impression that it somehow cleverly understands that there are no ECS tasks running and allows deletion in that case.
2) What are the risks of turning this protection off? E.g. we don't want ECS tasks to shut down at random times.
3) Is it OK to set enableManagedTerminationProtection=false + enableManagedScaling=true? It seems to work but is against the documentation ("If you want to disable this behavior, set both enableManagedScaling to and enableManagedTerminationProtection to false.").

jcpage commented 2 years ago

I see the same thing. It hangs for a LONG time, then finally fails. To work around it, I have to manually terminate the ECS EC2 instance. If I had to guess, it seems related to not being able to shut down all (or "the last"?) instances properly (note I only had one at the time).

fschollmeyer commented 2 years ago

Hi everyone, we have the same issue, not just when deleting a cluster, but when trying to update the AMI ID used for the cluster. Updating the MachineImage in the ASG leads to a new LaunchConfiguration and therefore a new Auto Scaling group. Is there any way around this? Or do we have to write a custom resource to enable and disable termination protection on demand?

A-Mckinlay commented 2 years ago

Hi everyone, we have the same issue, not just when deleting a cluster, but when trying to update the AMI ID used for the cluster. Updating the MachineImage in the ASG leads to a new LaunchConfiguration and therefore a new Auto Scaling group. Is there any way around this? Or do we have to write a custom resource to enable and disable termination protection on demand?

@fschollmeyer I have this same issue, did you manage to find a workaround?

Ten0 commented 2 years ago

Hello. The issue for us comes from the following:

Attempting to remove a cluster through CloudFormation while there are still EC2 instances running results in a failure, with the instances running perpetually.

The stack sets up resources in dependency order (and deletes them in reverse order).

On removal, deleting the capacity provider associations and capacity providers removes the managed termination protection, so currently running instances stay perpetually protected, preventing the stack from being removed.

Our current workaround is to put DeletionPolicy: Retain on the AWS::ECS::ClusterCapacityProviderAssociations resource. This makes the deletion fail the first time, because the capacity providers can't be removed while still referenced by the cluster; the ASG deletion then succeeds because managed termination protection still works, and the cluster deletion works (effectively removing the ClusterCapacityProviderAssociations). A subsequent attempt to remove the stack then removes the leftover unbound capacity providers successfully.
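
For anyone applying this from the CDK rather than raw CloudFormation, a rough, non-authoritative sketch of one way to set that policy (stack is assumed to be your Stack instance):

    import { Aspects, IAspect, RemovalPolicy } from 'aws-cdk-lib';
    import * as ecs from 'aws-cdk-lib/aws-ecs';
    import { IConstruct } from 'constructs';

    // Retain every ClusterCapacityProviderAssociations resource on delete so that
    // managed termination protection keeps working while the ASG is torn down.
    class RetainCapacityProviderAssociations implements IAspect {
      visit(node: IConstruct): void {
        if (node instanceof ecs.CfnClusterCapacityProviderAssociations) {
          node.applyRemovalPolicy(RemovalPolicy.RETAIN);
        }
      }
    }

    Aspects.of(stack).add(new RetainCapacityProviderAssociations());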

gshpychka commented 2 years ago

A solution that seems to work for me is to create a custom resource that calls deleteAutoScalingGroup on delete (noop on create or on update), and make the capacity provider depend on the custom resource.
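
In CDK terms the wiring would look roughly like this (a sketch; capacityProvider and asgForceDelete refer to constructs like the ones in the snippet later in this thread):

    // The capacity provider depends on the custom resource, so on delete the
    // capacity provider is removed before the custom resource force-deletes the ASG.
    capacityProvider.node.addDependency(asgForceDelete);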

Ten0 commented 2 years ago

A solution that seems to work for me is to create a custom resource that calls deleteAutoScalingGroup on delete (noop on create or on update), and make the capacity provider depend on the custom resource.

Shouldn't it be the opposite, that resource depending on the capacity provider association, so that the ASG gets removed before the capacity provider is de-associated from the cluster?

gshpychka commented 2 years ago

A solution that seems to work for me is to create a custom resource that calls deleteAutoScalingGroup on delete (noop on create or on update), and make the capacity provider depend on the custom resource.

Shouldn't it be the opposite, that resource depending on the capacity provider association, so that the ASG gets removed before the capacity provider is de-associated from the cluster?

The ASG can't be removed if it's still in use by a capacity provider attached to a cluster. The service gets deleted, then the capacity provider, then the association, then the CR force-deletes the ASG.

Ten0 commented 2 years ago

The ASG can't be removed if it's still in use by a capacity provider attached to a cluster. The service gets deleted, then the capacity provider, then the association, then the CR force-deletes the ASG.

Hmm that's weird, both the issue and solution are the opposite way for me. Looks like this order would be the natural order CF would do without a custom resource. In my case, removing the ASG before the capacity providers works, and is even what enables it to properly remove despite instance termination protection. (https://github.com/aws/aws-cdk/issues/18179#issuecomment-1061849588)

gshpychka commented 2 years ago

Yes, the order is not the issue. The issue is that CloudFormation doesn't force-delete the ASG, so it fails to delete it if there are instances protected from scale-in. The custom resource force-deletes the ASG, terminating all instances, including protected ones.

My solution doesn't require retrying the deletion, it works in a single pass.

metametadata commented 2 years ago

FWIW, I've just noticed that my cluster deletion failed quickly with "DeleteClusters SDK Error: The Cluster cannot be deleted while Container Instances are active or draining.", which is a bit better than hanging. But it also means that the workaround I described in the first message apparently does not help.

ᐅ cdk --version
2.22.0 (build 1db4b16)
software.amazon.awssdk/ecs "2.17.181"
yuri-cluster |   0 | 6:20:41 PM | DELETE_IN_PROGRESS   | AWS::CloudFormation::Stack                    | yuri-cluster User Initiated
yuri-cluster |   0 | 6:20:44 PM | DELETE_IN_PROGRESS   | AWS::SNS::Subscription                        | asg/DrainECSHook/Function/Topic (asgDrainECSHookFunctionTopicFFE1E612) 
yuri-cluster |   0 | 6:20:44 PM | DELETE_IN_PROGRESS   | AWS::Lambda::Permission                       | asg/DrainECSHook/Function/AllowInvoke:yuriclusterasgLifecycleHookDrainHookTopic7A175731 (asgDrainECSHookFunctionAllowInvokeyuriclusterasgLifecycleHookDrainHookTopic7A175731F8622528) 
yuri-cluster |   0 | 6:20:44 PM | DELETE_IN_PROGRESS   | AWS::ECS::ClusterCapacityProviderAssociations | cluster/cluster (clusterA4C38409) 
yuri-cluster |   0 | 6:20:44 PM | DELETE_IN_PROGRESS   | AWS::AutoScaling::LifecycleHook               | asg/LifecycleHookDrainHook (asgLifecycleHookDrainHook7D987AD1) 
yuri-cluster |   0 | 6:20:44 PM | DELETE_IN_PROGRESS   | AWS::EC2::SecurityGroupIngress                | asg/InstanceSecurityGroup/from yurinetworkbastionsg3D24DB3C:ALL TRAFFIC (asgInstanceSecurityGroupfromyurinetworkbastionsg3D24DB3CALLTRAFFIC8742D7F0) 
yuri-cluster |   1 | 6:20:44 PM | DELETE_COMPLETE      | AWS::SNS::Subscription                        | asg/DrainECSHook/Function/Topic (asgDrainECSHookFunctionTopicFFE1E612) 
yuri-cluster |   2 | 6:20:45 PM | DELETE_COMPLETE      | AWS::EC2::SecurityGroupIngress                | asg/InstanceSecurityGroup/from yurinetworkbastionsg3D24DB3C:ALL TRAFFIC (asgInstanceSecurityGroupfromyurinetworkbastionsg3D24DB3CALLTRAFFIC8742D7F0) 
yuri-cluster |   3 | 6:20:46 PM | DELETE_COMPLETE      | AWS::AutoScaling::LifecycleHook               | asg/LifecycleHookDrainHook (asgLifecycleHookDrainHook7D987AD1) 
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster |   3 | 6:20:46 PM | DELETE_IN_PROGRESS   | AWS::IAM::Policy                              | asg/LifecycleHookDrainHook/Role/DefaultPolicy (asgLifecycleHookDrainHookRoleDefaultPolicy0B1C44ED) 
yuri-cluster |   4 | 6:20:47 PM | DELETE_COMPLETE      | AWS::IAM::Policy                              | asg/LifecycleHookDrainHook/Role/DefaultPolicy (asgLifecycleHookDrainHookRoleDefaultPolicy0B1C44ED) 
yuri-cluster |   4 | 6:20:48 PM | DELETE_IN_PROGRESS   | AWS::IAM::Role                                | asg/LifecycleHookDrainHook/Role (asgLifecycleHookDrainHookRole3C1C981B) 
yuri-cluster |   5 | 6:20:49 PM | DELETE_COMPLETE      | AWS::IAM::Role                                | asg/LifecycleHookDrainHook/Role (asgLifecycleHookDrainHookRole3C1C981B) 
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster |   6 | 6:20:54 PM | DELETE_COMPLETE      | AWS::Lambda::Permission                       | asg/DrainECSHook/Function/AllowInvoke:yuriclusterasgLifecycleHookDrainHookTopic7A175731 (asgDrainECSHookFunctionAllowInvokeyuriclusterasgLifecycleHookDrainHookTopic7A175731F8622528) 
yuri-cluster |   6 | 6:20:55 PM | DELETE_IN_PROGRESS   | AWS::SNS::Topic                               | asg/LifecycleHookDrainHook/Topic (asgLifecycleHookDrainHookTopicC6CABF48) 
yuri-cluster |   6 | 6:20:55 PM | DELETE_IN_PROGRESS   | AWS::Lambda::Function                         | asg/DrainECSHook/Function (asgDrainECSHookFunction4A673AE9) 
yuri-cluster |   7 | 6:20:55 PM | DELETE_COMPLETE      | AWS::SNS::Topic                               | asg/LifecycleHookDrainHook/Topic (asgLifecycleHookDrainHookTopicC6CABF48) 
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster |   8 | 6:21:02 PM | DELETE_COMPLETE      | AWS::Lambda::Function                         | asg/DrainECSHook/Function (asgDrainECSHookFunction4A673AE9) 
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster |   8 | 6:21:03 PM | DELETE_IN_PROGRESS   | AWS::IAM::Policy                              | asg/DrainECSHook/Function/ServiceRole/DefaultPolicy (asgDrainECSHookFunctionServiceRoleDefaultPolicy4BFB0871) 
yuri-cluster |   9 | 6:21:04 PM | DELETE_COMPLETE      | AWS::IAM::Policy                              | asg/DrainECSHook/Function/ServiceRole/DefaultPolicy (asgDrainECSHookFunctionServiceRoleDefaultPolicy4BFB0871) 
yuri-cluster |   9 | 6:21:05 PM | DELETE_IN_PROGRESS   | AWS::IAM::Role                                | asg/DrainECSHook/Function/ServiceRole (asgDrainECSHookFunctionServiceRoleC052B966) 
yuri-cluster |  10 | 6:21:07 PM | DELETE_COMPLETE      | AWS::IAM::Role                                | asg/DrainECSHook/Function/ServiceRole (asgDrainECSHookFunctionServiceRoleC052B966) 
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster |  11 | 6:21:15 PM | DELETE_COMPLETE      | AWS::ECS::ClusterCapacityProviderAssociations | cluster/cluster (clusterA4C38409) 
yuri-cluster |  11 | 6:21:16 PM | DELETE_IN_PROGRESS   | AWS::ECS::CapacityProvider                    | asg-capacity-provider/asg-capacity-provider (asgcapacityprovider23F38F59) 
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster |  12 | 6:21:39 PM | DELETE_COMPLETE      | AWS::ECS::CapacityProvider                    | asg-capacity-provider/asg-capacity-provider (asgcapacityprovider23F38F59) 
yuri-cluster |  12 | 6:21:39 PM | DELETE_IN_PROGRESS   | AWS::AutoScaling::AutoScalingGroup            | asg/ASG (asgASG4D014670) 
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
 12 Currently in progress: yuri-cluster, asgASG4D014670
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster |  13 | 6:23:13 PM | DELETE_COMPLETE      | AWS::AutoScaling::AutoScalingGroup            | asg/ASG (asgASG4D014670) 
yuri-cluster |  13 | 6:23:14 PM | DELETE_IN_PROGRESS   | AWS::AutoScaling::LaunchConfiguration         | asg/LaunchConfig (asgLaunchConfig37FDE42B) 
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster |  14 | 6:23:16 PM | DELETE_COMPLETE      | AWS::AutoScaling::LaunchConfiguration         | asg/LaunchConfig (asgLaunchConfig37FDE42B) 
yuri-cluster |  14 | 6:23:17 PM | DELETE_IN_PROGRESS   | AWS::IAM::Policy                              | asg/InstanceRole/DefaultPolicy (asgInstanceRoleDefaultPolicyFF611E81) 
yuri-cluster |  14 | 6:23:17 PM | DELETE_IN_PROGRESS   | AWS::IAM::InstanceProfile                     | asg/InstanceProfile (asgInstanceProfile4E44E320) 
yuri-cluster |  14 | 6:23:17 PM | DELETE_IN_PROGRESS   | AWS::EC2::SecurityGroup                       | asg/InstanceSecurityGroup (asgInstanceSecurityGroup5CEB2975) 
yuri-cluster |  15 | 6:23:18 PM | DELETE_COMPLETE      | AWS::IAM::Policy                              | asg/InstanceRole/DefaultPolicy (asgInstanceRoleDefaultPolicyFF611E81) 
yuri-cluster |  16 | 6:23:18 PM | DELETE_COMPLETE      | AWS::EC2::SecurityGroup                       | asg/InstanceSecurityGroup (asgInstanceSecurityGroup5CEB2975) 
yuri-cluster |  16 | 6:23:19 PM | DELETE_IN_PROGRESS   | AWS::ECS::Cluster                             | cluster (cluster611F8AFF) 
yuri-cluster |  17 | 6:23:19 PM | DELETE_COMPLETE      | AWS::IAM::InstanceProfile                     | asg/InstanceProfile (asgInstanceProfile4E44E320) 
yuri-cluster |  17 | 6:23:19 PM | DELETE_IN_PROGRESS   | AWS::IAM::Role                                | asg/InstanceRole (asgInstanceRole8AC4201C) 
yuri-cluster |  17 | 6:23:20 PM | DELETE_FAILED        | AWS::ECS::Cluster                             | cluster (cluster611F8AFF) Resource handler returned message: "Error occurred during operation 'DeleteClusters SDK Error: The Cluster cannot be deleted while Container Instances are active or draining. (Service: AmazonECS; Status Code: 400; Error Code: ClusterContainsContainerInstancesException; Request ID: da1c2bd7-3b77-439e-bc7c-2df846bb453d; Proxy: null)'." (RequestToken: 911f3d22-37dc-31f0-036f-586d3c982188, HandlerErrorCode: GeneralServiceException)
yuri-cluster |  18 | 6:23:21 PM | DELETE_COMPLETE      | AWS::IAM::Role                                | asg/InstanceRole (asgInstanceRole8AC4201C) 
yuri-cluster |  18 | 6:23:21 PM | DELETE_FAILED        | AWS::CloudFormation::Stack                    | yuri-cluster The following resource(s) failed to delete: [cluster611F8AFF]. 

Failed resources:
yuri-cluster | 6:23:20 PM | DELETE_FAILED        | AWS::ECS::Cluster                             | cluster (cluster611F8AFF) Resource handler returned message: "Error occurred during operation 'DeleteClusters SDK Error: The Cluster cannot be deleted while Container Instances are active or draining. (Service: AmazonECS; Status Code: 400; Error Code: ClusterContainsContainerInstancesException; Request ID: da1c2bd7-3b77-439e-bc7c-2df846bb453d; Proxy: null)'." (RequestToken: 911f3d22-37dc-31f0-036f-586d3c982188, HandlerErrorCode: GeneralServiceException)
elliot-nelson commented 2 years ago

The solution suggested by @gshpychka works great for us. In our case, we were experiencing the same problem, not with a capacity provider but with a custom termination policy lambda.

Normally, the CDK wants to delete the ASG, which triggers a scale-in that waits for instances to terminate, but while that happens the CDK is dismantling the roles and permissions of the custom termination policy lambda, so it can no longer tell the ASG that any instances are safe to terminate.

In this case you can create the custom resource, then make it depend on the ASG. That forces your CR to be deleted before the ASG; the CR's delete call force-deletes the ASG, preventing it from calling the custom termination policy.

    // Assumes CDK v2: import * as cr from 'aws-cdk-lib/custom-resources';
    // Custom resource that force-deletes the ASG when the stack is deleted
    // (no SDK call on create or update).
    const asgForceDelete = new cr.AwsCustomResource(this, 'AsgForceDelete', {
      onDelete: {
        service: 'AutoScaling',
        action: 'deleteAutoScalingGroup',
        parameters: {
          AutoScalingGroupName: this.autoScalingGroup.autoScalingGroupName,
          ForceDelete: true
        }
      },
      policy: cr.AwsCustomResourcePolicy.fromSdkCalls({
        resources: cr.AwsCustomResourcePolicy.ANY_RESOURCE
      })
    });
    // The CR depends on the ASG, so on stack deletion the CR runs first and
    // force-deletes the ASG before CloudFormation tries to remove it.
    asgForceDelete.node.addDependency(this.autoScalingGroup);

frjonsen commented 1 year ago

The solution above mostly works. Note that if any change is made to the custom resource that causes it to be deleted and recreated, it will also force-delete the ASG, which will of course not be recreated, leaving the stack in a drifted state.

ryparker commented 1 year ago

After digging into this and reading through the mentioned CloudFormation issue, it seems to me that this is a situation that CloudFormation is working to fix and improve. At the very least, we should be getting a relatively quick error from CloudFormation rather than having to wait for the timeout. From my research it wasn't clear whether CloudFormation intends for ASGs configured with managedTerminationProtection: 'ENABLED' to be automatically cleaned up by CloudFormation. It may turn out that they decide to require manually disabling the instances' scale-in protection, similar to how non-empty S3 buckets are handled by CloudFormation today. If we can get a definitive answer on this and that turns out to be the case, then we should probably look into adding an opt-in custom resource for ASG cleanup (similar to how we handle auto-deleting objects in a Bucket via autoDeleteObjects).

In the meantime I've created a PR that improves some of our documentation for the enableManaged* options. I've also added a note about the delete behavior to the ECS README (the ECS overview doc page), along with a link to this issue for anyone who is interested in workarounds such as the custom resource solution that @elliot-nelson suggested (thanks for sharing!).

github-actions[bot] commented 1 year ago

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

Ten0 commented 1 year ago

@ryparker I think in "Related to but does not fix: https://github.com/aws/aws-cdk/issues/18179" the bot may have picked up "fix: https://github.com/aws/aws-cdk/issues/18179" ^^ The issue should probably be reopened.

nathanpeck commented 8 months ago

Hey all, I've created a reference CloudFormation template that demonstrates how to avoid this issue. The end-to-end solution for the capacity provider with a working teardown can be found here: https://containersonaws.com/pattern/ecs-ec2-capacity-provider-scaling

You can also refer directly to the sample code for the Lambda function here: https://github.com/aws-samples/container-patterns/blob/main/pattern/ecs-ec2-capacity-provider-scaling/files/cluster-capacity-provider.yml#L48-L123

In short, this solution implements a custom ASG destroyer resource, which force-deletes the ASG so that it does not block the CloudFormation stack teardown.

simi-obs commented 2 months ago

Hello there fellas, I was using the workaround of force-deleting the ASG with a custom resource for some time and it worked great.

Lately (the last few weeks), I have started to get the following error:

Resource handler returned message: "Error occurred during operation 'DeleteClusters SDK Error: The Cluster cannot be deleted while Container Instances are active or draining. (Service: AmazonECS; Status Code: 400; Error Code: ClusterContainsContainerInstancesException

How is this possible? The ASG is fully deleted before the cluster deletion is initiated (I can see it in the CloudFormation events, and the ASG resource depends on the cluster). If the ASG is deleted, all of its instances should be deleted as well.

See the attached screenshot of CF events as well

What is with the sudden behavior change?