[Open] metametadata opened 2 years ago
I see the same thing. It hangs for a LONG time, then finally fails. To workaround it I have to go manually terminate the ECS EC2 instance. If I had to guess it seems related to not being able to shutdown all (or "the last"?) instances properly (note I only had one at the time).
Hi everyone, we have the same issue, not just when deleting a cluster, but also when trying to update the AMI ID used for the cluster. Updating the MachineImage in the ASG leads to a new LaunchConfiguration and therefore a new Auto Scaling group. Is there any way around this? Or do we have to write a custom resource to enable and disable termination protection on demand?
@fschollmeyer I have this same issue, did you manage to find a workaround?
Hello. The issue for us comes from the following:
Attempting to remove a cluster through CloudFormation while there are still EC2 instances running results in a failure, with the instances running perpetually.
The stack is created in dependency order (and deleted in reverse order).
On removal, deleting the capacity provider associations and capacity providers means losing managed termination protection, so currently running instances stay perpetually protected, preventing the stack from being removed.
Our current workaround is to set `DeletionPolicy: Retain` on the `AWS::ECS::ClusterCapacityProviderAssociations` resource. This makes the deletion fail the first time, because the capacity providers can't be removed while still referenced by the cluster; the ASG deletion then succeeds because managed termination protection still works, and the cluster deletion succeeds as well (effectively removing the `ClusterCapacityProviderAssociations`). A subsequent attempt to remove the stack then removes the leftover, unbound capacity providers successfully.
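In raw CloudFormation terms, this workaround amounts to a template fragment along these lines (note the valid policy value is `Retain`; the logical IDs and referenced resources here are illustrative, not taken from the thread):

```yaml
# Illustrative sketch: retain the association on stack deletion so managed
# termination protection keeps working while the ASG scales in.
ClusterCapacityProviderAssociations:
  Type: AWS::ECS::ClusterCapacityProviderAssociations
  DeletionPolicy: Retain
  Properties:
    Cluster: !Ref Cluster                 # assumed AWS::ECS::Cluster resource
    CapacityProviders:
      - !Ref CapacityProvider             # assumed AWS::ECS::CapacityProvider resource
    DefaultCapacityProviderStrategy: []
```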
A solution that seems to work for me is to create a custom resource that calls `deleteAutoScalingGroup` on delete (a no-op on create and update), and make the capacity provider depend on the custom resource.
Shouldn't it be the opposite, that resource depending on the capacity provider association, so that the ASG gets removed before the capacity provider is de-associated from the cluster?
The ASG can't be removed if it's still in use by a capacity provider attached to a cluster. The service gets deleted, then the capacity provider, then the association, then the CR force-deletes the ASG.
Hmm that's weird, both the issue and solution are the opposite way for me. Looks like this order would be the natural order CF would do without a custom resource. In my case, removing the ASG before the capacity providers works, and is even what enables it to properly remove despite instance termination protection. (https://github.com/aws/aws-cdk/issues/18179#issuecomment-1061849588)
Yes, the order is not the issue. The issue is that CloudFormation doesn't force-delete the ASG, so the deletion fails if there are instances protected from scale-in. The custom resource force-deletes the ASG, terminating all instances, including protected ones.
My solution doesn't require retrying the deletion, it works in a single pass.
FWIW, I've just noticed that my cluster deletion failed quickly with `DeleteClusters SDK Error: The Cluster cannot be deleted while Container Instances are active or draining.`, which is a bit better than hanging. But it also means that the workaround I described in the first message apparently does not help.
My ASG has `enableManagedTerminationProtection = false`.
Versions:
ᐅ cdk --version
2.22.0 (build 1db4b16)
software.amazon.awssdk/ecs "2.17.181"
yuri-cluster | 0 | 6:20:41 PM | DELETE_IN_PROGRESS | AWS::CloudFormation::Stack | yuri-cluster User Initiated
yuri-cluster | 0 | 6:20:44 PM | DELETE_IN_PROGRESS | AWS::SNS::Subscription | asg/DrainECSHook/Function/Topic (asgDrainECSHookFunctionTopicFFE1E612)
yuri-cluster | 0 | 6:20:44 PM | DELETE_IN_PROGRESS | AWS::Lambda::Permission | asg/DrainECSHook/Function/AllowInvoke:yuriclusterasgLifecycleHookDrainHookTopic7A175731 (asgDrainECSHookFunctionAllowInvokeyuriclusterasgLifecycleHookDrainHookTopic7A175731F8622528)
yuri-cluster | 0 | 6:20:44 PM | DELETE_IN_PROGRESS | AWS::ECS::ClusterCapacityProviderAssociations | cluster/cluster (clusterA4C38409)
yuri-cluster | 0 | 6:20:44 PM | DELETE_IN_PROGRESS | AWS::AutoScaling::LifecycleHook | asg/LifecycleHookDrainHook (asgLifecycleHookDrainHook7D987AD1)
yuri-cluster | 0 | 6:20:44 PM | DELETE_IN_PROGRESS | AWS::EC2::SecurityGroupIngress | asg/InstanceSecurityGroup/from yurinetworkbastionsg3D24DB3C:ALL TRAFFIC (asgInstanceSecurityGroupfromyurinetworkbastionsg3D24DB3CALLTRAFFIC8742D7F0)
yuri-cluster | 1 | 6:20:44 PM | DELETE_COMPLETE | AWS::SNS::Subscription | asg/DrainECSHook/Function/Topic (asgDrainECSHookFunctionTopicFFE1E612)
yuri-cluster | 2 | 6:20:45 PM | DELETE_COMPLETE | AWS::EC2::SecurityGroupIngress | asg/InstanceSecurityGroup/from yurinetworkbastionsg3D24DB3C:ALL TRAFFIC (asgInstanceSecurityGroupfromyurinetworkbastionsg3D24DB3CALLTRAFFIC8742D7F0)
yuri-cluster | 3 | 6:20:46 PM | DELETE_COMPLETE | AWS::AutoScaling::LifecycleHook | asg/LifecycleHookDrainHook (asgLifecycleHookDrainHook7D987AD1)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster | 3 | 6:20:46 PM | DELETE_IN_PROGRESS | AWS::IAM::Policy | asg/LifecycleHookDrainHook/Role/DefaultPolicy (asgLifecycleHookDrainHookRoleDefaultPolicy0B1C44ED)
yuri-cluster | 4 | 6:20:47 PM | DELETE_COMPLETE | AWS::IAM::Policy | asg/LifecycleHookDrainHook/Role/DefaultPolicy (asgLifecycleHookDrainHookRoleDefaultPolicy0B1C44ED)
yuri-cluster | 4 | 6:20:48 PM | DELETE_IN_PROGRESS | AWS::IAM::Role | asg/LifecycleHookDrainHook/Role (asgLifecycleHookDrainHookRole3C1C981B)
yuri-cluster | 5 | 6:20:49 PM | DELETE_COMPLETE | AWS::IAM::Role | asg/LifecycleHookDrainHook/Role (asgLifecycleHookDrainHookRole3C1C981B)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster | 6 | 6:20:54 PM | DELETE_COMPLETE | AWS::Lambda::Permission | asg/DrainECSHook/Function/AllowInvoke:yuriclusterasgLifecycleHookDrainHookTopic7A175731 (asgDrainECSHookFunctionAllowInvokeyuriclusterasgLifecycleHookDrainHookTopic7A175731F8622528)
yuri-cluster | 6 | 6:20:55 PM | DELETE_IN_PROGRESS | AWS::SNS::Topic | asg/LifecycleHookDrainHook/Topic (asgLifecycleHookDrainHookTopicC6CABF48)
yuri-cluster | 6 | 6:20:55 PM | DELETE_IN_PROGRESS | AWS::Lambda::Function | asg/DrainECSHook/Function (asgDrainECSHookFunction4A673AE9)
yuri-cluster | 7 | 6:20:55 PM | DELETE_COMPLETE | AWS::SNS::Topic | asg/LifecycleHookDrainHook/Topic (asgLifecycleHookDrainHookTopicC6CABF48)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster | 8 | 6:21:02 PM | DELETE_COMPLETE | AWS::Lambda::Function | asg/DrainECSHook/Function (asgDrainECSHookFunction4A673AE9)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster | 8 | 6:21:03 PM | DELETE_IN_PROGRESS | AWS::IAM::Policy | asg/DrainECSHook/Function/ServiceRole/DefaultPolicy (asgDrainECSHookFunctionServiceRoleDefaultPolicy4BFB0871)
yuri-cluster | 9 | 6:21:04 PM | DELETE_COMPLETE | AWS::IAM::Policy | asg/DrainECSHook/Function/ServiceRole/DefaultPolicy (asgDrainECSHookFunctionServiceRoleDefaultPolicy4BFB0871)
yuri-cluster | 9 | 6:21:05 PM | DELETE_IN_PROGRESS | AWS::IAM::Role | asg/DrainECSHook/Function/ServiceRole (asgDrainECSHookFunctionServiceRoleC052B966)
yuri-cluster | 10 | 6:21:07 PM | DELETE_COMPLETE | AWS::IAM::Role | asg/DrainECSHook/Function/ServiceRole (asgDrainECSHookFunctionServiceRoleC052B966)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster | 11 | 6:21:15 PM | DELETE_COMPLETE | AWS::ECS::ClusterCapacityProviderAssociations | cluster/cluster (clusterA4C38409)
yuri-cluster | 11 | 6:21:16 PM | DELETE_IN_PROGRESS | AWS::ECS::CapacityProvider | asg-capacity-provider/asg-capacity-provider (asgcapacityprovider23F38F59)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster | 12 | 6:21:39 PM | DELETE_COMPLETE | AWS::ECS::CapacityProvider | asg-capacity-provider/asg-capacity-provider (asgcapacityprovider23F38F59)
yuri-cluster | 12 | 6:21:39 PM | DELETE_IN_PROGRESS | AWS::AutoScaling::AutoScalingGroup | asg/ASG (asgASG4D014670)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
12 Currently in progress: yuri-cluster, asgASG4D014670
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster | 13 | 6:23:13 PM | DELETE_COMPLETE | AWS::AutoScaling::AutoScalingGroup | asg/ASG (asgASG4D014670)
yuri-cluster | 13 | 6:23:14 PM | DELETE_IN_PROGRESS | AWS::AutoScaling::LaunchConfiguration | asg/LaunchConfig (asgLaunchConfig37FDE42B)
Stack yuri-cluster has an ongoing operation in progress and is not stable (DELETE_IN_PROGRESS)
yuri-cluster | 14 | 6:23:16 PM | DELETE_COMPLETE | AWS::AutoScaling::LaunchConfiguration | asg/LaunchConfig (asgLaunchConfig37FDE42B)
yuri-cluster | 14 | 6:23:17 PM | DELETE_IN_PROGRESS | AWS::IAM::Policy | asg/InstanceRole/DefaultPolicy (asgInstanceRoleDefaultPolicyFF611E81)
yuri-cluster | 14 | 6:23:17 PM | DELETE_IN_PROGRESS | AWS::IAM::InstanceProfile | asg/InstanceProfile (asgInstanceProfile4E44E320)
yuri-cluster | 14 | 6:23:17 PM | DELETE_IN_PROGRESS | AWS::EC2::SecurityGroup | asg/InstanceSecurityGroup (asgInstanceSecurityGroup5CEB2975)
yuri-cluster | 15 | 6:23:18 PM | DELETE_COMPLETE | AWS::IAM::Policy | asg/InstanceRole/DefaultPolicy (asgInstanceRoleDefaultPolicyFF611E81)
yuri-cluster | 16 | 6:23:18 PM | DELETE_COMPLETE | AWS::EC2::SecurityGroup | asg/InstanceSecurityGroup (asgInstanceSecurityGroup5CEB2975)
yuri-cluster | 16 | 6:23:19 PM | DELETE_IN_PROGRESS | AWS::ECS::Cluster | cluster (cluster611F8AFF)
yuri-cluster | 17 | 6:23:19 PM | DELETE_COMPLETE | AWS::IAM::InstanceProfile | asg/InstanceProfile (asgInstanceProfile4E44E320)
yuri-cluster | 17 | 6:23:19 PM | DELETE_IN_PROGRESS | AWS::IAM::Role | asg/InstanceRole (asgInstanceRole8AC4201C)
yuri-cluster | 17 | 6:23:20 PM | DELETE_FAILED | AWS::ECS::Cluster | cluster (cluster611F8AFF) Resource handler returned message: "Error occurred during operation 'DeleteClusters SDK Error: The Cluster cannot be deleted while Container Instances are active or draining. (Service: AmazonECS; Status Code: 400; Error Code: ClusterContainsContainerInstancesException; Request ID: da1c2bd7-3b77-439e-bc7c-2df846bb453d; Proxy: null)'." (RequestToken: 911f3d22-37dc-31f0-036f-586d3c982188, HandlerErrorCode: GeneralServiceException)
yuri-cluster | 18 | 6:23:21 PM | DELETE_COMPLETE | AWS::IAM::Role | asg/InstanceRole (asgInstanceRole8AC4201C)
yuri-cluster | 18 | 6:23:21 PM | DELETE_FAILED | AWS::CloudFormation::Stack | yuri-cluster The following resource(s) failed to delete: [cluster611F8AFF].
Failed resources:
yuri-cluster | 6:23:20 PM | DELETE_FAILED | AWS::ECS::Cluster | cluster (cluster611F8AFF) Resource handler returned message: "Error occurred during operation 'DeleteClusters SDK Error: The Cluster cannot be deleted while Container Instances are active or draining. (Service: AmazonECS; Status Code: 400; Error Code: ClusterContainsContainerInstancesException; Request ID: da1c2bd7-3b77-439e-bc7c-2df846bb453d; Proxy: null)'." (RequestToken: 911f3d22-37dc-31f0-036f-586d3c982188, HandlerErrorCode: GeneralServiceException)
The solution suggested by @gshpychka works great for us. In our case, we were experiencing the same problem, not with a capacity provider but with a custom termination policy lambda.
Normally, CloudFormation wants to delete the ASG, which triggers a scale-in that waits for instances to terminate; but while that happens, CloudFormation is also dismantling the roles and permissions of the custom termination policy Lambda, so the Lambda can no longer tell the ASG that any instances are safe to terminate.
In this case you can create the custom resource, then make it depend on the ASG. That forces your CR to be deleted before the ASG, which force-deletes the ASG, preventing it from calling the custom termination policy.
```ts
const asgForceDelete = new cr.AwsCustomResource(this, 'AsgForceDelete', {
  onDelete: {
    service: 'AutoScaling',
    action: 'deleteAutoScalingGroup',
    parameters: {
      AutoScalingGroupName: this.autoScalingGroup.autoScalingGroupName,
      ForceDelete: true,
    },
  },
  policy: cr.AwsCustomResourcePolicy.fromSdkCalls({
    resources: cr.AwsCustomResourcePolicy.ANY_RESOURCE,
  }),
});

asgForceDelete.node.addDependency(this.autoScalingGroup);
```
The solution above mostly works. Note that if any change causes the custom resource to be deleted and recreated, its delete handler will also delete the ASG (which will of course not be recreated), leaving the stack drifted.
After digging into this and reading through the mentioned CloudFormation issue, it seems to me like this is a situation that CloudFormation is working to fix and improve. At the very least we should be getting a relatively quick error from CloudFormation rather than having to wait for the timeout. From my research it wasn't clear to me whether CloudFormation intends for ASGs configured with `managedTerminationProtection: 'ENABLED'` to be automatically cleaned up by CloudFormation. It may turn out that they decide to require manually disabling instances' scale-in protection, similar to how non-empty S3 buckets are handled by CloudFormation today. If we can get a definitive answer on this and it turns out to be the case, then we should probably look into adding an opt-in custom resource for ASG cleanup (similar to how we handle auto-deleting objects in a `Bucket` via `autoDeleteObjects`).
In the meantime I've created a PR that improves some of our documentation for the `enableManaged*` options. I've also added a note on the delete behavior to the ECS README (the ECS overview doc page) and a link to this issue for anyone who is interested in workarounds such as the custom resource solution that @elliot-nelson suggested (thanks for sharing!)
Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.
@ryparker I think in "Related to but does not fix: https://github.com/aws/aws-cdk/issues/18179" the bot may have captured "fix: https://github.com/aws/aws-cdk/issues/18179" ^^ Issue should probably be reopened.
Hey all, I've created a reference CloudFormation template that demonstrates how to avoid this issue. The end to end solution for the capacity provider with working teardown can be found here: https://containersonaws.com/pattern/ecs-ec2-capacity-provider-scaling
You can also refer directly to the sample code for the Lambda function here: https://github.com/aws-samples/container-patterns/blob/main/pattern/ecs-ec2-capacity-provider-scaling/files/cluster-capacity-provider.yml#L48-L123
In short, this solution implements a custom ASG destroyer resource, which force-deletes the ASG so that it does not block the CloudFormation stack teardown.
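The wiring for such a destroyer looks roughly like the fragment below (a sketch only; the logical IDs, the custom resource type name, and the Lambda are assumptions, not copied from the linked template, which should be treated as the authoritative version):

```yaml
# Illustrative sketch: a custom resource backed by a destroyer Lambda that
# force-deletes the ASG during stack teardown, terminating all instances.
ASGDestroyer:
  Type: Custom::AsgDestroyer
  Properties:
    ServiceToken: !GetAtt ASGDestroyerFunction.Arn   # assumed Lambda that calls DeleteAutoScalingGroup with ForceDelete
    AutoScalingGroupName: !Ref ECSAutoScalingGroup   # assumed ASG resource
```

The dependency graph then ensures the destroyer's delete handler runs before CloudFormation tries to delete the cluster, so no container instances are left active or draining.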
Hello there fellas, I was using the workaround of force-deleting the ASG via a custom resource for some time and it worked great.
Lately (the last few weeks), I have started to get the following error:
Resource handler returned message: "Error occurred during operation 'DeleteClusters SDK Error: The Cluster cannot be deleted while Container Instances are active or draining. (Service: AmazonECS; Status Code: 400; Error Code: ClusterContainsContainerInstancesException"
How is this possible? The ASG is fully deleted before the cluster delete is initiated (I can see it in the CloudFormation events, and the ASG resource depends on the cluster). If the ASG is deleted, all of the instances should be deleted as well.
See the attached screenshot of CF events as well.
What is with the sudden behavior change?
What is the problem?
The deletion of a stack with `AsgCapacityProvider` hangs unexpectedly. This is surprising, as we didn't have such an issue with the now-deprecated `addCapacity`, and we have no ECS tasks in the ASG when we delete the stack. The behaviour seems to be caused by the default `enableManagedTerminationProtection = true`. See the discussion in the original closed issue and my unaddressed comment: https://github.com/aws/aws-cdk/issues/14732#issuecomment-991402770.
Reproduction Steps
Please see https://github.com/aws/aws-cdk/issues/14732.
In short, try to delete a stack with an ECS cluster which uses the `AsgCapacityProvider` defaults.
What did you expect to happen?
Either:
What actually happened?
The CF stack got stuck in DELETE_IN_PROGRESS.
CDK CLI Version
2.3.0
Framework Version
2.3.0
Node.js Version
v16.8.0
OS
macOS
Language
Java
Language Version
11.0.8
Other information
Workaround
My current workaround: set `AsgCapacityProvider` `enableManagedTerminationProtection = false`.
Documentation questions/enhancement requests
From https://docs.aws.amazon.com/cdk/api/latest/docs/aws-ecs-readme.html (emphasis mine):
1) It's not fully clear from the description that the flag effectively blocks deletion of the ASG. I got the incorrect impression that it somehow cleverly understands that there are no ECS tasks running and allows deletion in that case.
2) What are the risks of turning this protection off? E.g. we don't want ECS tasks to shut down at random times.
3) Is it OK to set `enableManagedTerminationProtection=false` + `enableManagedScaling=true`? It seems to work but is against the documentation ("If you want to disable this behavior, set both enableManagedScaling to and enableManagedTerminationProtection to false.").
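For reference, the workaround corresponds to a construct configuration along these lines (a sketch in CDK TypeScript; the construct IDs and the surrounding `cluster`/`autoScalingGroup` variables are assumptions, not taken from the issue):

```ts
// Sketch: disable managed termination protection so CloudFormation can
// scale the ASG in and delete it during stack teardown. Managed scaling
// stays enabled here, which is reported to work despite the docs' wording.
const capacityProvider = new ecs.AsgCapacityProvider(this, 'AsgCapacityProvider', {
  autoScalingGroup,                           // assumed existing AutoScalingGroup
  enableManagedScaling: true,
  enableManagedTerminationProtection: false,  // default is true
});
cluster.addAsgCapacityProvider(capacityProvider);
```

The trade-off named in question 2 applies: with protection off, a scale-in event may terminate an instance that is still running ECS tasks.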