aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

aws_ecs: "Resource timed out waiting for completion" error during stack deletion #25969

Open IllarionovDimitri opened 1 year ago

IllarionovDimitri commented 1 year ago

Describe the bug

I am running an ECS cluster with an ASG as the capacity provider (because of GPU workloads) on a single EC2 instance.

To avoid application downtime during ECS task updates, I set enable_managed_scaling=True in ecs.AsgCapacityProvider(). The goal is for ECS to first spin up a new instance and place the task on it, and only then deregister and terminate the previous instance.

Enabling managed scaling adds two CloudWatch alarms behind the scenes.

Screenshot 2023-06-13 at 16:13:12

The problem is that instance termination now happens only after 15 minutes, according to the alarm setting. During stack deletion I get a "Resource timed out waiting for completion" error, which breaks the CI/CD pipeline that manages the stacks.

I have not found a way to override the 15-minute setting in the template, since this is how the capacity provider looks there:

"felgandev7ecsclusterstackfelgandev7capacityprovider2150902F": {
   "Type": "AWS::ECS::CapacityProvider",
   "Properties": {
    "AutoScalingGroupProvider": {
     "AutoScalingGroupArn": {
      "Ref": "felgandev7asgstackfelgandev7asgASG4A2CB50E"
     },
     "ManagedScaling": {
      "Status": "ENABLED",
      "TargetCapacity": 100
     },
     "ManagedTerminationProtection": "DISABLED"
    },
    "Name": "felgan-dev-7-capacity-provider",
    "Tags": [
     {
      "Key": "project",
      "Value": "felgan"
     },
     {
      "Key": "stack",
      "Value": "storage-stack"
     }
    ]
   },
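
The template above carries only the ManagedScaling block; the two alarms are not part of it, which is why there is nothing to override there. As a rough sketch of how the delay could be inspected, the helper below computes each alarm's evaluation window (Period × EvaluationPeriods) from records shaped like the `MetricAlarms` entries returned by CloudWatch `describe_alarms`. The function name `alarm_windows`, the `TargetTracking-` name prefix, and the sample records are assumptions made for illustration, not values taken from this stack.

```python
from typing import Dict, Iterable


def alarm_windows(alarms: Iterable[dict], name_prefix: str = "TargetTracking-") -> Dict[str, int]:
    """Map alarm name to its evaluation window in seconds.

    Only alarms whose name starts with name_prefix are considered.
    Alarms without a top-level Period or EvaluationPeriods are skipped.
    """
    windows = {}
    for alarm in alarms:
        name = alarm.get("AlarmName", "")
        if not name.startswith(name_prefix):
            continue
        period = alarm.get("Period")
        evaluation_periods = alarm.get("EvaluationPeriods")
        if period is None or evaluation_periods is None:
            continue
        windows[name] = period * evaluation_periods
    return windows


# Made-up sample records for illustration only (hypothetical names and values).
sample_alarms = [
    {"AlarmName": "TargetTracking-example-asg-AlarmHigh-0000", "Period": 60, "EvaluationPeriods": 1},
    {"AlarmName": "TargetTracking-example-asg-AlarmLow-0000", "Period": 60, "EvaluationPeriods": 15},
    {"AlarmName": "unrelated-alarm", "Period": 300, "EvaluationPeriods": 2},
]

print(alarm_windows(sample_alarms))
```

With credentials available, the input could come from `boto3.client("cloudwatch").describe_alarms(AlarmNamePrefix="TargetTracking-")["MetricAlarms"]`.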

Expected Behavior

Enabling managed scaling for an ASG capacity provider in ECS either does not collide with stack timeouts, or there is a way to alter the CloudWatch alarms (e.g. lower the 15-minute threshold) via CDK.

Current Behavior

During stack deletion with enable_managed_scaling=True in ecs.AsgCapacityProvider(), a "Resource timed out waiting for completion" error is raised and the stack deletion fails.

Reproduction Steps

Reproducing the issue requires deploying many components. Since the stack is up and running, I can provide further information if needed.

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.83.0

Framework Version

No response

Node.js Version

18

OS

Ubuntu 20.04 LTS

Language

Python

Language Version

3.1.0.6

Other information

No response

pahud commented 1 year ago

Sounds like it happens when you delete the stack. Where did you see the "Resource timed out waiting for completion" error? Is it from CloudFormation? I am wondering which resource timed out waiting for completion. Could you share more screenshots? Any additional ones would be helpful.
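
One way to answer the "which resource timed out" question without screenshots is to filter the stack's events for failed deletions. The sketch below works on records shaped like the `StackEvents` entries returned by CloudFormation `describe_stack_events`; the helper name `failed_deletions` and the sample events are invented for illustration and are not taken from the reporter's stack.

```python
from typing import Iterable, List, Tuple


def failed_deletions(events: Iterable[dict]) -> List[Tuple[str, str, str]]:
    """Return (logical id, resource type, reason) for every DELETE_FAILED event."""
    return [
        (
            event.get("LogicalResourceId", ""),
            event.get("ResourceType", ""),
            event.get("ResourceStatusReason", ""),
        )
        for event in events
        if event.get("ResourceStatus") == "DELETE_FAILED"
    ]


# Made-up sample events for illustration only (hypothetical resources).
sample_events = [
    {
        "LogicalResourceId": "ExampleResourceA",
        "ResourceType": "AWS::AutoScaling::AutoScalingGroup",
        "ResourceStatus": "DELETE_FAILED",
        "ResourceStatusReason": "Resource timed out waiting for completion",
    },
    {
        "LogicalResourceId": "ExampleResourceB",
        "ResourceType": "AWS::ECS::Cluster",
        "ResourceStatus": "DELETE_COMPLETE",
    },
]

for logical_id, resource_type, reason in failed_deletions(sample_events):
    print(f"{logical_id} ({resource_type}): {reason}")
```

With credentials available, the input could come from `boto3.client("cloudformation").describe_stack_events(StackName=...)["StackEvents"]`, with the stack name filled in.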

IllarionovDimitri commented 1 year ago

Yes, as mentioned in the title, the timeout occurs during stack deletion.

Here is the very first failure during stack deletion:

Screenshot 2023-06-15 at 10:01:01

Here is how I define the capacity provider:

ecs.AsgCapacityProvider(
    self,
    f"{config.ID}-capacity-provider",
    capacity_provider_name=f"{config.ID}-capacity-provider",
    enable_managed_scaling=True,
    enable_managed_termination_protection=False,
    auto_scaling_group=asg,
)

The issue appeared after I set enable_managed_scaling=True. This setting adds two CloudWatch alarms; one of them delays instance termination by 15 minutes, and it cannot be overridden in the template or in CDK.

Screenshot 2023-06-15 at 09:12:13

IllarionovDimitri commented 1 year ago

OK, since nothing else worked, I had to implement a workaround based on a custom resource:

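# Fragment from inside the stack's __init__; asg, ecs_cluster and config are defined elsewhere.
from aws_cdk import custom_resources as cr

# Parameters for the Auto Scaling DeleteAutoScalingGroup API call.
# ForceDelete removes the group together with its instances.
asg_parameters = {
    "AutoScalingGroupName": asg.auto_scaling_group_name,
    "ForceDelete": True,
}

asg_sdk_call_params = {
    "action": "deleteAutoScalingGroup",
    "service": "AutoScaling",
    "parameters": asg_parameters,
    "physical_resource_id": cr.PhysicalResourceId.of(asg.node.id),
}

# Custom resource that issues the SDK call only when the stack is deleted.
asg_force_delete = cr.AwsCustomResource(
    self,
    f"{config.ID}-cr-delete-asg",
    install_latest_aws_sdk=False,
    on_delete=cr.AwsSdkCall(**asg_sdk_call_params),
    policy=cr.AwsCustomResourcePolicy.from_sdk_calls(
        resources=cr.AwsCustomResourcePolicy.ANY_RESOURCE
    ),
)

# The custom resource depends on the ASG and the cluster, so on stack
# deletion it is removed first and the force delete runs before them.
asg_force_delete.node.add_dependency(asg)
asg_force_delete.node.add_dependency(ecs_cluster)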
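
For reference, the on_delete call in the workaround corresponds roughly to the following direct SDK call. This is only a sketch: the helper name `force_delete_asg` and the group name in the usage comment are hypothetical, and the client is passed in so the function can be exercised without AWS credentials.

```python
def force_delete_asg(autoscaling_client, asg_name: str) -> None:
    """Delete an Auto Scaling group together with its instances.

    autoscaling_client is expected to behave like boto3.client("autoscaling").
    ForceDelete=True asks Auto Scaling to delete the group without waiting
    for its instances to terminate first.
    """
    autoscaling_client.delete_auto_scaling_group(
        AutoScalingGroupName=asg_name,
        ForceDelete=True,
    )


# Usage with real credentials (hypothetical group name):
#   import boto3
#   force_delete_asg(boto3.client("autoscaling"), "example-asg")
```

Note that a force delete terminates the instances immediately, so it only fits cases like this one, where the whole stack is being torn down anyway.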