aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

(aws-ecs): can't update ECS cluster because of "capacity provider is in use" #15366

Open tehGoti opened 3 years ago

tehGoti commented 3 years ago

A previously (before 1.108.0) created ECS cluster with FARGATE and FARGATE_SPOT capacity providers can't be updated anymore because of this error:

Error occurred during operation 'putClusterCapacityProviders SDK error: The specified capacity provider is in use and cannot be removed.

Reproduction Steps

Create an ECS cluster with

capacityProviders: ["FARGATE", "FARGATE_SPOT"],

with a CDK version prior to 1.108.0

Then upgrade CDK to the latest version (1.110.1) and try to update the stack to

enableFargateCapacityProviders: true,

(because the old property is deprecated)

What did you expect to happen?

Capacity providers should be set correctly and the cluster should be updated.

What actually happened?

Resource handler returned message: "Error occurred during operation 'putClusterCapacityProviders SDK error: The specified capacity provider is in use and cannot be removed. (Service: AmazonECS; Status Code: 400; Error Code: ResourceInUseException; Request ID: ee1d2754-517d-4148-bbb1-25d0f8a1aa88; Proxy: null)'." (RequestToken: 3d87b953-a468-bffc-d442-862a457b1f9b, HandlerErrorCode: GeneralServiceException)

Environment

Other

Probably caused by https://github.com/aws/aws-cdk/commit/6b2d0e0c867651cd632be9ca99c6e342fb3c1067


This is a 🐛 Bug Report

luiszimmermann commented 3 years ago

I have this issue when upgrading to 1.111.0 from 1.108.0:

Resource handler returned message: "Error occurred during operation 'putClusterCapacityProviders SDK error: The specified capacity provider is in use and cannot be removed. (Service: AmazonECS; Status Code: 400; Error Code: ResourceInUseException; Request ID: 7eb2279f-f05e-41ea-8aa3-bfaa73f49092; Proxy: null)'." (RequestToken: 2d89c59c-3ea3-7761-b679-df734d90b08d, HandlerErrorCode: GeneralServiceException)

I was already using enable_fargate_capacity_providers=True in the ecs.Cluster, so the error appears without any change to the code, only the upgrade.

tobias-nawa commented 3 years ago

Same here: putClusterCapacityProviders SDK error: The specified capacity provider is in use and cannot be removed.

Here is the important part of the diff which is causing this:

[+] AWS::ECS::ClusterCapacityProviderAssociations 

[~] AWS::ECS::Cluster 
 └─ [-] CapacityProviders
  └─ ["FARGATE","FARGATE_SPOT"]

This has been happening since I updated CDK from 1.104.0 to 1.111.0. I made no changes to my code. The result is that I can no longer deploy.
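The pattern in that diff can be spotted mechanically before deploying. Below is a minimal sketch, assuming you have the previously deployed template and the newly synthesized one available as parsed JSON. The types and the helper name `findAffectedClusters` are hypothetical and not part of the CDK. It only flags clusters whose CapacityProviders property disappears in the same update that introduces an AWS::ECS::ClusterCapacityProviderAssociations resource.

```typescript
// Minimal shape of a CloudFormation template; only the fields this sketch reads.
interface TemplateResource {
  Type: string;
  Properties?: Record<string, unknown>;
}

interface CfnTemplate {
  Resources: Record<string, TemplateResource>;
}

const CLUSTER_TYPE = 'AWS::ECS::Cluster';
const ASSOCIATIONS_TYPE = 'AWS::ECS::ClusterCapacityProviderAssociations';

function hasResourceOfType(template: CfnTemplate, type: string): boolean {
  return Object.values(template.Resources).some((res) => res.Type === type);
}

// Hypothetical helper: returns the logical IDs of clusters that lose their
// CapacityProviders property while an associations resource is newly added.
function findAffectedClusters(oldTemplate: CfnTemplate, newTemplate: CfnTemplate): string[] {
  const associationsAdded =
    !hasResourceOfType(oldTemplate, ASSOCIATIONS_TYPE) &&
    hasResourceOfType(newTemplate, ASSOCIATIONS_TYPE);
  if (!associationsAdded) {
    return [];
  }
  return Object.keys(oldTemplate.Resources).filter((id) => {
    const oldRes = oldTemplate.Resources[id];
    const newRes = newTemplate.Resources[id];
    return (
      oldRes?.Type === CLUSTER_TYPE &&
      oldRes.Properties?.['CapacityProviders'] !== undefined &&
      newRes?.Type === CLUSTER_TYPE &&
      newRes.Properties?.['CapacityProviders'] === undefined
    );
  });
}
```

A non-empty result means the update matches the diff shown above and is likely to hit the same error.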

madeline-k commented 3 years ago

Looks like this is a breaking change and we should try to fix this as soon as possible. I am going to try to repro it.

flavioleggio commented 3 years ago

Same here. Wouldn't it have been easier to add a new method and simply deprecate enableFargateCapacityProviders?

Dzhuneyt commented 3 years ago

I've encountered the same issue recently and it's breaking my deployment.

The previous version was based on various CDK packages, all at version 1.109.0. The code looked like this and deployed successfully:

    this.cluster = new Cluster(this, 'Cluster', {
      vpc,
      capacityProviders: ["FARGATE", "FARGATE_SPOT"],
    });

After upgrading all packages to 1.119.0, deployment started to fail with:

Resource handler returned message: "Error occurred during operation 'putClusterCapacityProviders SDK error: The specified capacity provider is in use and cannot be removed. (Service: AmazonECS; Status Code: 400; Error Code: ResourceInUseException;

I've also tried using enableFargateCapacityProviders: true instead of capacityProviders. The result is the same error.

To summarize: Upgrading the AWS CDK ECS package from 1.109.0 to 1.119.0 breaks existing Clusters that had the Fargate capacity provider enabled.

madeline-k commented 3 years ago

The issue was introduced with this bug fix: https://github.com/aws/aws-cdk/pull/15012/

Unfortunately, we are now between a rock and a hard place, because if we just revert that change, customers who have deployed with 1.108.0 and up could run into this same issue.

In the meantime, you can apply this workaround to your cluster. This is the best I could come up with; there might be a simpler way.

import * as cdk from '@aws-cdk/core';
import * as ecs from '@aws-cdk/aws-ecs';

export class MyStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const cluster = new ecs.Cluster(this, 'MyCluster', {
      enableFargateCapacityProviders: true,
    });

    // Add back the old method of specifying capacity providers
    const cfnCluster = cluster.node.defaultChild as ecs.CfnCluster;
    cfnCluster.capacityProviders = ['FARGATE', 'FARGATE_SPOT'];

    // Remove the new method of specifying capacity providers. 
    cdk.Aspects.of(this).add(new MyAspect());
  }
}

class MyAspect implements cdk.IAspect {
  public visit(node: cdk.IConstruct): void {
    if (node instanceof ecs.CfnClusterCapacityProviderAssociations) {
      // IMPORTANT: The id supplied here must be the same as the id of your cluster. Don't worry, you won't remove the cluster.  
      node.node.scope?.node.tryRemoveChild('MyCluster');
    }
  }
}

Note that a simple cluster.node.tryRemoveChild('MyCluster') doesn't work here, which is why I used an Aspect to remove the CfnClusterCapacityProviderAssociations resource. I am not sure exactly why this is. I think it could be because the resource was itself added with an Aspect.

zsimjee commented 2 years ago

Thanks Madeline! Any idea if addAsgCapacityProvider regressed as a whole? I'm trying it on a brand new ECS cluster and I keep hitting the same exception:

Resource handler returned message: "Out of retries. Last encountered error was: The specified capacity provider is in use and cannot be removed. (Service: AmazonECS; Status Code: 400; Error Code: ResourceInUseException; Request ID:

frishrash commented 2 years ago

Thanks @madeline-k !

If anyone is interested in the Pythonic version of this workaround using CDK 2.x:

import jsii
from aws_cdk import (
    Aspects,
    IAspect,
    Stack,
    aws_ecs as ecs,
)
from constructs import Construct


@jsii.implements(IAspect)
class MyAspect:
    def visit(self, node):
        if isinstance(node, ecs.CfnClusterCapacityProviderAssociations):
            # IMPORTANT: the id supplied here must be the same as the id of your cluster.
            if node.node.scope:
                node.node.scope.node.try_remove_child('MyCluster')


class MyStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        cluster = ecs.Cluster(self, 'MyCluster', enable_fargate_capacity_providers=True)
        # Add back the old method of specifying capacity providers
        cfn_cluster = cluster.node.default_child
        cfn_cluster.capacity_providers = ['FARGATE', 'FARGATE_SPOT']
        # Remove the new method of specifying capacity providers
        Aspects.of(self).add(MyAspect())

phishy commented 2 years ago

I get "The specified capacity provider is in use and cannot be removed" when I run cdk destroy on brand new stacks with CDK v2.13.0 that use ecs.Cluster with capacityProviders: ["FARGATE", "FARGATE_SPOT"].

Is this also known?

ghost commented 2 years ago

I have the same issue with cdk destroy. I've had it for so long that I can't even remember when it started. Currently running 2.40.0. The workaround is just to run cdk destroy twice.

amliuyong commented 1 year ago

I have the same issue with cdk destroy.

frishrash commented 1 year ago

The workaround works by removing the ClusterCapacityProviderAssociations resource that the cluster construct adds. You can verify in the cdk synth output whether it is still there or not.
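That check can also be scripted. The following is a minimal sketch, assuming the synthesized template has already been parsed from JSON (for example, from the template file that cdk synth writes under cdk.out). The types and the helper name `findAssociationIds` are hypothetical.

```typescript
// Minimal shape of a synthesized template; only the fields this sketch reads.
interface SynthResource {
  Type: string;
  Properties?: Record<string, unknown>;
}

interface SynthTemplate {
  Resources?: Record<string, SynthResource>;
}

// Hypothetical helper: logical IDs of every ClusterCapacityProviderAssociations
// resource that is still present in the template.
function findAssociationIds(template: SynthTemplate): string[] {
  const resources = template.Resources ?? {};
  return Object.keys(resources).filter(
    (id) => resources[id]?.Type === 'AWS::ECS::ClusterCapacityProviderAssociations',
  );
}
```

An empty array means the association resource is gone from the template, which is what the workaround is meant to achieve.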

altso commented 2 months ago

Any updates on the issue? I'm getting this error with cdk destroy.

pahud commented 1 month ago

I believe this is because:

ClusterCapacityProviderAssociations specifies both the FARGATE and FARGATE_SPOT capacity providers, but the FARGATE capacity provider is in use by the AWS::ECS::Service, so you can't delete ClusterCapacityProviderAssociations before the Fargate service. Instead, you need to delete the Fargate service before the ClusterCapacityProviderAssociations. Generally, you could ensure this ordering with a dependency:

service.node.addDependency(cfnassociation)

However, because the CfnClusterCapacityProviderAssociations resource is created from an Aspect, you can't use escape hatches to ensure the dependency.

The only solution that comes to my mind is using another Aspect.

Consider the full sample below:

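import { Aspects, IAspect, Stack, StackProps } from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import { Construct, IConstruct } from 'constructs';

class EnsureCapacityProviderAssociationsDependency implements IAspect {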
  private service: ecs.BaseService[];

  constructor(service: ecs.BaseService[]) {
    this.service = service;
  }

  public visit(node: IConstruct): void {
    if (node instanceof ecs.CfnClusterCapacityProviderAssociations) {
      console.log('found AWS::ECS::ClusterCapacityProviderAssociations')
      const cfnassociation = node as ecs.CfnClusterCapacityProviderAssociations
      this.service.forEach(s => {
        s.node.addDependency(cfnassociation)
        // cfnassociation.node.addDependency(s)
        console.log('added dependency to ' + s.node.id)
      })
    }
  }
}

export class DummyStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

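    // getDefaultVpc is the commenter's own helper (not shown); it is assumed to return the default VPC
    const vpc = getDefaultVpc(this);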

    // ECS Cluster
    const cluster = new ecs.Cluster(this, 'EcsCluster', {
      vpc,
      enableFargateCapacityProviders: true,
    });

    // ECS Task Definition
    const ecsTaskDef = new ecs.FargateTaskDefinition(this, 'EcsTaskDef', {
    });

    ecsTaskDef.addContainer('web', {
      image: ecs.ContainerImage.fromRegistry('amazon/amazon-ecs-sample'),
    });

    // ECS Service
    const service = new ecs.FargateService(this, 'EcsService', {
      cluster,
      taskDefinition: ecsTaskDef,
      capacityProviderStrategies: [
        {
          capacityProvider: 'FARGATE',
          base: 0,
          weight: 1,
        },
      ],
    });

    // We need to ensure the dependency, but CfnClusterCapacityProviderAssociations is added by an Aspect:
    // https://github.com/aws/aws-cdk/blob/9946ab03672bf6664e8ec95a81ddb67c3bb2f63b/packages/aws-cdk-lib/aws-ecs/lib/cluster.ts#L1380C29-L1380C67
    // so we need another Aspect to add it.
    Aspects.of(cluster).add(new EnsureCapacityProviderAssociationsDependency([service]))
  }
}

In the cdk synth output, you should see that the service has a DependsOn entry for the association:

 "EcsService81FC6EF6": {
   "Type": "AWS::ECS::Service",
   "Properties": {
    "CapacityProviderStrategy": [
     {
      "Base": 0,
      "CapacityProvider": "FARGATE",
      "Weight": 1
     }
    ],
   ....
   },
   "DependsOn": [
    "EcsCluster72B17558",  // <--- CfnClusterCapacityProviderAssociations here
    "EcsTaskDefTaskRole4E058A8F"
   ],

With this dependency in place, the service is always destroyed before the CfnClusterCapacityProviderAssociations resource, so cdk destroy is no longer blocked.

 % npx cdk destroy
found AWS::ECS::ClusterCapacityProviderAssociations
added dependency to EcsService
Are you sure you want to delete: dummy-stack17 (y/n)? y
dummy-stack17: destroying... [1/1]

 ✅  dummy-stack17: destroyed
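If you want to confirm the DependsOn entry without reading the template by eye, here is a minimal sketch, assuming the synthesized template has been parsed from JSON. The types and the helper name `servicesWithoutAssociationDependency` are hypothetical. It only looks at explicit DependsOn entries, so a dependency that exists only through a Ref or Fn::GetAtt would not be detected.

```typescript
// Minimal shape of a synthesized template; only the fields this sketch reads.
interface StackResource {
  Type: string;
  DependsOn?: string | string[];
}

interface StackTemplate {
  Resources?: Record<string, StackResource>;
}

// Hypothetical helper: logical IDs of ECS services that do NOT declare an
// explicit DependsOn on any ClusterCapacityProviderAssociations resource.
// Returns an empty list when the template has no such association at all.
function servicesWithoutAssociationDependency(template: StackTemplate): string[] {
  const resources = template.Resources ?? {};
  const ids = Object.keys(resources);
  const associationIds = new Set(
    ids.filter((id) => resources[id]?.Type === 'AWS::ECS::ClusterCapacityProviderAssociations'),
  );
  if (associationIds.size === 0) {
    return [];
  }
  return ids.filter((id) => {
    const res = resources[id];
    if (res?.Type !== 'AWS::ECS::Service') {
      return false;
    }
    // CloudFormation accepts DependsOn as a single string or a list of strings.
    const raw = res.DependsOn;
    const dependsOn = raw === undefined ? [] : Array.isArray(raw) ? raw : [raw];
    return !dependsOn.some((dep) => associationIds.has(dep));
  });
}
```

An empty result means every service in the template will be deleted before the association it depends on.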

Let me know if it works for you.