aws / aws-cdk

The AWS Cloud Development Kit is a framework for defining cloud infrastructure in code
https://aws.amazon.com/cdk
Apache License 2.0

aws-ecs: ApplicationLoadBalancedFargateService generates stacks hung in "UPDATE_IN_PROGRESS" and failed health checks #30728

Open lamontadams opened 4 days ago

lamontadams commented 4 days ago

Describe the bug

Initial deployments using ApplicationLoadBalancedFargateService from ecs-patterns complete successfully and generate working, healthy, reachable services. All subsequent deployments fail with a particular series of events:

  1. The CF stack hangs in "UPDATE_IN_PROGRESS" status
  2. CDK hangs at "UPDATE_IN_PROGRESS | AWS::ECS::Service"
  3. The ECS cluster console shows "Services: Draining" and "Tasks: Pending"
  4. The ECS service accumulates deployment failures and shows this cycle of events in the event log:
    service ... has started 1 tasks
    service ... port 3000 is unhealthy (reason Health checks failed).
    service ... has stopped 1 running tasks
    service ... has deregistered 1 target ...
    service ... has started 1 task

    The situation does not resolve itself, even after 6 hours.

If a user cancels the cdk deployment script, then:

  1. The CF stack becomes hung in "UPDATE_ROLLBACK_IN_PROGRESS" status
  2. The ECS service completes a successful update and enters a steady state

However, of course the changes in the stack update haven't been applied.

I have reproduced this under the following conditions:

  1. The same ECR image is tagged in both the initial and updated deployment
  2. No effective changes to the cluster, service or task definition are deployed - just unrelated changes elsewhere in the stack (e.g. a renamed SSM parameter)

This is pretty severe and it's preventing us from using CDK to manage any ECS infrastructure at all.

Expected Behavior

I expect the CF stack to update successfully on subsequent deployments, and ECS service updates to happen only when they are actually necessary. Based on my testing and experimentation, I'm seeing ECS updates made even when nothing about the service has changed in my code, which is confusing at best.

Current Behavior

As above. Deployments subsequent to the first fail with a hung "UPDATE_IN_PROGRESS" stack, apparently because ECS health checks are failing. Interestingly, this occurs even when the changes do not touch any ECS services or tasks at all, just unrelated resources in the same stack, like an SSM parameter rename or value change.

Reproduction Steps

I'm using CDK through a wrapper package that supplies a bunch of boilerplate for consistent naming and whatnot. Happy to provide more info.

Sample reproduction code (typescript):

await addFargateService({
    cpu: ServiceCPUUnits.TWO_VCPU,
    memory: ServiceMemoryLimit.FOUR_GB,
    protocol: ApplicationProtocol.HTTPS,
    desiredInstances: 2,
    certificateArn,
    containerPort: 3000,
    domainName: `${domainPrefix}.some-app.com`,
    ecrRepositoryArn,
    ecrTag: 'latest',
    environmentVariables: {

    },
    hostedZoneAttributes: {
      hostedZoneId,
      zoneName,
    },
    scope,
  });

export const addFargateService = async (options: AddFargateServiceOptions) => {
  const {
    scope,
    certificateArn,
    containerPort = 80,
    cpu = ServiceCPUUnits.HALF_VCPU,
    desiredInstances = 1,
    domainName,
    ecrRepositoryArn,
    ecrTag,
    entryPoint,
    environmentVariables,
    healthCheckPath,
    healthyThresholdCount = 2,
    hostedZoneAttributes,
    logRetention = scope.env === "prod"
      ? RetentionDays.ONE_MONTH
      : RetentionDays.TWO_WEEKS,
    maxInstances = 3,
    memory = ServiceMemoryLimit.HALF_GB,
    minInstances = 1,
    name = "default",
    protocol = ApplicationProtocol.HTTPS,
    scaleAt,
    targetGroupAttributes = {},
    vpcId,
    vpcInterfaceVpcEndpoints = [],
  } = options;

  let cluster = options.cluster;

  // generates a name like `${scope.env}-${name}-ecs`
  const baseName = helpers.baseName(name, "ecs");

  // load balancers can only have short names. prefix with environment so security works
  const loadBalancerParamName = `${scope.ssmPrefix}/names/load-balancer`;

  const loadBalancerName = (await paramExists(loadBalancerParamName))
    ? getRemoteValue(loadBalancerParamName, scope)
    : `${scope.envPrefix}${customAlphabet(alphanumeric, 12)()}`;

  const clusterName = `${scope.appName}-${baseName}`;
  const serviceName = `${clusterName}-service`;
  const taskDefinitionName = `${clusterName}-task-definition`;

  // currently can't use fromHostedZoneId. See: https://github.com/aws/aws-cdk/issues/8406
  // have to use "attributes" which requires from id and name :shrug:
  const domainZone = hostedZoneAttributes
    ? HostedZone.fromHostedZoneAttributes(
        scope,
        `${clusterName}-HostedZone-fromHostedZoneAttributes`,
        hostedZoneAttributes
      )
    : undefined;

  const certificate = certificateArn
    ? Certificate.fromCertificateArn(
        scope,
        `${clusterName}-fromCertificateArn`,
        certificateArn
      )
    : undefined;

  const serviceResponse = new ecsPatterns.ApplicationLoadBalancedFargateService(
    scope,
    serviceName,
    {
      assignPublicIp: true,
      certificate,
      cluster,
      cpu,
      desiredCount: desiredInstances,
      domainName,
      domainZone,
      loadBalancerName,
      memoryLimitMiB: memory,
      protocol,
      publicLoadBalancer: true,
      redirectHTTP: protocol === ApplicationProtocol.HTTPS,
      serviceName,
      taskImageOptions: {
        containerPort: containerPort,
        entryPoint,
        environment: {
          ...environmentVariables,
        },
        family: taskDefinitionName,
        image: ContainerImage.fromEcrRepository(
          Repository.fromRepositoryArn(
            scope,
            `${serviceName}-Repository-fromRepositoryArn`,
            ecrRepositoryArn
          ),
          ecrTag
        ),
        logDriver: LogDriver.awsLogs({
          streamPrefix: serviceName,
          logRetention: logRetention,
        }),
      },
      // if we specify a cluster, we can't specify a vpc.
      vpc: cluster
        ? undefined
        : // otherwise look up from id if provided
        vpcId
        ? Vpc.fromLookup(scope, `${serviceName}-Vpc-fromLookup`, {
            vpcId,
          })
        : // use the default if we weren't given an id.
          Vpc.fromLookup(scope, `${serviceName}-Vpc-fromLookup`, {
            isDefault: true,
          }),
    }
  );

  const { service, targetGroup, taskDefinition } = serviceResponse;

  if (!cluster) {
    cluster = serviceResponse.cluster;
  }

  scope.overrideId(cluster as Cluster, clusterName);
  scope.overrideId(service, serviceName);

  if (healthCheckPath) {
    targetGroup.configureHealthCheck({
      healthyThresholdCount,
      path: healthCheckPath,
    });
  }

  for (const key in targetGroupAttributes) {
    const value = targetGroupAttributes[key];
    targetGroup.setAttribute(key, value);
  }

  const scaling = service.autoScaleTaskCount({
    maxCapacity: maxInstances,
    minCapacity: minInstances,
  });

  if (scaleAt) {
    const { cpuPercent, memoryPercent } = scaleAt;
    if (cpuPercent) {
      scaling.scaleOnCpuUtilization(`${serviceName}-scaling`, {
        targetUtilizationPercent: cpuPercent,
      });
    } else if (memoryPercent) {
      scaling.scaleOnMemoryUtilization(`${serviceName}-scaling`, {
        targetUtilizationPercent: memoryPercent,
      });
    }
  }

  const { vpc } = cluster;

  // this apparently prevents some CF hangs and group ownership problems.
  const securityGroup = addSecurityGroup({
    allowAllOutbound: true,
    id: `${clusterName}-sg`,
    scope,
    vpc,
  });
  const securityGroups = [securityGroup];

  for (const service of vpcInterfaceVpcEndpoints) {
    vpc.addInterfaceEndpoint(`${clusterName}-service-iface-${nanoid()}`, {
      service,
      securityGroups,
    });
  }

  const params = {
    loadBalancerName: addParam({
      id: `${scope.env}-${baseName}-load-balancer-name`,
      name: loadBalancerParamName,
      scope,
      description: `Load balancer name for ${baseName} in ${scope.env}`,
      value: loadBalancerName,
    }),
    url: addParam({
      scope,
      name: `${scope.ssmPrefix}/url/default`,
      value: `https://${domainName}`,
    }),

    clusterArn: addParam({
      scope,
      name: `${scope.ssmPrefix}/arn/cluster`,
      value: cluster?.clusterArn,
    }),

    serviceArn: addParam({
      scope,
      name: `${scope.ssmPrefix}/arn/service`,
      value: service?.serviceArn,
    }),

    taskDefinitionName: addParam({
      scope,
      name: `${scope.ssmPrefix}/name/task-definition`,
      value: taskDefinitionName,
    }),
  };

  return { cluster, params };
};

Sample CF template:

Resources:
  qaDashboardQaDashboardUrlMain:
    Type: AWS::SSM::Parameter
    Properties:
      AllowedPattern: .*
      Name: /qa/dashboard/url/main
      Tier: Standard
      Type: String
      Value: https://dashboard-qa.recurate-app.com
    Metadata:
      aws:cdk:path: qa-dashboard-stack/-qa-dashboard-url-main/Resource
  qadashboarddefaultecsserviceLB46C41477:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      LoadBalancerAttributes:
        - Key: deletion_protection.enabled
          Value: "false"
      Name: qa-eRUoN9bGSrwe
      Scheme: internet-facing
      SecurityGroups:
        - Fn::GetAtt:
            - qadashboarddefaultecsserviceLBSecurityGroup9C6EF5EE
            - GroupId
      Subnets:
        - subnet-01de7c3a3fdc86c35
        - subnet-0818b022219eb6648
        - subnet-071233d95301fa614
      Type: application
    Metadata:
      aws:cdk:path: qa-dashboard-stack/qa-dashboard-default-ecs-service/LB/Resource
  qadashboarddefaultecsserviceLBSecurityGroup9C6EF5EE:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Automatically created Security Group for ELB qadashboardstackqadashboarddefaultecsserviceLBBA5B6237
      SecurityGroupIngress:
        - CidrIp: 0.0.0.0/0
          Description: Allow from anyone on port 443
          FromPort: 443
          IpProtocol: tcp
          ToPort: 443
        - CidrIp: 0.0.0.0/0
          Description: Allow from anyone on port 80
          FromPort: 80
          IpProtocol: tcp
          ToPort: 80
      VpcId: vpc-0adb67e60add15afb
    Metadata:
      aws:cdk:path: qa-dashboard-stack/qa-dashboard-default-ecs-service/LB/SecurityGroup/Resource
  qadashboarddefaultecsserviceLBSecurityGrouptoqadashboardstackqadashboarddefaultecsserviceServiceSecurityGroup134D3922300065687CF7:
    Type: AWS::EC2::SecurityGroupEgress
    Properties:
      Description: Load balancer to target
      DestinationSecurityGroupId:
        Fn::GetAtt:
          - qadashboarddefaultecsserviceServiceSecurityGroup94C1C467
          - GroupId
      FromPort: 3000
      GroupId:
        Fn::GetAtt:
          - qadashboarddefaultecsserviceLBSecurityGroup9C6EF5EE
          - GroupId
      IpProtocol: tcp
      ToPort: 3000
    Metadata:
      aws:cdk:path: qa-dashboard-stack/qa-dashboard-default-ecs-service/LB/SecurityGroup/to qadashboardstackqadashboarddefaultecsserviceServiceSecurityGroup134D3922:3000
  qadashboarddefaultecsserviceLBPublicListener3E9E0E16:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      Certificates:
        - CertificateArn: arn:aws:acm:us-east-2:795401590028:certificate/5fb2cca3-625d-4c71-ae03-68612268c22b
      DefaultActions:
        - TargetGroupArn:
            Ref: qadashboarddefaultecsserviceLBPublicListenerECSGroupC201A432
          Type: forward
      LoadBalancerArn:
        Ref: qadashboarddefaultecsserviceLB46C41477
      Port: 443
      Protocol: HTTPS
    Metadata:
      aws:cdk:path: qa-dashboard-stack/qa-dashboard-default-ecs-service/LB/PublicListener/Resource
  qadashboarddefaultecsserviceLBPublicListenerECSGroupC201A432:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 80
      Protocol: HTTP
      TargetGroupAttributes:
        - Key: stickiness.enabled
          Value: "false"
      TargetType: ip
      VpcId: vpc-0adb67e60add15afb
    Metadata:
      aws:cdk:path: qa-dashboard-stack/qa-dashboard-default-ecs-service/LB/PublicListener/ECSGroup/Resource
  qadashboarddefaultecsserviceLBPublicRedirectListener68A68CF3:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      DefaultActions:
        - RedirectConfig:
            Port: "443"
            Protocol: HTTPS
            StatusCode: HTTP_301
          Type: redirect
      LoadBalancerArn:
        Ref: qadashboarddefaultecsserviceLB46C41477
      Port: 80
      Protocol: HTTP
    Metadata:
      aws:cdk:path: qa-dashboard-stack/qa-dashboard-default-ecs-service/LB/PublicRedirectListener/Resource
  qadashboarddefaultecsserviceDNS6D3A1676:
    Type: AWS::Route53::RecordSet
    Properties:
      AliasTarget:
        DNSName:
          Fn::Join:
            - ""
            - - dualstack.
              - Fn::GetAtt:
                  - qadashboarddefaultecsserviceLB46C41477
                  - DNSName
        HostedZoneId:
          Fn::GetAtt:
            - qadashboarddefaultecsserviceLB46C41477
            - CanonicalHostedZoneID
      HostedZoneId: Z010373637UNHIAIR4GRM
      Name: dashboard-qa.recurate-app.com.
      Type: A
    Metadata:
      aws:cdk:path: qa-dashboard-stack/qa-dashboard-default-ecs-service/DNS/Resource
  qadashboarddefaultecsserviceTaskDefTaskRole39AF0AEA:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Action: sts:AssumeRole
            Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
        Version: "2012-10-17"
    Metadata:
      aws:cdk:path: qa-dashboard-stack/qa-dashboard-default-ecs-service/TaskDef/TaskRole/Resource
  qaDashboardDefaultEcsTaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      ContainerDefinitions:
        - Environment:
           ...
          Essential: true
          Image:
            Fn::Join:
              - ""
              - - 795401590028.dkr.ecr.us-east-2.
                - Ref: AWS::URLSuffix
                - /admin-dashboard:latest
          LogConfiguration:
            LogDriver: awslogs
            Options:
              awslogs-group:
                Ref: qadashboarddefaultecsserviceTaskDefwebLogGroupE025C287
              awslogs-stream-prefix: qa-dashboard-default-ecs-service
              awslogs-region: us-east-2
          Name: web
          PortMappings:
            - ContainerPort: 3000
              Protocol: tcp
      Cpu: "2048"
      ExecutionRoleArn:
        Fn::GetAtt:
          - qadashboarddefaultecsserviceTaskDefExecutionRoleE3593C71
          - Arn
      Family: qadashboardstackqadashboarddefaultecsserviceTaskDef30AF0525
      Memory: "4096"
      NetworkMode: awsvpc
      RequiresCompatibilities:
        - FARGATE
      TaskRoleArn:
        Fn::GetAtt:
          - qadashboarddefaultecsserviceTaskDefTaskRole39AF0AEA
          - Arn
    Metadata:
      aws:cdk:path: qa-dashboard-stack/qa-dashboard-default-ecs-service/TaskDef/Resource
  qadashboarddefaultecsserviceTaskDefwebLogGroupE025C287:
    Type: AWS::Logs::LogGroup
    Properties:
      RetentionInDays: 14
    UpdateReplacePolicy: Retain
    DeletionPolicy: Retain
    Metadata:
      aws:cdk:path: qa-dashboard-stack/qa-dashboard-default-ecs-service/TaskDef/web/LogGroup/Resource
  qadashboarddefaultecsserviceTaskDefExecutionRoleE3593C71:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Action: sts:AssumeRole
            Effect: Allow
            Principal:
              Service: ecs-tasks.amazonaws.com
        Version: "2012-10-17"
    Metadata:
      aws:cdk:path: qa-dashboard-stack/qa-dashboard-default-ecs-service/TaskDef/ExecutionRole/Resource
  qadashboarddefaultecsserviceTaskDefExecutionRoleDefaultPolicyA245E84E:
    Type: AWS::IAM::Policy
    Properties:
      PolicyDocument:
        Statement:
          - Action:
              - ecr:BatchCheckLayerAvailability
              - ecr:GetDownloadUrlForLayer
              - ecr:BatchGetImage
            Effect: Allow
            Resource: arn:aws:ecr:us-east-2:795401590028:repository/admin-dashboard
          - Action: ecr:GetAuthorizationToken
            Effect: Allow
            Resource: "*"
          - Action:
              - logs:CreateLogStream
              - logs:PutLogEvents
            Effect: Allow
            Resource:
              Fn::GetAtt:
                - qadashboarddefaultecsserviceTaskDefwebLogGroupE025C287
                - Arn
        Version: "2012-10-17"
      PolicyName: qadashboarddefaultecsserviceTaskDefExecutionRoleDefaultPolicyA245E84E
      Roles:
        - Ref: qadashboarddefaultecsserviceTaskDefExecutionRoleE3593C71
    Metadata:
      aws:cdk:path: qa-dashboard-stack/qa-dashboard-default-ecs-service/TaskDef/ExecutionRole/DefaultPolicy/Resource
  qaDashboardDefaultEcsService:
    Type: AWS::ECS::Service
    Properties:
      Cluster:
        Ref: qaDashboardDefaultEcs
      DeploymentConfiguration:
        Alarms:
          AlarmNames: []
          Enable: false
          Rollback: false
        MaximumPercent: 200
        MinimumHealthyPercent: 50
      DesiredCount: 2
      EnableECSManagedTags: false
      HealthCheckGracePeriodSeconds: 60
      LaunchType: FARGATE
      LoadBalancers:
        - ContainerName: web
          ContainerPort: 3000
          TargetGroupArn:
            Ref: qadashboarddefaultecsserviceLBPublicListenerECSGroupC201A432
      NetworkConfiguration:
        AwsvpcConfiguration:
          AssignPublicIp: ENABLED
          SecurityGroups:
            - Fn::GetAtt:
                - qadashboarddefaultecsserviceServiceSecurityGroup94C1C467
                - GroupId
          Subnets:
            - subnet-01de7c3a3fdc86c35
            - subnet-0818b022219eb6648
            - subnet-071233d95301fa614
      ServiceName: qa-dashboard-default-ecs-service
      TaskDefinition:
        Ref: qaDashboardDefaultEcsTaskDefinition
    DependsOn:
      - qadashboarddefaultecsserviceLBPublicListenerECSGroupC201A432
      - qadashboarddefaultecsserviceLBPublicListener3E9E0E16
      - qadashboarddefaultecsserviceTaskDefTaskRole39AF0AEA
    Metadata:
      aws:cdk:path: qa-dashboard-stack/qa-dashboard-default-ecs-service/Service/Service
  qadashboarddefaultecsserviceServiceSecurityGroup94C1C467:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: qa-dashboard-stack/qa-dashboard-default-ecs-service/Service/SecurityGroup
      SecurityGroupEgress:
        - CidrIp: 0.0.0.0/0
          Description: Allow all outbound traffic by default
          IpProtocol: "-1"
      VpcId: vpc-0adb67e60add15afb
    DependsOn:
      - qadashboarddefaultecsserviceTaskDefTaskRole39AF0AEA
    Metadata:
      aws:cdk:path: qa-dashboard-stack/qa-dashboard-default-ecs-service/Service/SecurityGroup/Resource
  qadashboarddefaultecsserviceServiceSecurityGroupfromqadashboardstackqadashboarddefaultecsserviceLBSecurityGroupE0F4F395300098F897D3:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      Description: Load balancer to target
      FromPort: 3000
      GroupId:
        Fn::GetAtt:
          - qadashboarddefaultecsserviceServiceSecurityGroup94C1C467
          - GroupId
      IpProtocol: tcp
      SourceSecurityGroupId:
        Fn::GetAtt:
          - qadashboarddefaultecsserviceLBSecurityGroup9C6EF5EE
          - GroupId
      ToPort: 3000
    DependsOn:
      - qadashboarddefaultecsserviceTaskDefTaskRole39AF0AEA
    Metadata:
      aws:cdk:path: qa-dashboard-stack/qa-dashboard-default-ecs-service/Service/SecurityGroup/from qadashboardstackqadashboarddefaultecsserviceLBSecurityGroupE0F4F395:3000
  qadashboarddefaultecsserviceServiceTaskCountTargetC1DAD6A0:
    Type: AWS::ApplicationAutoScaling::ScalableTarget
    Properties:
      MaxCapacity: 3
      MinCapacity: 1
      ResourceId:
        Fn::Join:
          - ""
          - - service/
            - Ref: qaDashboardDefaultEcs
            - /
            - Fn::GetAtt:
                - qaDashboardDefaultEcsService
                - Name
      RoleARN:
        Fn::Join:
          - ""
          - - "arn:"
            - Ref: AWS::Partition
            - :iam::795401590028:role/aws-service-role/ecs.application-autoscaling.amazonaws.com/AWSServiceRoleForApplicationAutoScaling_ECSService
      ScalableDimension: ecs:service:DesiredCount
      ServiceNamespace: ecs
    DependsOn:
      - qadashboarddefaultecsserviceTaskDefTaskRole39AF0AEA
    Metadata:
      aws:cdk:path: qa-dashboard-stack/qa-dashboard-default-ecs-service/Service/TaskCount/Target/Resource
  qaDashboardDefaultEcs:
    Type: AWS::ECS::Cluster
    Metadata:
      aws:cdk:path: qa-dashboard-stack/EcsDefaultClusterMnL3mNNYNqa-dashboard-default-ecs-service-Vpc-fromLookup/Resource
  qadashboardqadashboarddefaultecssgsecuritygroup51619520:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: qa-dashboard-stack/qa-dashboard-qa-dashboard-default-ecs-sg-security-group
      SecurityGroupEgress:
        - CidrIp: 0.0.0.0/0
          Description: Allow all outbound traffic by default
          IpProtocol: "-1"
      VpcId: vpc-0adb67e60add15afb
    Metadata:
      aws:cdk:path: qa-dashboard-stack/qa-dashboard-qa-dashboard-default-ecs-sg-security-group/Resource
      cfn_nag:
        rules_to_suppress:
          - id: W5
            reason: Egress of 0.0.0.0/0 is default and generally considered OK
          - id: W40
            reason: Egress IPProtocol of -1 is default and generally considered OK
  qaDashboardQaDefaultEcsLoadBalancerName:
    Type: AWS::SSM::Parameter
    Properties:
      AllowedPattern: .*
      Description: Load balancer name for default-ecs in qa
      Name: /qa/dashboard/names/load-balancer
      Tier: Standard
      Type: String
      Value: qa-eRUoN9bGSrwe
    Metadata:
      aws:cdk:path: qa-dashboard-stack/qa-default-ecs-load-balancer-name/Resource
  qaDashboardQaDashboardUrlDefault:
    Type: AWS::SSM::Parameter
    Properties:
      AllowedPattern: .*
      Name: /qa/dashboard/url/default
      Tier: Standard
      Type: String
      Value: https://dashboard-qa.recurate-app.com
    Metadata:
      aws:cdk:path: qa-dashboard-stack/-qa-dashboard-url-default/Resource
  qaDashboardQaDashboardArnCluster:
    Type: AWS::SSM::Parameter
    Properties:
      AllowedPattern: .*
      Name: /qa/dashboard/arn/cluster
      Tier: Standard
      Type: String
      Value:
        Fn::GetAtt:
          - qaDashboardDefaultEcs
          - Arn
    Metadata:
      aws:cdk:path: qa-dashboard-stack/-qa-dashboard-arn-cluster/Resource
  qaDashboardQaDashboardArnService:
    Type: AWS::SSM::Parameter
    Properties:
      AllowedPattern: .*
      Name: /qa/dashboard/arn/service
      Tier: Standard
      Type: String
      Value:
        Ref: qaDashboardDefaultEcsService
    Metadata:
      aws:cdk:path: qa-dashboard-stack/-qa-dashboard-arn-service/Resource
  qaDashboardQaDashboardNameTaskDefinition:
    Type: AWS::SSM::Parameter
    Properties:
      AllowedPattern: .*
      Name: /qa/dashboard/name/task-definition
      Tier: Standard
      Type: String
      Value: qa-dashboard-default-ecs-task-definition
    Metadata:
      aws:cdk:path: qa-dashboard-stack/-qa-dashboard-name-task-definition/Resource
  CDKMetadata:
    Type: AWS::CDK::Metadata
    Properties:
      Analytics: v2:deflate64:H4sIAAAAAAAA/31SXW/CMAz8LXsP2QAx7XWwD01CGmp5Ryb1qkBIqthlQqj/fW4KHWzTnmJfLr3zuSM9HN/ruxv4pIEptgNn1/qYM5itEmh1JNq1fbS+XECEHTJGNfvwfdMoNFFnWAWyHOJhCoQC0aoClmtP+rGqnDXANvh5gGIKDrzB4gViCYw5xr018sQBsTVOGOvEEMX9SB//fp08XPWXPEuM/sQ51xf3S1FGfo2hrlrKRdsOI5o5mjpaPvSU/4HnMiLRL/jNJ7xR0jBOxjJLhibEomV2VY7cSpI+ntJYAm2f8MN62xrt3F0jwTNYGegC+5Fk8tGVuZF41i59dhZqz2rmajqt8FQ2yoLsOAsuPU3nIkhWh7TnVDXKhVJczkPZR3CuGwXf0ULNgURUlicx9uptwMnXFdI0KkMKdexMv9dc1dyF06GN8qFAvaHb/fBBT+Q33ZC1gyiD2B3qrDu/AEAQDGbDAgAA
    Metadata:
      aws:cdk:path: qa-dashboard-stack/CDKMetadata/Default
Outputs:
  qadashboarddefaultecsserviceLoadBalancerDNSC7850CDA:
    Value:
      Fn::GetAtt:
        - qadashboarddefaultecsserviceLB46C41477
        - DNSName
  qadashboarddefaultecsserviceServiceURL3FC394E2:
    Value:
      Fn::Join:
        - ""
        - - https://
          - Ref: qadashboarddefaultecsserviceDNS6D3A1676
Parameters:
  BootstrapVersion:
    Type: AWS::SSM::Parameter::Value<String>
    Default: /cdk-bootstrap/hnb659fds/version
    Description: Version of the CDK Bootstrap resources in this environment, automatically retrieved from SSM Parameter Store. [cdk:skip]
Rules:
  CheckBootstrapVersion:
    Assertions:
      - Assert:
          Fn::Not:
            - Fn::Contains:
                - - "1"
                  - "2"
                  - "3"
                  - "4"
                  - "5"
                - Ref: BootstrapVersion
        AssertDescription: CDK bootstrap stack version 6 required. Please run 'cdk bootstrap' with a recent version of the CDK CLI.

Possible Solution

No response

Additional Information/Context

Open to alternative suggestions or workarounds. Landed on ecs-patterns because it was the quickest way to get a service up and running from scratch, not married to it.

CDK CLI Version

2.139.1 (and also 2.147.2)

Framework Version

No response

Node.js Version

18 and 21

OS

Linux Ubuntu (real and github workflow runner image)

Language

TypeScript

Language Version

5.0.4 and 5.5.3

Other information

No response

lamontadams commented 3 days ago

After some more testing, this absolutely has something to do with deploying when there's a new ECR image waiting to be picked up by the task. With some tweaks to health check grace period, I can deploy all day long with no issue, but as soon as a new container image is waiting everything goes bonkers and I have to trash the stack and scratch deploy to recover.

This is extremely frustrating, would love to have a workaround.

rantoniuk commented 2 days ago

(Just saw this and maybe I can lend a hand, since I had a very similar issue with 10 GB images.)

You probably have a large container image that takes a long time to provision (download from ECR), combined with health checks that are too short. Check the ECS logs and the Service Events tab; that could shed some light as well.

lamontadams commented 2 days ago

(Just saw this and maybe I can lend a hand, since I had a very similar issue with 10 GB images.)

You probably have a large container image that takes a long time to provision (download from ECR), combined with health checks that are too short. Check the ECS logs and the Service Events tab; that could shed some light as well.

Thanks for this - in this case these images are relatively small, 200-300MB. I seem to recall seeing log output indicating that they start successfully but I'll pay attention the next time I try this. Like I said in the bug report, the events tab just shows an endlessly repeating cycle of start, unhealthy, stop, de-register.

I ground away on this all day yesterday, and part of my problem seems to be that the defaults are a little asinine. By default, the deployment circuit breaker is disabled and the minHealthyPercent value appears to be 100, which seems to me like a recipe for a deadlocked deployment any time you have desiredCount > 1.
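To make that concrete, here's my reading of the rolling-deployment arithmetic (the round-up/round-down behavior is my understanding of how ECS applies minimumHealthyPercent and maximumPercent; the helper function is purely illustrative):

```typescript
// Illustrative only: how much room ECS has during a rolling deployment,
// per my understanding of minimumHealthyPercent (rounds up) and
// maximumPercent (rounds down).
function deploymentHeadroom(
  desiredCount: number,
  minHealthyPercent: number,
  maximumPercent: number
) {
  const mustKeepRunning = Math.ceil((desiredCount * minHealthyPercent) / 100);
  const maxTotalTasks = Math.floor((desiredCount * maximumPercent) / 100);
  return {
    // old tasks ECS may stop before any replacement passes health checks
    oldTasksStoppableUpFront: desiredCount - mustKeepRunning,
    // extra tasks ECS may start on top of desiredCount
    extraTasksStartable: maxTotalTasks - desiredCount,
  };
}

// With the defaults I'm seeing (minHealthyPercent 100), nothing can be
// stopped until replacements pass health checks:
const defaults = deploymentHeadroom(2, 100, 200);
// defaults.oldTasksStoppableUpFront === 0, defaults.extraTasksStartable === 2

// With minHealthyPercent 50, one old task can be stopped immediately:
const tweaked = deploymentHeadroom(2, 50, 200);
// tweaked.oldTasksStoppableUpFront === 1
```

So with the defaults, everything hinges on the replacement tasks ever going healthy, which is exactly where my deployments wedge.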

I turned on the circuit breaker, set a generous grace period, and set minHealthyPercent to 50:

      circuitBreaker: {
        enable: true,
        rollback: true,
      },
      desiredCount: 2,
      healthCheckGracePeriod: Duration.minutes(5),
      minHealthyPercent: 50,

And the situation is a little better - the circuit breaker did detect a deadlocked deployment and cancelled it... after 4 hours. At least the stack isn't stuck in an endless update, I guess?
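For completeness, this is roughly how those props slot into the construct; it's a sketch rather than my actual wrapper (assumes aws-cdk-lib v2, and the stack ids, sample image, cpu/memory, and port are stand-ins for the values shown earlier):

```typescript
import { App, Duration, Stack } from "aws-cdk-lib";
import { ContainerImage } from "aws-cdk-lib/aws-ecs";
import * as ecsPatterns from "aws-cdk-lib/aws-ecs-patterns";

// Sketch: ApplicationLoadBalancedFargateService with the deployment-safety
// props discussed above. Values are what I tested with, not recommendations.
const app = new App();
const stack = new Stack(app, "repro-stack");

new ecsPatterns.ApplicationLoadBalancedFargateService(stack, "service", {
  cpu: 2048,
  memoryLimitMiB: 4096,
  desiredCount: 2,
  circuitBreaker: {
    enable: true,   // disabled by default
    rollback: true, // roll back instead of hanging in UPDATE_IN_PROGRESS
  },
  healthCheckGracePeriod: Duration.minutes(5),
  minHealthyPercent: 50, // default appears to behave like 100
  taskImageOptions: {
    // stand-in public image; my real stack pulls from ECR as shown earlier
    image: ContainerImage.fromRegistry("amazon/amazon-ecs-sample"),
    containerPort: 3000,
  },
});

app.synth();
```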

My last gasp here is experimenting with just deploying a dummy "hello world" image to get the infrastructure set, and pushing actual image updates in response to git pushes via a CLI script. Which is, frankly, precisely the kind of situation I look to CDK to help me avoid.

If that doesn't work then I'll give up and look for some canned terraform.

Edit to add, FWIW, I have a working cluster that was hand-configured and the images I'm deploying here work fine there, so this doesn't feel like an image problem.

lamontadams commented 8 hours ago

This just seems to be broken and unusable for me.

If I build, push and tag an image to ECR and then force a deployment via aws ecs update-service --force-new-deployment the service updates normally and is stable. I can watch the container start and see it answering health checks in the ECS Service logs in the console.

If, however, I use ApplicationLoadBalancedFargateService to force a deployment on the same existing service, either by supplying a different ECR tag or by forcing a new task definition through modified environment variables, the deployment reliably hangs and triggers the circuit breaker (now that I've enabled it; I still think the default-disable behavior is silly). In this case, I never see the container start in the ECS Service logs, which is really wild because it's the same image.