lamontadams opened 5 months ago
After some more testing, this absolutely has something to do with deploying when there's a new ECR image waiting to be picked up by the task. With some tweaks to health check grace period, I can deploy all day long with no issue, but as soon as a new container image is waiting everything goes bonkers and I have to trash the stack and scratch deploy to recover.
This is extremely frustrating, would love to have a workaround.
(Just saw this and maybe I'll give a helping hand - since I had a very similar issue with 10GB images)
You probably have a large container image that takes a long time to provision (download from ECR) and health checks that are too short. Check the ECS logs and the Service Events tab; that could shed some light as well.
Thanks for this - in this case these images are relatively small, 200-300MB. I seem to recall seeing log output indicating that they start successfully but I'll pay attention the next time I try this. Like I said in the bug report, the events tab just shows an endlessly repeating cycle of start, unhealthy, stop, de-register.
I ground away on this all day yesterday, and part of my problem seems to be that the defaults are a little asinine. By default, the deployment circuit breaker is disabled and the minHealthyPercent value appears to be 100, which seems to me like a recipe for a deadlocked deployment any time you have desiredCount > 1.
I turned on the circuit breaker, set a generous grace period, and minHealthyPercent to 50:
```ts
circuitBreaker: {
  enable: true,
  rollback: true,
},
desiredCount: 2,
healthCheckGracePeriod: Duration.minutes(5),
minHealthyPercent: 50,
```
And the situation is a little better - the circuit breaker did detect a deadlocked deployment and cancelled it... after 4 hours. At least the stack isn't stuck in an endless update, I guess?
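For reference, here's roughly where those settings live in my stack (a simplified sketch with placeholder names, VPC/cluster setup, and image reference, not my exact code):

```ts
import { Duration, Stack, StackProps } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecr from 'aws-cdk-lib/aws-ecr';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';
import { Construct } from 'constructs';

export class ApiServiceStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });
    const cluster = new ecs.Cluster(this, 'Cluster', { vpc });
    const repo = ecr.Repository.fromRepositoryName(this, 'Repo', 'my-app'); // placeholder repo name

    new ecsPatterns.ApplicationLoadBalancedFargateService(this, 'Service', {
      cluster,
      desiredCount: 2,
      minHealthyPercent: 50,
      healthCheckGracePeriod: Duration.minutes(5),
      circuitBreaker: { enable: true, rollback: true },
      taskImageOptions: {
        image: ecs.ContainerImage.fromEcrRepository(repo, 'latest'), // placeholder tag
        containerPort: 8080,
      },
    });
  }
}
```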
My last gasp here is experimenting with just deploying a dummy "hello world" image to get the infrastructure set, and pushing actual image updates in response to git pushes via a CLI script. Which is, frankly, precisely the kind of situation I look to CDK to help me avoid.
If that doesn't work then I'll give up and look for some canned terraform.
Edit to add, FWIW, I have a working cluster that was hand-configured and the images I'm deploying here work fine there, so this doesn't feel like an image problem.
This just seems to be broken and unusable for me.
If I build, push, and tag an image to ECR and then force a deployment via `aws ecs update-service --force-new-deployment`, the service updates normally and is stable. I can watch the container start and see it answering health checks in the ECS Service logs in the console.
If, however, I use ApplicationLoadBalancedFargateService to force a deployment on the same existing service - either by supplying a different ECR tag or by forcing a new task definition through modified environment variables - the deployment reliably hangs and triggers the circuit breaker (now that I've enabled it; I still think the default-disable behavior is silly). In this case, I never see the container start in the ECS Service logs, which is really wild because it's the same image.
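To make that concrete, the CDK-side change is just the tag fed into taskImageOptions - roughly this fragment (placeholder names, not my exact code):

```ts
taskImageOptions: {
  // changing `imageTag` here (e.g. to a new git SHA) yields a new task definition
  // revision and a new ECS deployment on the next cdk deploy
  image: ecs.ContainerImage.fromEcrRepository(repo, imageTag),
  containerPort: 8080,
},
```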
Hi
Let me explain a little bit about this.
CDK deploys ECS services via CloudFormation (CFN for short). In CFN, an ECS service deployment has to reach a stable state before the stack can enter the CREATE_COMPLETE or UPDATE_COMPLETE state, which is by design from CFN. What's happening under the hood is that CFN has to make sure:
- every task in the service reaches the RUNNING state, and
- the service deployment stabilizes, before the stack moves to CREATE_COMPLETE or UPDATE_COMPLETE.
With the AWS CLI, when you run `aws ecs update-service --force-new-deployment`, the CLI returns immediately without checking whether the service has completed its rolling update or whether all health checks have passed. That being said, the operations behind the scenes are totally different.
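If you want to reproduce outside CFN the kind of wait it performs, you can poll for service stability yourself, for example with the ECS waiter in the AWS SDK for JavaScript v3 (a rough sketch; cluster and service names are placeholders):

```ts
import { ECSClient, waitUntilServicesStable } from '@aws-sdk/client-ecs';

async function waitForServiceStable(): Promise<void> {
  const client = new ECSClient({});
  // Polls DescribeServices until the deployment has settled (roughly the condition
  // CFN waits for); a plain `update-service --force-new-deployment` does not wait.
  await waitUntilServicesStable(
    { client, maxWaitTime: 900 }, // seconds
    { cluster: 'my-cluster', services: ['my-service'] } // placeholder names
  );
}
```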
It looks like your initial deployment is good and it only fails when you update the existing deployment?
I would like to know:
- Before you `cdk deploy` to update your existing, successful initial deployment, can you share your `cdk diff` output so we can see what would be changed?
- After you update your deployment, does the AWS::ECS::Service stay in UPDATE_IN_PROGRESS status? If you go to the ECS console to view the service, can you tell whether the tasks have completed their health checks? Are you seeing them being terminated and recreated due to failed health checks or for any other reason?
- Are you able to see your container logs in CloudWatch Logs? Was the application in your container running successfully, or did it exit for some unexpected reason? Bad or failed command execution can result in failing health checks. Sometimes the health checks need a longer grace period before the first check runs, because the container may need to pull a large image or the application may take longer to start before it is ready to serve traffic. You will need to observe its logs and the activities/events in the ECS console to determine the root cause.
Try to simplify your ECS service deployment without the circuit breaker or any other unnecessary features. This will help you simplify your CDK design and focus on what really matters to ensure the core functionality works.
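One concrete knob to check is the ALB target group health check that the pattern creates, since that is what marks tasks unhealthy; it can be loosened on the construct, roughly like this (a sketch; the path and thresholds are placeholders):

```ts
import { Duration } from 'aws-cdk-lib';

// `service` is the ApplicationLoadBalancedFargateService instance from your stack.
service.targetGroup.configureHealthCheck({
  path: '/health',                // placeholder health endpoint
  interval: Duration.seconds(30),
  timeout: Duration.seconds(10),
  healthyThresholdCount: 2,
  unhealthyThresholdCount: 5,     // tolerate more failures during slow startups
  healthyHttpCodes: '200-399',
});
```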
Hope it helps!
Hi, and thanks for the reply.
I understand there's some very complex interaction between CDK and CFN and ECS and that both of the latter are by themselves extremely complex systems. I have kind of moved on here since I was not able to get deployments to work reliably. I'm now just using cdk to do initial environment setup, and using ecs cli commands to do all subsequent task updates. Which is far from ideal, but works.
I believe I have narrowed things down to:
If I `cdk deploy` a new stack using ecsPatterns.ApplicationLoadBalancedFargateService (so we're creating a new ECS cluster and all its supporting stuff) which references an image tag already pushed to ECR, the deployment succeeds, the ECS services all start successfully, and the CFN stack ends in CREATE_COMPLETE.
If I then modify anything which would cause a new task definition to be created (e.g. change one of the task definition environment values via taskImageOptions.environment) - EDIT TO ADD (crucially, I think): using the same image and tag - then a subsequent deployment will trigger the circuit breaker (if it's been explicitly enabled, see below) and the update will fail. I have not done a `cdk diff` here, but I'm confident from comparing synth output that this is all that's changed.
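For concreteness, this is the shape of change I mean - same image, same tag, just a different environment value (simplified, not my exact code):

```ts
taskImageOptions: {
  image: ecs.ContainerImage.fromEcrRepository(repo, 'v1.2.3'), // unchanged image and tag
  environment: {
    // flipping this value is enough to create a new task definition revision,
    // and the subsequent deploy is the one that trips the circuit breaker
    SOME_SETTING: 'new-value',
  },
},
```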
In both situations, I have been able to see output from the running task images in both the ECS and CloudWatch logs indicating to me that they have started, and they appear to be running before what I understand (through experimentation) to be the controlling metric - the health check grace period - has elapsed.
The default circuitBreaker.enable value should be true, because in the above situation, with the current default of false, it is in my experience very easy (indeed almost guaranteed) to wind up with an ECS update that never finishes (it stays locked in a cycle of restarting new tasks) and a CFN stack that therefore remains stuck in UPDATE_IN_PROGRESS long-term (5+ hours). The only way I found to resolve this situation is to manually intervene in the console: delete the ECS services and cluster, cancel the CFN stack update, and then destroy the stack. That's a terrible user experience, and frankly, if this were my first attempt at provisioning infrastructure via CDK (I am, in fact, very successfully using it to manage a large cloud-native platform), I would have put it down, walked away, and never looked back.
Describe the bug
Initial deployments using ApplicationLoadBalancedFargateService from ecs-patterns complete successfully and produce working, healthy, reachable services. All subsequent deployments fail with a repeating series of events: tasks start, fail their health checks, and are stopped and deregistered, over and over.
The situation does not resolve itself within 6 hours.
If a user cancels the cdk deployment script, then:
However, of course the changes in the stack update haven't been applied.
Have reproduced in the following conditions: CDK CLI 2.139.1 and 2.147.2, Node 18 and 21, on local Ubuntu and on a GitHub workflow runner image.
This is pretty severe and it's preventing us from using CDK to manage any ECS infrastructure at all.
Expected Behavior
The CF stack should update successfully on subsequent deployments, and ECS service updates should happen only when they are necessary. Based on my testing and experimentation, I'm seeing ECS updates being made when nothing about the service has changed in my code, which is confusing at best.
Current Behavior
As above. Deployments subsequent to the first fail with a hung "UPDATE_IN_PROGRESS" stack, apparently because ECS health checks are failing. Interestingly, this occurs even when the changes do not touch any ECS services or tasks - just unrelated changes in the same stack, like an SSM parameter rename or value change.
Reproduction Steps
I'm using CDK through a wrapper package that supplies a bunch of boilerplate for consistent naming and whatnot. Happy to provide more info.
Sample reproduction code (typescript):
Sample CF template:
Possible Solution
No response
Additional Information/Context
Open to alternative suggestions or workarounds. Landed on ecs-patterns because it was the quickest way to get a service up and running from scratch, not married to it.
CDK CLI Version
2.139.1 (and also 2.147.2)
Framework Version
No response
Node.js Version
18 and 21
OS
Linux Ubuntu (real and github workflow runner image)
Language
TypeScript
Language Version
5.0.4 and 5.5.3
Other information
No response