aws-cloudformation / cloudformation-coverage-roadmap

The AWS CloudFormation Public Coverage Roadmap
https://aws.amazon.com/cloudformation/
Creative Commons Attribution Share Alike 4.0 International

Cloudformation does not wait for ECS service to stabilize #1391

Open emspoars opened 2 years ago

emspoars commented 2 years ago

Name of the resource

AWS::ECS::Service

Resource Name

No response

Issue Description

Cloudformation, when used to deploy an ECS service, does not always wait for the service to reach a steady state before giving the resource the "CREATE_COMPLETE/UPDATE_COMPLETE" status.

If an invalid image is specified in the corresponding task, the result is as expected: the service does not reach a steady state and the resource creation/update fails.

If, however, the image exists but the service does not reach a steady state because of a failing ECS health check, Cloudformation simply considers the resource OK and lets it reach the "CREATE_COMPLETE/UPDATE_COMPLETE" status, even before ECS gives up starting the service.

Here are other relevant issues about this:

Expected Behavior

Cloudformation, when used to deploy an ECS service, should wait for the service to reach a steady state before giving the resource a CREATE_COMPLETE or UPDATE_COMPLETE status.

Observed Behavior

Cloudformation, when used to deploy an ECS service, gives the resource a CREATE_COMPLETE/UPDATE_COMPLETE status before the service reaches a steady state.

Test Cases

The resource in Cloudformation will reach the CREATE_COMPLETE/UPDATE_COMPLETE status but the ECS service will not stabilize.

Other Details

No response

TristanUnibuddy commented 1 year ago

We have hit this problem a few times recently. Updates seem to be fine, but on initial creation Cloudformation doesn't wait for the services to become healthy.

sepehr commented 1 year ago

We have experienced this as well, and would much appreciate support in this case. Native support for this seems to be necessary to avoid problematic stack deployments that falsely signal success.

I hope I'm not littering the issue by sharing a workaround for people who need one in the meantime. I had to spin up a monitoring Lambda on the side and hook it into the stack using wait conditions, so that the stack rolls back if the services don't reach a steady state. Something along the lines of:

const AWS = require('aws-sdk')
const ecs = new AWS.ECS()
const cfn = new AWS.CloudFormation()

exports.handler = async (event, context) => {
  const stackName = event.StackId.split('/')[1]

  const serviceName1 = 'ecs-service-1'
  const serviceName2 = 'ecs-service-2'
  const clusterName = 'ecs-cluster-name'

  try {
    const services = await ecs.describeServices({
      services: [serviceName1, serviceName2],
      cluster: clusterName
    }).promise()

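    // describeServices lists services it could not find in `failures` instead of throwing
    if (services.failures && services.failures.length > 0) {
      throw new Error(`Could not describe ECS services: ${services.failures.map(f => f.arn).join(', ')}`)
    }

    for (const service of services.services) {
      if (service.status !== 'ACTIVE' || service.desiredCount !== service.runningCount) {
        throw new Error(`ECS service "${service.serviceName}" is still not in a steady state`)
      }
    }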

    // If all services are in a steady state, signal success to the WaitCondition
    await cfn.signalResource({
      StackName: stackName,
      LogicalResourceId: 'WaitHandle',
      UniqueId: context.invokedFunctionArn,
      Status: 'SUCCESS'
    }).promise()
  } catch (error) {
    // If a service is not in a steady state, signal failure to the WaitCondition
    await cfn.signalResource({
      StackName: stackName,
      LogicalResourceId: 'WaitHandle',
      UniqueId: context.invokedFunctionArn,
      Status: 'FAILURE',
      Reason: error.message
    }).promise()
  }
}
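
The handler above checks the services once, so a service that is still starting when the function runs is reported as a failure. A minimal sketch of a polling variant follows. The names `isSteady` and `waitForSteadyState`, the default timeout and interval, and the exact set of fields compared are assumptions for illustration, not part of the original workaround; the fields themselves (`status`, `desiredCount`, `runningCount`, `pendingCount`, `deployments[].rolloutState`) come from the ECS DescribeServices response.

```javascript
// Hypothetical helper: decides whether one entry of describeServices().services
// looks stable. This is a stricter check than comparing counts alone.
function isSteady (service) {
  const deployments = service.deployments || []
  return (
    service.status === 'ACTIVE' &&
    service.runningCount === service.desiredCount &&
    (service.pendingCount || 0) === 0 &&
    // more than one deployment means a rollout is still replacing tasks
    deployments.length === 1 &&
    // rolloutState is not returned for every deployment type, so tolerate its absence
    (deployments[0].rolloutState === undefined ||
      deployments[0].rolloutState === 'COMPLETED')
  )
}

// Hypothetical helper: polls until every service is steady or the deadline passes.
// `describe` is injected so the sketch does not depend on the AWS SDK; it must
// return a promise for an object shaped like the describeServices response.
async function waitForSteadyState (describe, { timeoutMs = 540000, intervalMs = 15000 } = {}) {
  const deadline = Date.now() + timeoutMs
  for (;;) {
    const { services = [], failures = [] } = await describe()
    if (failures.length > 0) {
      throw new Error(`describeServices reported failures: ${JSON.stringify(failures)}`)
    }
    if (services.length > 0 && services.every(isSteady)) {
      return services
    }
    if (Date.now() + intervalMs > deadline) {
      throw new Error('ECS services did not reach a steady state in time')
    }
    await new Promise(resolve => setTimeout(resolve, intervalMs))
  }
}
```

Inside the handler this could be called as `await waitForSteadyState(() => ecs.describeServices({ services: [serviceName1, serviceName2], cluster: clusterName }).promise())`. The function's configured Lambda timeout has to be longer than `timeoutMs`, and a Lambda invocation cannot run longer than 15 minutes, so slower rollouts would need a different mechanism.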

The CF stack could look like:

Resources:
  WaitHandle:
    Type: "AWS::CloudFormation::WaitConditionHandle"

  WaitCondition:
    Type: "AWS::CloudFormation::WaitCondition"
    DependsOn: LambdaInvoke
    Properties:
      Handle: !Ref WaitHandle
      Timeout: '600' # Timeout to wait for services to reach steady state

  LambdaInvoke:
    Type: "Custom::LambdaInvoke"
    Properties:
      ServiceToken: arn:aws:lambda:region:account-id:function:function-name
      StackId: !Ref "AWS::StackId"

  LambdaExecutionRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "lambda.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Policies:
        - PolicyName: "AllowLambdaExecution"
          PolicyDocument:
            Statement:
              - Effect: "Allow"
                Action:
                  - "logs:CreateLogGroup"
                  - "logs:CreateLogStream"
                  - "logs:PutLogEvents"
                  - "ecs:DescribeServices"
                  - "cloudformation:SignalResource"
                Resource: "*"
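
One detail the template implies but the handler sketch does not show: a Lambda-backed custom resource such as `LambdaInvoke` has to report its own result to CloudFormation by sending a JSON document to the pre-signed URL in `event.ResponseURL`, otherwise the custom resource waits until it times out. A rough sketch of that reply, using only Node's built-in `https` module, could look like this. The helper names `buildResponse` and `sendResponse` and the fallback values are assumptions for illustration; the response fields are the ones CloudFormation documents for custom resources.

```javascript
const https = require('https')

// Hypothetical helper: builds the JSON body CloudFormation expects from a
// custom resource. Note the status values are 'SUCCESS' and 'FAILED' here.
function buildResponse (event, context, status, reason) {
  return JSON.stringify({
    Status: status,
    Reason: reason || `See CloudWatch log stream: ${context.logStreamName}`,
    // Reuse the existing id on Update/Delete; fall back to the log stream name on Create
    PhysicalResourceId: event.PhysicalResourceId || context.logStreamName,
    StackId: event.StackId,
    RequestId: event.RequestId,
    LogicalResourceId: event.LogicalResourceId
  })
}

// Hypothetical helper: PUTs the body to the pre-signed URL carried in the event
// and resolves with the HTTP status code of the reply.
function sendResponse (event, body) {
  return new Promise((resolve, reject) => {
    const request = https.request(event.ResponseURL, {
      method: 'PUT',
      // an empty content-type mirrors what the cfn-response module sends
      headers: { 'content-type': '', 'content-length': Buffer.byteLength(body) }
    }, response => {
      response.resume() // drain the body so the 'end' event fires
      response.on('end', () => resolve(response.statusCode))
    })
    request.on('error', reject)
    request.end(body)
  })
}
```

In the handler, `await sendResponse(event, buildResponse(event, context, 'SUCCESS'))` would go at the end of the success path and the `'FAILED'` variant in the `catch` block. The handler should also answer `Delete` requests (`event.RequestType === 'Delete'`) with `'SUCCESS'`, so that stack deletion does not hang on the custom resource.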