Open emspoars opened 2 years ago
We have hit this problem a few times recently. Updates seem to be fine, but on initial creation CloudFormation doesn't wait for the services to become healthy.
We have experienced this as well, and would very much appreciate support for this case. Native support seems necessary to avoid problematic stack deployments that falsely signal success.
I hope I'm not littering the issue by providing a workaround for people who may be looking for one in the meantime. I had to spin up a monitoring Lambda on the side and hook it into the stack using stack wait conditions in order to roll back in case the services don't reach a steady state. Something along the lines of:
const AWS = require('aws-sdk')
const ecs = new AWS.ECS()
const cfn = new AWS.CloudFormation()

exports.handler = async (event, context) => {
  const stackName = event.StackId.split('/')[1]
  const serviceName1 = 'ecs-service-1'
  const serviceName2 = 'ecs-service-2'
  const clusterName = 'ecs-cluster-name'

  try {
    const services = await ecs.describeServices({
      services: [serviceName1, serviceName2],
      cluster: clusterName
    }).promise()

    for (const service of services.services) {
      if (service.status !== 'ACTIVE' || service.desiredCount !== service.runningCount) {
        throw new Error(`ECS service "${service.serviceName}" is still not in a steady state`)
      }
    }

    // If all services are in a steady state, signal success to the WaitCondition
    await cfn.signalResource({
      StackName: stackName,
      LogicalResourceId: 'WaitHandle',
      UniqueId: context.invokedFunctionArn,
      Status: 'SUCCESS'
    }).promise()
  } catch (error) {
    // If a service is not in a steady state, signal failure to the WaitCondition
    await cfn.signalResource({
      StackName: stackName,
      LogicalResourceId: 'WaitHandle',
      UniqueId: context.invokedFunctionArn,
      Status: 'FAILURE',
      Reason: error.message
    }).promise()
  }
}
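Note that the handler above inspects the services only once, so a service that is still starting would be reported as a failure right away. A rough sketch of polling until the services settle or a deadline passes could look like the following. The names `isSteady`, `waitForSteadyState`, and the injected `describe` function are hypothetical helpers, not part of the AWS SDK, and the default interval and timeout are assumptions chosen to stay under the 600-second wait condition timeout.

```javascript
// Sketch only: `describe` is a hypothetical injected function that resolves
// to the `services` array of an ECS DescribeServices response.
// In the handler above it could be:
//   async () => (await ecs.describeServices({ services: [...], cluster }).promise()).services

// Same steady-state condition as the original snippet.
const isSteady = (service) =>
  service.status === 'ACTIVE' && service.desiredCount === service.runningCount

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms))

async function waitForSteadyState (describe, { intervalMs = 15000, timeoutMs = 540000 } = {}) {
  const deadline = Date.now() + timeoutMs
  for (;;) {
    const services = await describe()
    const pending = services.filter((s) => !isSteady(s))
    if (pending.length === 0) return
    // Give up if another sleep would take us past the deadline.
    if (Date.now() + intervalMs > deadline) {
      const names = pending.map((s) => s.serviceName).join(', ')
      throw new Error(`ECS services not in a steady state before timeout: ${names}`)
    }
    await sleep(intervalMs)
  }
}
```

The Lambda function's own timeout would need to be at least as long as `timeoutMs` (Lambda allows up to 15 minutes), otherwise the function is killed before it can signal anything.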
The CF stack could look like:
Resources:
  WaitHandle:
    Type: "AWS::CloudFormation::WaitConditionHandle"
  WaitCondition:
    Type: "AWS::CloudFormation::WaitCondition"
    DependsOn: LambdaInvoke
    Properties:
      Handle: !Ref WaitHandle
      Timeout: '600' # Timeout to wait for services to reach steady state
  LambdaInvoke:
    Type: "Custom::LambdaInvoke"
    Properties:
      ServiceToken: arn:aws:lambda:region:account-id:function:function-name
      StackId: !Ref "AWS::StackId"
  LambdaExecutionRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: "Allow"
            Principal:
              Service:
                - "lambda.amazonaws.com"
            Action:
              - "sts:AssumeRole"
      Policies:
        - PolicyName: "AllowLambdaExecution"
          PolicyDocument:
            Statement:
              - Effect: "Allow"
                Action:
                  - "logs:CreateLogGroup"
                  - "logs:CreateLogStream"
                  - "logs:PutLogEvents"
                  - "ecs:DescribeServices"
                  - "cloudformation:SignalResource"
                Resource: "*"
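One thing the snippets above leave out: a Lambda-backed custom resource such as `Custom::LambdaInvoke` is expected to report back to CloudFormation by sending a JSON document with an HTTP PUT to the pre-signed `event.ResponseURL`. Otherwise the custom resource itself sits in progress until it times out. A minimal sketch of that step, assuming plain Node.js `https` and a made-up physical resource ID (`buildCustomResourceResponse` and `sendCustomResourceResponse` are hypothetical helper names):

```javascript
const https = require('https')

// Builds the JSON body CloudFormation expects from a custom resource handler.
function buildCustomResourceResponse (event, status, reason) {
  return JSON.stringify({
    Status: status, // 'SUCCESS' or 'FAILED'
    Reason: reason || 'See the function logs in CloudWatch Logs',
    // Hypothetical ID scheme; reuse the existing ID on Update/Delete requests.
    PhysicalResourceId: event.PhysicalResourceId || `steady-state-check-${event.RequestId}`,
    StackId: event.StackId,
    RequestId: event.RequestId,
    LogicalResourceId: event.LogicalResourceId
  })
}

// Sends the response to the pre-signed URL that CloudFormation put in the event.
function sendCustomResourceResponse (event, status, reason) {
  const body = buildCustomResourceResponse(event, status, reason)
  const url = new URL(event.ResponseURL)
  return new Promise((resolve, reject) => {
    const req = https.request({
      hostname: url.hostname,
      path: url.pathname + url.search,
      method: 'PUT',
      headers: { 'content-type': '', 'content-length': Buffer.byteLength(body) }
    }, (res) => {
      res.resume() // drain the response so 'end' fires
      res.on('end', resolve)
    })
    req.on('error', reject)
    req.end(body)
  })
}
```

The handler would call `await sendCustomResourceResponse(event, 'SUCCESS')` before returning, and should answer `Delete` requests with `SUCCESS` without running the ECS checks so that stack deletion is not blocked.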
Name of the resource
AWS::ECS::Service
Resource Name
No response
Issue Description
CloudFormation, when used to deploy an ECS service, does not always wait for the service to reach a steady state before giving the resource the CREATE_COMPLETE/UPDATE_COMPLETE status.
If an invalid image is specified in the corresponding task, the result is as expected: the service does not reach a steady state and the resource creation/update fails.
If, however, the image exists but the service does not reach a steady state because of an ECS health check, CloudFormation simply considers the resource OK and allows it to reach CREATE_COMPLETE/UPDATE_COMPLETE status, even before ECS gives up starting the service.
Here are other relevant issues about this:
https://github.com/aws/containers-roadmap/issues/897 This was about the fact that ECS services reached steady state before the container health check passed. This was fixed a few years ago (I've confirmed by testing today).
https://github.com/aws-cloudformation/cloudformation-coverage-roadmap/issues/150 This is about the fact that CloudFormation does not recognize a failed ECS deployment, instead timing out after 3 hours. This issue is still open (and I've confirmed by testing that the problem still exists).
Expected Behavior
CloudFormation, when used to deploy an ECS service, should wait for the service to reach a steady state before giving the resource a CREATE_COMPLETE or UPDATE_COMPLETE status.
Observed Behavior
CloudFormation, when used to deploy an ECS service, gives the resource a CREATE_COMPLETE/UPDATE_COMPLETE status before the service reaches a steady state.
Test Cases
The resource in CloudFormation will reach the CREATE_COMPLETE/UPDATE_COMPLETE status, but the ECS service will not stabilize.
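To make the test case checkable from a script, the mismatch can be expressed as a small predicate over the stack resource status and one entry of the `services` array from ECS DescribeServices. This is only a sketch: `isFalseSuccess` is a hypothetical helper, and "steady" here is simplified to an active service whose running count matches its desired count.

```javascript
// Returns true when CloudFormation reports the resource as complete
// while the ECS service has not actually stabilized.
function isFalseSuccess (resourceStatus, service) {
  const reportedComplete = ['CREATE_COMPLETE', 'UPDATE_COMPLETE'].includes(resourceStatus)
  const steady = service.status === 'ACTIVE' && service.desiredCount === service.runningCount
  return reportedComplete && !steady
}
```

For the scenario in this issue, the resource status would be CREATE_COMPLETE while `runningCount` stays below `desiredCount`, so the predicate would return `true`.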
Other Details
No response