lamontadams opened 5 months ago
After some more testing, this absolutely has something to do with deploying when there's a new ECR image waiting to be picked up by the task. With some tweaks to health check grace period, I can deploy all day long with no issue, but as soon as a new container image is waiting everything goes bonkers and I have to trash the stack and scratch deploy to recover.
This is extremely frustrating, would love to have a workaround.
(Just saw this and maybe I'll give a helping hand - since I had a very similar issue with 10GB images)
You probably have a large container image that takes a long time to provision (download from ECR) and health checks that are too short. Check the ECS logs and the Service Events tab; that could shed some light as well.
Thanks for this - in this case these images are relatively small, 200-300MB. I seem to recall seeing log output indicating that they start successfully but I'll pay attention the next time I try this. Like I said in the bug report, the events tab just shows an endlessly repeating cycle of start, unhealthy, stop, de-register.
I ground away on this all day yesterday, and part of my problem seems to be that the defaults are a little asinine. By default, the deployment circuit breaker is disabled and the minHealthyPercent value appears to be 100, which seems to me like a recipe for a deadlocked deployment any time you have desiredCount > 1.
I turned on the circuit breaker, set a generous grace period, and minHealthyPercent to 50:
```ts
circuitBreaker: {
  enable: true,
  rollback: true,
},
desiredCount: 2,
healthCheckGracePeriod: Duration.minutes(5),
minHealthyPercent: 50,
```
And the situation is a little better - the circuit breaker did detect a deadlocked deployment and cancelled it... after 4 hours. At least the stack isn't stuck in an endless update, I guess?
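For reference, here's roughly where those settings live in my stack (a simplified sketch with placeholder names, VPC/cluster setup, and image reference, not my exact code):

```ts
import { Duration, Stack, StackProps } from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecr from 'aws-cdk-lib/aws-ecr';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';
import { Construct } from 'constructs';

export class ApiServiceStack extends Stack {
  constructor(scope: Construct, id: string, props?: StackProps) {
    super(scope, id, props);

    const vpc = new ec2.Vpc(this, 'Vpc', { maxAzs: 2 });
    const cluster = new ecs.Cluster(this, 'Cluster', { vpc });
    const repo = ecr.Repository.fromRepositoryName(this, 'Repo', 'my-app'); // placeholder repo name

    new ecsPatterns.ApplicationLoadBalancedFargateService(this, 'Service', {
      cluster,
      desiredCount: 2,
      minHealthyPercent: 50,
      healthCheckGracePeriod: Duration.minutes(5),
      circuitBreaker: { enable: true, rollback: true },
      taskImageOptions: {
        image: ecs.ContainerImage.fromEcrRepository(repo, 'latest'), // placeholder tag
        containerPort: 8080,
      },
    });
  }
}
```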
My last gasp here is experimenting with just deploying a dummy "hello world" image to get the infrastructure set, and pushing actual image updates in response to git pushes via a CLI script. Which is, frankly, precisely the kind of situation I look to CDK to help me avoid.
If that doesn't work then I'll give up and look for some canned terraform.
Edit to add, FWIW, I have a working cluster that was hand-configured and the images I'm deploying here work fine there, so this doesn't feel like an image problem.
This just seems to be broken and unusable for me.
If I build, push, and tag an image to ECR and then force a deployment via `aws ecs update-service --force-new-deployment`, the service updates normally and is stable. I can watch the container start and see it answering health checks in the ECS Service logs in the console.
If, however, I use ApplicationLoadBalancedFargateService to force a deployment on the same existing service - either by supplying a different ECR tag or by forcing a new task definition through modified environment variables - the deployment reliably hangs and triggers the circuit breaker (now that I've enabled it; I still think the default-disable behavior is silly). In this case, I never see the container start in the ECS Service logs, which is really wild because it's the same image.
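To make that concrete, the CDK-side change is just the tag fed into taskImageOptions - roughly this fragment (placeholder names, not my exact code):

```ts
taskImageOptions: {
  // changing `imageTag` here (e.g. to a new git SHA) yields a new task definition
  // revision and a new ECS deployment on the next cdk deploy
  image: ecs.ContainerImage.fromEcrRepository(repo, imageTag),
  containerPort: 8080,
},
```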
Hi
Let me explain a little bit about this.
CDK deploys ECS services via CloudFormation (CFN for short). In CFN, an ECS service deployment has to reach a stable state before the stack can enter the CREATE_COMPLETE or UPDATE_COMPLETE state, which is by design from CFN. What's happening under the hood is that CFN has to make sure:
- every task in the service reaches the RUNNING state, and
- the service deployment stabilizes, before the stack moves to CREATE_COMPLETE or UPDATE_COMPLETE.
With the AWS CLI, when you run `aws ecs update-service --force-new-deployment`, the CLI returns immediately without checking whether the service has completed its rolling update or whether all health checks have passed. That being said, the operations behind the scenes are totally different.
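If you want to reproduce outside CFN the kind of wait it performs, you can poll for service stability yourself, for example with the ECS waiter in the AWS SDK for JavaScript v3 (a rough sketch; cluster and service names are placeholders):

```ts
import { ECSClient, waitUntilServicesStable } from '@aws-sdk/client-ecs';

async function waitForServiceStable(): Promise<void> {
  const client = new ECSClient({});
  // Polls DescribeServices until the deployment has settled (roughly the condition
  // CFN waits for); a plain `update-service --force-new-deployment` does not wait.
  await waitUntilServicesStable(
    { client, maxWaitTime: 900 }, // seconds
    { cluster: 'my-cluster', services: ['my-service'] } // placeholder names
  );
}
```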
It looks like your initial deployment is good and it only fails when you update the existing deployment?
I would like to know:
- Before you `cdk deploy` to update your existing, successful initial deployment, can you share your `cdk diff` output so we can see what would be changed?
- After you update your deployment, does the AWS::ECS::Service stay in UPDATE_IN_PROGRESS status? If you go to the ECS console to view the service, can you tell whether the tasks have completed their health checks? Are you seeing them being terminated and recreated due to failed health checks or for any other reason?
- Are you able to see your container logs in CloudWatch Logs? Was the application in your container running successfully, or did it exit for some unexpected reason? Bad or failed command execution can result in failing health checks. Sometimes the health checks need a longer grace period before the first check runs, because the container may need to pull a large image or the application may take longer to start before it is ready to serve traffic. You will need to observe its logs and the activities/events in the ECS console to determine the root cause.
Try to simplify your ECS service deployment without the circuit breaker or any other unnecessary features. This will help you simplify your CDK design and focus on what really matters to ensure the core functionality works.
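One concrete knob to check is the ALB target group health check that the pattern creates, since that is what marks tasks unhealthy; it can be loosened on the construct, roughly like this (a sketch; the path and thresholds are placeholders):

```ts
import { Duration } from 'aws-cdk-lib';

// `service` is the ApplicationLoadBalancedFargateService instance from your stack.
service.targetGroup.configureHealthCheck({
  path: '/health',                // placeholder health endpoint
  interval: Duration.seconds(30),
  timeout: Duration.seconds(10),
  healthyThresholdCount: 2,
  unhealthyThresholdCount: 5,     // tolerate more failures during slow startups
  healthyHttpCodes: '200-399',
});
```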
Hope it helps!
Hi, and thanks for the reply.
I understand there's some very complex interaction between CDK and CFN and ECS and that both of the latter are by themselves extremely complex systems. I have kind of moved on here since I was not able to get deployments to work reliably. I'm now just using cdk to do initial environment setup, and using ecs cli commands to do all subsequent task updates. Which is far from ideal, but works.
I believe I have narrowed things down to:
If I `cdk deploy` a new stack using ecsPatterns.ApplicationLoadBalancedFargateService (so we're creating a new ECS cluster and all its supporting stuff) which references an image tag already pushed to ECR, the deployment succeeds, the ECS services all start successfully, and the CFN stack ends in CREATE_COMPLETE.
If I then modify anything which would cause a new task definition to be created (e.g. change one of the task definition environment values via taskImageOptions.environment) - EDIT TO ADD (crucially, I think): using the same image and tag - then a subsequent deployment will trigger the circuit breaker (if it's been explicitly enabled, see below) and the update will fail. I have not done a `cdk diff` here, but I'm confident from comparing synth output that this is all that's changed.
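For concreteness, this is the shape of change I mean - same image, same tag, just a different environment value (simplified, not my exact code):

```ts
taskImageOptions: {
  image: ecs.ContainerImage.fromEcrRepository(repo, 'v1.2.3'), // unchanged image and tag
  environment: {
    // flipping this value is enough to create a new task definition revision,
    // and the subsequent deploy is the one that trips the circuit breaker
    SOME_SETTING: 'new-value',
  },
},
```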
In both situations, I have been able to see output from the running task images in both the ECS and CloudWatch logs indicating to me that they have started, and they appear to be running before what I understand (through experimentation) to be the controlling metric - the health check grace period - has elapsed.
The default circuitBreaker.enable value should be true, because in the above situation, with the current default of false, it is in my experience very easy (indeed almost guaranteed) to wind up with an ECS update that never finishes (it stays locked in a cycle of restarting new tasks) and a CFN stack that therefore remains stuck in UPDATE_IN_PROGRESS long-term (5+ hours). The only way I found to resolve this situation is to manually intervene in the console: delete the ECS services and cluster, cancel the CFN stack update, and then destroy the stack. That's a terrible user experience, and frankly, if this were my first attempt at provisioning infrastructure via CDK (I am, in fact, very successfully using it to manage a large cloud-native platform), I would have put it down, walked away, and never looked back.
Describe the bug
Initial deployments using ApplicationLoadBalancedFargateService from ecs-patterns complete successfully and produce working, healthy, reachable services. All subsequent deployments fail with a repeating series of events: tasks start, fail their health checks, and are stopped and deregistered, over and over.
The situation does not resolve itself within 6 hours.
If a user cancels the cdk deployment script, then:
However, of course the changes in the stack update haven't been applied.
Have reproduced in the following conditions: CDK CLI 2.139.1 and 2.147.2, Node 18 and 21, on local Ubuntu and on a GitHub workflow runner image.
This is pretty severe and it's preventing us from using CDK to manage any ECS infrastructure at all.
Expected Behavior
The CF stack should update successfully on subsequent deployments, and ECS service updates should happen only when they are necessary. Based on my testing and experimentation, I'm seeing ECS updates being made when nothing about the service has changed in my code, which is confusing at best.
Current Behavior
As above. Deployments subsequent to the first fail with a hung "UPDATE_IN_PROGRESS" stack, apparently because ECS health checks are failing. Interestingly, this occurs even when the changes do not touch any ECS services or tasks - just unrelated changes in the same stack, like an SSM parameter rename or value change.
Reproduction Steps
I'm using CDK through a wrapper package that supplies a bunch of boilerplate for consistent naming and whatnot. Happy to provide more info.
Sample reproduction code (typescript):
Sample CF template:
Possible Solution
No response
Additional Information/Context
Open to alternative suggestions or workarounds. Landed on ecs-patterns because it was the quickest way to get a service up and running from scratch, not married to it.
CDK CLI Version
2.139.1 (and also 2.147.2)
Framework Version
No response
Node.js Version
18 and 21
OS
Linux Ubuntu (real and github workflow runner image)
Language
TypeScript
Language Version
5.0.4 and 5.5.3
Other information
No response