lamontadams opened 4 days ago
After some more testing, this absolutely has something to do with deploying when there's a new ECR image waiting to be picked up by the task. With some tweaks to health check grace period, I can deploy all day long with no issue, but as soon as a new container image is waiting everything goes bonkers and I have to trash the stack and scratch deploy to recover.
This is extremely frustrating, would love to have a workaround.
(Just saw this and maybe I'll give a helping hand - since I had a very similar issue with 10GB images)
You probably have a large container image that takes long to provision (download from ECR) and too short healthchecks. Check ECS logs and Service Events tab, that could shed some light as well.
Thanks for this - in this case these images are relatively small, 200-300MB. I seem to recall seeing log output indicating that they start successfully but I'll pay attention the next time I try this. Like I said in the bug report, the events tab just shows an endlessly repeating cycle of start, unhealthy, stop, de-register.
I ground away on this all day yesterday, and part of my problem seems to be that the defaults are a little asinine. By default, the deployment circuit breaker is disabled and minHealthyPercent appears to be 100, which seems to me like a recipe for a deadlocked deployment any time you have desiredCount > 1: ECS can never stop an old task to make room, so failing replacement tasks just cycle forever.
I turned on the circuit breaker, set a generous grace period, and minHealthyPercent to 50:
circuitBreaker: {
  enable: true,
  rollback: true,
},
desiredCount: 2,
healthCheckGracePeriod: Duration.minutes(5),
minHealthyPercent: 50,
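For anyone landing here, this is roughly how those props sit on the construct. A minimal sketch only: the cluster wiring, image name, and construct IDs below are placeholders, not from my actual stack.

```typescript
import { Duration, Stack } from 'aws-cdk-lib';
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ecsPatterns from 'aws-cdk-lib/aws-ecs-patterns';

// Assumed to exist elsewhere in the app; shown here only for context.
declare const stack: Stack;
declare const cluster: ecs.ICluster;

new ecsPatterns.ApplicationLoadBalancedFargateService(stack, 'Service', {
  cluster,
  taskImageOptions: {
    // Placeholder image reference.
    image: ecs.ContainerImage.fromRegistry('my-repo/my-image:latest'),
  },
  desiredCount: 2,
  // Let the rollout dip below full capacity so it can't deadlock.
  minHealthyPercent: 50,
  // Give slow-starting containers time before ALB health checks count.
  healthCheckGracePeriod: Duration.minutes(5),
  // Detect a stuck deployment and roll back instead of hanging the stack.
  circuitBreaker: { enable: true, rollback: true },
});
```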
And the situation is a little better - the circuit breaker did detect a deadlocked deployment and cancelled it... after 4 hours. At least the stack isn't stuck in an endless update, I guess?
My last gasp here is experimenting with just deploying a dummy "hello world" image to get the infrastructure set, and pushing actual image updates in response to git pushes via a CLI script. Which is, frankly, precisely the kind of situation I look to CDK to help me avoid.
If that doesn't work then I'll give up and look for some canned terraform.
Edit to add, FWIW, I have a working cluster that was hand-configured and the images I'm deploying here work fine there, so this doesn't feel like an image problem.
This just seems to be broken and unusable for me.
If I build, push, and tag an image to ECR and then force a deployment via `aws ecs update-service --force-new-deployment`, the service updates normally and is stable. I can watch the container start and see it answering health checks in the ECS service logs in the console.
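For concreteness, the manual flow that works for me looks roughly like this. All names (account, region, repo, cluster, service) are placeholders; note that `--force-new-deployment` only picks up a new image if the task definition references a mutable tag such as `:latest`.

```shell
# Placeholders: substitute your own account, region, repo, cluster, service.
REPO=123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app

# Authenticate docker against ECR, then build and push the image.
aws ecr get-login-password --region us-east-1 \
  | docker login --username AWS --password-stdin "$REPO"
docker build -t "$REPO:latest" .
docker push "$REPO:latest"

# Restart the service's tasks so they pull the freshly pushed image.
aws ecs update-service \
  --cluster my-cluster \
  --service my-service \
  --force-new-deployment
```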
If, however, I use ApplicationLoadBalancedFargateService to force a deployment on the same existing service, either by supplying a different ECR tag or by forcing a new task definition through modified environment variables, the deployment reliably hangs and triggers the circuit breaker (now that I've enabled it; I still think the default-disable behavior is silly). In this case I never see the container start in the ECS service logs, which is really wild because it's the same image.
Describe the bug
Initial deployments using ApplicationLoadBalancedFargateService from ecs-patterns complete successfully and generate working, healthy, reachable services. All subsequent deployments fail with a particular series of events:
The situation does not resolve itself, even after 6 hours.
If a user cancels the cdk deployment script, then:
However, of course the changes in the stack update haven't been applied.
I have reproduced this under the following conditions:
This is pretty severe and it's preventing us from using CDK to manage any ECS infrastructure at all.
Expected Behavior
The CloudFormation stack should update successfully on subsequent deployments, and ECS service updates should happen only when they are necessary. Based on my testing and experimentation, I'm seeing ECS updates being made when nothing about the service has changed in my code, which is confusing at best.
Current Behavior
As above. Deployments subsequent to the first fail with a stack hung in "UPDATE_IN_PROGRESS", apparently because ECS health checks are failing. Interestingly, this occurs even when the changes don't touch any ECS services or tasks at all: just unrelated changes in the same stack, such as an SSM parameter rename or value change.
Reproduction Steps
I'm using CDK through a wrapper package that supplies a bunch of boilerplate for consistent naming and whatnot. Happy to provide more info.
Sample reproduction code (typescript):
Sample CF template:
Possible Solution
No response
Additional Information/Context
Open to alternative suggestions or workarounds. Landed on ecs-patterns because it was the quickest way to get a service up and running from scratch, not married to it.
CDK CLI Version
2.139.1 (and also 2.147.2)
Framework Version
No response
Node.js Version
18 and 21
OS
Linux Ubuntu (real and github workflow runner image)
Language
TypeScript
Language Version
5.0.4 and 5.5.3
Other information
No response