Open bfox1793 opened 2 years ago
Digging around a bit more, it seems like it considers the deployment successful once the rollback removes the attempted deployment from ecs-deploy. Not sure where that lives in the code, but a possible fix might be logic that says "if AWS returns a deployment does not exist / is deleted response, consider the deployment failed"
@fabfuel any thoughts on the above? I can provide more context if it's helpful.
Hi Brett, sorry for the late response.
In March I added support for the ECS Circuit Breaker and added parsing of the rollout state. This means, ecs-deploy should recognize, if the deployment was rolled back.
Could you share some more details, especially about the timing. How long did you ecs-deploy instruct to wait for the deployment to be finished?
Best Fabian
Hi Fabian!
I have the timeout set to 600, but it's not hitting the timeout. Here's what it looks like I'm seeing:
Deployment successful
and de-registers the active revisionHere's a SS of my console output:
Below is a tidbit from the ECS console, one showing the failed deployment, and the other showing the active running tasks
Happy to provide any additional information if it's helpful! The built-in rollback
functionality seems to work as-intended though (if the circuit breaker is disabled), so that's good news!
Brett
Hi Brett, thanks for all the details! I'm trying to reproduce it myself and keep you updated. Do you remember what the issue was, which stopped the container to spin up properly? Was it a configuration error (on the ECS/Docker level) or an issue with the application (inside the container)? ECS makes (or at least made, I need to check) a difference about how/why a container failed to start and how the circuit breaker kicks in.
Best Fabian
Hey Fabian!
Appreciate you digging into this! The issue was that the container was missing a required env var, which caused it to fail on start-up. However, the container isn't exiting properly, so eventually it just times-out and never passes ALB healthchecks. This causes the deployment to continue recycling containers until the circuit breaker is tripped.
For added color, when I tested this with circuit breaker turned on, eventually ecs deploy would timeout (since the deploy was never successful). If I turned on the --rollback flag, it would successfully roll back. If I didn't, the deploy would continue indefinitely, spinning up new containers and then recycling them after the ALB healthcheck fails.
Let me know if any other info would be helpful!
Brett
Hi @fabfuel ! I hope you had a great new years!
I was wondering if there were any updates on the above? We'd like to enable circuit breaker support, but we're hitting a big of a snag because of the issue described above.
Thanks!
Brett
Hi Brett,
thanks a lot, I also wish you a happy and healthy new year!
I still need to dig into this a bit more and double check, if it might be related to this ECS behavior: https://github.com/aws/containers-roadmap/issues/1206 There are cases, when the ECS Circuit Breaker does not correctly recognize a failing container.
Here is described & a screenshot, how ecs-deploy
behaves in my test case in conjunction with the ECS circuit breaker. It should recognize it and through an error: https://github.com/fabfuel/ecs-deploy/issues/161#issuecomment-789676140
In your case, what happened to the container, due to the missing env var? Did the app crash or exit with exit code 1 or something similar?
Thanks for the details and sorry for moving slow, busy times 😅
Best Fabian
Hi there,
I'm facing a similar issue when ecs deploy ...
marks the deployment as successful even though it wasn't the case. I also have the circuit breaker enabled with the rollback.
The container fails due to a permission issue just after entering the RUNNING
state. The new deployment is rolled back after some tasks fail to start.
Is there anything I can do to move this forward? Happy to submit a PR
During what seems to be an edge case, I also got this error: same configuration as well
Hi @mohamed-haidara-cko,
thanks for the details. I will look into the Python error.
Could you share your task definition (JSON)?
Thanks Fabian
Hi @fabfuel
Completely missed your message. Unfortunately, this specific task definition was deleted. Is there anything else I can do to help? I try to replicate the issue on our side as well.
Is there current support for tracking when a deployment fails and the ECS deployment breaker rolls it back? I just tried forcefully failing a deployment, and while ECS correctly triggered the circuit breaker and rolled back, ecs deploy considered it as a successful deployment.
Not sure if this is a bug or working as intended.