ECS Deployment Circuit Breaker Support

bfox1793 commented 2 years ago

Is there current support for tracking when a deployment fails and the ECS deployment breaker rolls it back? I just tried forcefully failing a deployment, and while ECS correctly triggered the circuit breaker and rolled back, ecs deploy considered it as a successful deployment.

Not sure if this is a bug or working as intended.

bfox1793 commented 2 years ago

Digging around a bit more, it seems like it considers the deployment successful once the rollback removes the attempted deployment from ecs-deploy. Not sure where that lives in the code, but a possible fix might be logic that says "if AWS returns a deployment does not exist / is deleted response, consider the deployment failed"

bfox1793 commented 2 years ago

@fabfuel any thoughts on the above? I can provide more context if it's helpful.

fabfuel commented 2 years ago

Hi Brett, sorry for the late response.

In March I added support for the ECS Circuit Breaker and added parsing of the rollout state. This means, ecs-deploy should recognize, if the deployment was rolled back.

Could you share some more details, especially about the timing. How long did you ecs-deploy instruct to wait for the deployment to be finished?

Best Fabian

bfox1793 commented 2 years ago

Hi Fabian!

I have the timeout set to 600, but it's not hitting the timeout. Here's what it looks like I'm seeing:

Deployment starts
Tasks fail to start up
ECS marks the deployment as failed, and thus triggers a rollback
ecs deploy CLI notes Deployment successful and de-registers the active revision
ECS console reflects old task is still running, but now it's deregistered

Here's a SS of my console output:

Below is a tidbit from the ECS console, one showing the failed deployment, and the other showing the active running tasks

Happy to provide any additional information if it's helpful! The built-in rollback functionality seems to work as-intended though (if the circuit breaker is disabled), so that's good news!

Brett

fabfuel commented 2 years ago

Hi Brett, thanks for all the details! I'm trying to reproduce it myself and keep you updated. Do you remember what the issue was, which stopped the container to spin up properly? Was it a configuration error (on the ECS/Docker level) or an issue with the application (inside the container)? ECS makes (or at least made, I need to check) a difference about how/why a container failed to start and how the circuit breaker kicks in.

Best Fabian

bfox1793 commented 2 years ago

Hey Fabian!

Appreciate you digging into this! The issue was that the container was missing a required env var, which caused it to fail on start-up. However, the container isn't exiting properly, so eventually it just times-out and never passes ALB healthchecks. This causes the deployment to continue recycling containers until the circuit breaker is tripped.

For added color, when I tested this with circuit breaker turned on, eventually ecs deploy would timeout (since the deploy was never successful). If I turned on the --rollback flag, it would successfully roll back. If I didn't, the deploy would continue indefinitely, spinning up new containers and then recycling them after the ALB healthcheck fails.

Let me know if any other info would be helpful!

Brett

bfox1793 commented 2 years ago

Hi @fabfuel ! I hope you had a great new years!

I was wondering if there were any updates on the above? We'd like to enable circuit breaker support, but we're hitting a big of a snag because of the issue described above.

Thanks!

Brett

fabfuel commented 2 years ago

Hi Brett,

thanks a lot, I also wish you a happy and healthy new year!

I still need to dig into this a bit more and double check, if it might be related to this ECS behavior: https://github.com/aws/containers-roadmap/issues/1206 There are cases, when the ECS Circuit Breaker does not correctly recognize a failing container.

Here is described & a screenshot, how ecs-deploy behaves in my test case in conjunction with the ECS circuit breaker. It should recognize it and through an error: https://github.com/fabfuel/ecs-deploy/issues/161#issuecomment-789676140

In your case, what happened to the container, due to the missing env var? Did the app crash or exit with exit code 1 or something similar?

Thanks for the details and sorry for moving slow, busy times 😅

Best Fabian

mohamed-haidara-cko commented 6 months ago

Hi there,

I'm facing a similar issue when ecs deploy ... marks the deployment as successful even though it wasn't the case. I also have the circuit breaker enabled with the rollback.

The container fails due to a permission issue just after entering the RUNNING state. The new deployment is rolled back after some tasks fail to start.

Is there anything I can do to move this forward? Happy to submit a PR

During what seems to be an edge case, I also got this error: same configuration as well

fabfuel commented 6 months ago

Hi @mohamed-haidara-cko,

thanks for the details. I will look into the Python error.

Could you share your task definition (JSON)?

Thanks Fabian

mohamed-haidara-cko commented 4 months ago

Hi @fabfuel

Completely missed your message. Unfortunately, this specific task definition was deleted. Is there anything else I can do to help? I try to replicate the issue on our side as well.

fabfuel / ecs-deploy

ECS Deployment Circuit Breaker Support #185