fabfuel / ecs-deploy

Powerful CLI tool to simplify Amazon ECS deployments, rollbacks & scaling

ECS circuit breaker (question) #161

Open stadskle opened 3 years ago

stadskle commented 3 years ago

Has anyone tested this tool with the new circuit breaker?

https://aws.amazon.com/blogs/containers/announcing-amazon-ecs-deployment-circuit-breaker/

It is on our roadmap, but we have not managed to do it yet, so I just wanted to hear whether anyone has tried. I suspect it might require some changes in the deploy script to ensure that ecs-deploy correctly reports a failed deployment if ECS aborts it.

fabfuel commented 3 years ago

Hi @stadskle,

I have not tested a deployment with ecs-deploy and an active circuit breaker configuration yet.

My gut feeling is that it should not cause problems: ecs-deploy only fetches the deployment state from ECS and waits until it changes to Completed. If that never happens, ecs-deploy reports the deployment as failed. I will follow up and share my findings here.
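To make the waiting behaviour described above concrete, here is a minimal sketch of a poll-until-completed loop with a timeout. It is not the actual ecs-deploy code: `wait_for_completion`, `DeploymentTimeout`, and the `is_completed` callback are hypothetical names, and the callback stands in for whatever check is run against ECS.

```python
import time
from typing import Callable


class DeploymentTimeout(Exception):
    """Raised when the deployment did not complete within the timeout."""


def wait_for_completion(
    is_completed: Callable[[], bool],
    timeout: float = 300.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll is_completed() until it returns True or the timeout elapses.

    is_completed is a hypothetical callback; in a real tool it would
    query ECS for the current deployment state.
    """
    deadline = time.monotonic() + timeout
    while True:
        if is_completed():
            return
        if time.monotonic() >= deadline:
            # A deployment that never completes surfaces only as a timeout.
            raise DeploymentTimeout(f"deployment not completed after {timeout}s")
        sleep(interval)
```

With a loop of this shape, a deployment aborted by the circuit breaker looks the same as one that is merely slow: both end in a timeout.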

Best, Fabian

stadskle commented 3 years ago

Yes, I guess ecs-deploy will time out and report the deployment as failed, as it does today. But it will not report Failed at the moment the breaker kicks in, I guess? Here is an example from the blog post:

[Image: example from the blog post]

fabfuel commented 3 years ago

Yes, for now only a timeout would be reported if the whole process takes too long. Currently, the check of the Deployment entity is not as explicit as it could be. I'll look into this more deeply, with the goal of reporting explicitly when the deployment has failed, regardless of whether the circuit breaker is enabled, and at the time the original deployment fails (before an optional rollback).
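For context, the circuit breaker and its optional rollback are enabled per service through the ECS deployment configuration. A minimal sketch of that fragment follows; the helper name `circuit_breaker_config` is hypothetical, and the cluster and service names in the commented boto3 call are placeholders.

```python
def circuit_breaker_config(enable: bool = True, rollback: bool = False) -> dict:
    """Build the deploymentConfiguration fragment for the ECS circuit breaker.

    rollback=True asks ECS to roll back to the last completed deployment
    once the breaker trips.
    """
    return {"deploymentCircuitBreaker": {"enable": enable, "rollback": rollback}}


# Usage with boto3 (not executed here; cluster and service names are placeholders):
#
#   import boto3
#   boto3.client("ecs").update_service(
#       cluster="my-cluster",
#       service="my-service",
#       deploymentConfiguration=circuit_breaker_config(rollback=True),
#   )
```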

During my tests I discovered that this new feature does not cover all cases of failing containers. For example, if you specify an invalid Docker CMD (which is what I did for a quick fix), the circuit breaker does not catch it, and the deployment will still retry forever.

This is a known limitation, covered in this issue: https://github.com/aws/containers-roadmap/issues/1206

Best, Fabian

fabfuel commented 3 years ago

The deployment check now uses the new rolloutState property of the ECS deployment entity. Until now, whether a deployment had finished had to be determined from the number of stably running tasks of the expected task definition.

With this change, we can utilize the new circuit breaker feature and:

  1. monitor the number of failed tasks during a deployment and
  2. alert if the deployment has failed and the circuit breaker kicked in

[Screenshot 2021-03-03 at 13:15:37]
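As a rough illustration of the two points above, the sketch below reads `rolloutState`, `rolloutStateReason`, and `failedTasks` from a service description shaped like one entry of the `services` list returned by the ECS DescribeServices API. This is not the code from the feature branch: the function name `rollout_status`, the returned dict layout, and the choice to match the deployment by task definition ARN are assumptions made for illustration.

```python
def rollout_status(service: dict, task_definition_arn: str) -> dict:
    """Summarize the rollout of the deployment for a given task definition.

    service is assumed to look like one entry of the "services" list in a
    DescribeServices response. The deployment is matched by task definition
    rather than by PRIMARY status, on the assumption that after a rollback
    the failed deployment may no longer be the primary one.
    """
    for deployment in service.get("deployments", []):
        if deployment.get("taskDefinition") == task_definition_arn:
            return {
                # ECS reports IN_PROGRESS, COMPLETED, or FAILED here.
                "state": deployment.get("rolloutState", "UNKNOWN"),
                "reason": deployment.get("rolloutStateReason", ""),
                "failed_tasks": deployment.get("failedTasks", 0),
            }
    # No deployment found for the expected task definition.
    return {"state": "UNKNOWN", "reason": "", "failed_tasks": 0}
```

A caller could poll this and stop as soon as `state` is `FAILED` instead of waiting for a timeout, printing `failed_tasks` along the way.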

The feature is not released yet, but it is available in a feature branch for now, if anybody wants to chime in and test this new behavior.

Best, Fabian