aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS] [Deployment]: ECS deployment circuit breaker should handle containers that exit abnormally #1206

Open forward2you opened 3 years ago

forward2you commented 3 years ago

Community Note

Tell us about your request. What do you want us to build?

Use CloudFormation to update an ECS background task.

Which service(s) is this request for? ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

As in the demo of the deployment circuit breaker, the container fails to start with the error: docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "exec: \"exit\": executable file not found in $PATH": unknown. That case is handled by the deployment circuit breaker.

However, the more common case is a container that starts successfully but then exits abnormally.

For example, the Dockerfile:

FROM alpine:latest
# Start successfully, then exit abnormally with status 1.
CMD ["sh", "-c", "exit 1"]

Currently, the container stops with Essential container in task exited and is marked as failed, but when the second task starts, the failedTasks count is reset to 1, which means the circuit breaker threshold will never be triggered.

What we expect is that a task which runs but exits abnormally is treated as failed and the failedTasks count is not reset, so the breaker threshold is eventually met and the deployment rolls back.
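
For readers landing here, the breaker in question is configured per service; a minimal AWS CLI sketch (cluster and service names are placeholders) that enables it together with automatic rollback:

    aws ecs update-service \
      --cluster my-cluster \
      --service my-service \
      --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true}"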

LeMikaelF commented 3 years ago

This issue could probably use a higher priority, since even the Circuit Breaker official demo (https://www.youtube.com/watch?v=Y2Ez9M7A95Y) doesn't work because of this behaviour.

It also makes the CB unreliable, as it doesn't catch all types of deployment failures. As an example, I have microservices that will throw an exception and exit early if there is an error in their DB connection string. With the current behaviour of CB, deploying one of those with a wrong DB string will churn on forever.

nickfaughey commented 3 years ago

Any more visibility on this issue? This behavior kind of defeats the purpose of the circuit breaker feature - if we can't trust it to catch all types of ECS task failures, we'll need to implement our own fail-safes, alerts, and rollback functionality anyway.

mimozell commented 3 years ago

Having a similar issue to @LeMikaelF where the app shuts down when a required environment variable is missing. The app just shuts down and the failedTasks is incremented to 1, but as soon as that happens, it is decremented back to 0 and the threshold is never crossed. If CBs don't work for this kind of situation then they are pretty useless to our team :(
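
For context, the failure mode described above typically comes from a fail-fast entrypoint along these lines (a sketch; the variable name is illustrative):

    #!/bin/sh
    # Exit non-zero if a required variable is missing, so the essential
    # container dies shortly after reaching RUNNING.
    : "${DATABASE_URL:?DATABASE_URL must be set}"
    exec "$@"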

dezren39 commented 3 years ago

Can someone explain why it decrements to begin with? I'm not sure I understand.

jtyoung commented 3 years ago

I just spent an entire day trying to figure out why my circuit breaker was never triggering before coming across this issue. I dutifully followed the documentation and built out the SNS topic and EventBridge rules and subscribed them to Datadog to send me notifications about when my deploys fail, only to discover that was all wasted effort because the circuit breaker is functionally useless.

I just need to know when my containers are spinning up, "running" for a few seconds, and dying before ever being marked healthy by the ALB they sit behind. This certainly seems like core functionality of a deployment circuit breaker, and the documentation absolutely misleads you into thinking that this is how the circuit breaker will behave. This paragraph says that the circuit breaker will trip if the ALB healthchecks mark the container as unhealthy, but if the container exits before the ALB healthchecks run enough times to mark it as unhealthy, then that container is considered deployed successfully and it just retries forever.

Even a circuit breaker case as naive as saying "if this deploy hasn't been marked as completed in X minutes, mark it as failed" would be beneficial for this specific case.
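
A sketch of that naive fail-safe, polling with the AWS CLI (cluster/service names and the 10-minute budget are assumptions):

    # Treat "PRIMARY deployment not COMPLETED within the budget" as a failure.
    deadline=$(( $(date +%s) + 600 ))
    while :; do
      state=$(aws ecs describe-services --cluster "$CLUSTER" --services "$SERVICE" \
        --query 'services[0].deployments[?status==`PRIMARY`]|[0].rolloutState' --output text)
      [ "$state" = "COMPLETED" ] && break
      if [ "$(date +%s)" -ge "$deadline" ]; then
        echo "deployment not completed within 10 minutes, marking as failed" >&2
        exit 1
      fi
      sleep 15
    done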

robert-put commented 3 years ago

Ran into this issue the other day. It caused a failing deploy to keep failing instead of rolling back. Autoscaling then tried to scale up the failing deployment (because it was more recent?), and the old deployment that was still live was left scaled too low, until the failing deployment was dealt with by manual intervention.

Does anyone have a good workaround for this until it's resolved?

jenshoffmann1331 commented 3 years ago

@robert-put The workaround is to not use circuit breaker at all. Do something on your own. Projects like ecs-deploy may help with that.
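
The manual equivalent of such a rollback is to point the service back at the last known-good task definition, roughly (family and revision are placeholders):

    # Roll back by hand to a previously working task definition revision.
    aws ecs update-service \
      --cluster my-cluster \
      --service my-service \
      --task-definition my-family:42 \
      --force-new-deployment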

hampsterx commented 3 years ago

Just ran into a similar issue. Deployment circuit breaker: "enabled with rollback"

Updated the task definition and the deployment was stuck in "In progress" for at least 30 minutes, with no events at all. Tried twice more, no further deployments or events; basically had to delete the service and create it again. Not ideal, to say the least.

vibhav-ag commented 2 years ago

Wanted to share an update here: we have made a change to circuit breaker to ensure that if a Task fails before reporting a health check, the failedTask count is not reset. @jtyoung I believe this should resolve the issue you faced. For others, do you also have health checks configured for your services? If so, circuit breaker should now work for your use cases.

nahum-litvin-hs commented 2 years ago

This issue was the straw that broke the camel's back. We moved to Kubernetes.

jgrumboe commented 2 years ago

Wanted to share an update here: we have made a change to circuit breaker to ensure that if a Task fails before reporting a health check, the failedTask count is not reset. @jtyoung I believe this should resolve the issue you faced. For others, do you also have health checks configured for your services? If so, circuit breaker should now work for your use cases.

I also think that CB only works well in combination with defined container health checks.
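
For reference, a container-level health check of the kind being discussed looks roughly like this (image, port, and path are placeholders; other task settings required for your launch type are omitted):

    aws ecs register-task-definition \
      --family my-service \
      --container-definitions '[{
        "name": "app",
        "image": "my-image:latest",
        "essential": true,
        "healthCheck": {
          "command": ["CMD-SHELL", "curl -f http://127.0.0.1:8080/health || exit 1"],
          "interval": 30,
          "timeout": 5,
          "retries": 3,
          "startPeriod": 10
        }
      }]'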

thule0 commented 2 years ago

Wanted to share an update here: we have made a change to circuit breaker to ensure that if a Task fails before reporting a health check, the failedTask count is not reset.

I am having this issue right now, and the failedTask count is being reset every time. My container fails to start way before it could respond to the healthcheck.

leoddias commented 2 years ago

I'm having this issue right now... The pipeline times out and the deploy rolls out infinitely; I have to force a rollback manually (updating the service's task definition version).

vibhav-ag commented 2 years ago

@leoddias @thule0 thank you for flagging this. Could you please reach out to me at agvibhav@amazon.com with more details so we can triage this.

thule0 commented 2 years ago

@vibhav-ag how can I help? Do you have trouble reproducing this?

It has always been like that for me: deploy a working container, then try to deploy a container that is fundamentally broken and cannot start; it keeps retrying and the circuit breaker does not stop the process.

jgrumboe commented 2 years ago

@thule0 Do you have container healthchecks configured in your taskdefinition?

thule0 commented 2 years ago

@jgrumboe I tried both with and without a defined healthcheck, same result.

leoddias commented 2 years ago

@vibhav-ag @jgrumboe Everything is implemented through IaC, and I have the following template for the task definition:

  TaskDefinition:
      Type: AWS::ECS::TaskDefinition
      Properties:
        Family: !Sub "${StackNamePrefix}-${ServiceName}"
        NetworkMode: awsvpc
        RequiresCompatibilities:
          - FARGATE
        Cpu: !Ref ContainerCpu
        Memory: !Ref ContainerMemory
        ExecutionRoleArn: !Ref ExecutionRoleArn
        TaskRoleArn: !Ref TaskRoleArn
        ContainerDefinitions:
          - Name: !Ref ServiceName
            Image: !Sub "{{resolve:ssm:${ApplicationImageParameter}}}"
            PortMappings:
              - ContainerPort: !Ref ContainerPort
            HealthCheck:
              Interval: !Ref HealthCheckInterval
              Retries: !Ref HealthCheckRetries
              StartPeriod: !Ref StartPeriod
              Timeout: !Ref HealthCheckTimeout
              Command:
                - CMD-SHELL
                - !Sub 'curl -f http://127.0.0.1:${ContainerPort}${HealthCheckPath} || exit 1'
            LogConfiguration:
              LogDriver: awslogs
              Options:
                awslogs-region: !Ref AWS::Region
                awslogs-group: !Ref LogGroup
                awslogs-stream-prefix: ecs
            Environment:
              - Name: ENV
                Value: !Ref Environment
              - Name: PORT 
                Value: !Ref ContainerPort 
              - Name: NODE_ENV
                Value: !Ref Environment
              - Name: SPRING_PROFILES_ACTIVE
                Value: !Ref Environment
              - Name: APP_TYPE
                Value: !Ref AppType
            DockerLabels:
              traefik.enable: true

As you can see I have container health checks, and as you probably know I don't have a target group, since I use traefik as the router. Things we use in this workload:

- CodePipeline with an ECS deploy stage
- ECS Fargate with rollback enabled

The issue happens on every deployment that fails on boot and gives us the following message: "Stopped reason: Essential container in task exited". The new tasks loop infinitely (a new manual deployment of the service is necessary, pointing at the previous task definition).

Let me know if you guys need more details

vibhav-ag commented 2 years ago

Thanks @leoddias this is helpful- will look into this and circle back.

ayozemr commented 2 years ago

I am facing another case that I think fits this topic. I have an app that applies DB migrations on startup. A deployment has now been failing for 15 minutes because of a bad migration, but failedTasks is always 1 even though it has retried many times.

I have set up ALB health checks, because they are mandatory, and also ECS health checks, all via CDK:

App error logs:

    2022-03-08T10:40:50.082+00:00  npm ERR! code ELIFECYCLE
    2022-03-08T10:40:50.082+00:00  npm ERR! errno 1
    2022-03-08T10:40:50.085+00:00  npm ERR! api@0.1.0 start: `strapi start`
    2022-03-08T10:40:50.086+00:00  npm ERR! Exit status 1

ECS task exit error:

Stopped reason Essential container in task exited

Service deployment:

  "desiredCount": 1,
  "pendingCount": 1,
  "runningCount": 0,
  "failedTasks": 1,
  "createdAt": "2022-03-08T10:25:18.050000+00:00",
  "updatedAt": "2022-03-08T10:25:18.050000+00:00",
  "launchType": "FARGATE",
  "platformVersion": "1.4.0",
  "platformFamily": "Linux",

SkySails commented 2 years ago

We're facing this issue as well, so I'd like to add a concrete example to the reports here.

Note: Sorry for the WOT, I just want to make sure that I cover as much of what I have gathered as possible in the hopes that it will either be falsified by someone who has had even better insights or helps someone that is stuck with this issue.

The scenario

Let's say that the infrastructure looks like this:

[Infrastructure diagram]

The tasks/containers consist of APIs that require details from Secrets Manager in order to connect to a database. The secrets are fetched from within the task itself as part of the initialization of the API. If it is unable to reach Secrets Manager for some reason, the API exits with a non-zero exit code.

Let's say that there is a new deployment to the service that adds a new secret without updating the permissions for the task appropriately, resulting in a permission issue. The API starts normally (the container enters RUNNING state) and within a few seconds it reaches the point where it is supposed to fetch the secrets, but it fails and exits. When this happens, the task status transitions into STOPPED with an Essential container in task exited error message, as expected.

The issue

If you were running the container locally, depending on your configuration you might expect the container to just perform a restart (restart-always). This is often fine if the error is an exception caused by something temporary, but in this case, the task will just exit again immediately.

According to the documentation for ECS Services, at least AFAIU, restart attempts for containers that repeatedly fail to enter a RUNNING state will be throttled. I can't find any similar information about containers that do enter a RUNNING state before failing, so I am assuming that it will just restart automatically in a loop similar to a restart-always policy. This is also what I have been observing in real situations so far.

In the described scenario, that means that the failing container will be infinitely restarted unless someone manually intervenes and updates the service to use a working (previous) task definition. The only way a developer would find out that this has happened is when the CI times out when checking for successful deployment, which can take up to 10 minutes for the AWS CLI by default in my experience.
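
(That default comes from the CLI's built-in waiter, which polls roughly every 15 seconds up to 40 times; a sketch, names being placeholders:)

    # Fails with a waiter timeout after ~10 minutes if the service never stabilizes.
    aws ecs wait services-stable --cluster my-cluster --services my-service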

The "solution"

In order to mitigate the above scenario, one could implement rollbacks. In a perfect world, a rollback feature would at least recognize a failure scenario like the one described above.

The ECS Deployment Circuit Breaker and its rollback option seems to cover the above when reading the documentation:

rollback

Determines whether to configure Amazon ECS to roll back the service if a service deployment fails. If rollback is enabled, when a service deployment fails, the service is rolled back to the last deployment that completed successfully.

I think it's safe to say that the description seems to imply that failing tasks will result in a failed deployment and subsequently a rollback. But this is just half the truth.

Clarification

After a chat with AWS support, this is what I have been able to establish about the situation:

What the documentation really wants to say is that tasks that fail immediately without ever reaching a RUNNING state will cause a service deployment to fail. That's the catch.

Tasks that successfully reach a RUNNING state, be it only for a few seconds, have to fail the associated health checks. If a task exits before it has any chance to answer any of the health checks, an infinite loop begins.

rocco-alchemy commented 2 years ago

Wanted to share an update here: we have made a change to circuit breaker to ensure that if a Task fails before reporting a health check, the failedTask count is not reset. @jtyoung I believe this should resolve the issue you faced. For others, do you also have health checks configured for your services? If so, circuit breaker should now work for your use cases.

@vibhav-ag , I can confirm this is not working as intended. The failedTasks count goes up, but then two seconds later is discarded or decremented.

  {
    "id": "ecs-svc/5835613358439999935",
    "status": "PRIMARY",
    "taskDefinition": "arn:aws:ecs:us-east-1:[redacted]:task-definition/fail_fast_testing:2",
    "desiredCount": 2,
    "pendingCount": 2,
    "runningCount": 0,
    "failedTasks": 2,
    "createdAt": "2022-04-13T17:10:06.266000-04:00",
    "updatedAt": "2022-04-14T11:55:11.607000-04:00",

  {
    "id": "ecs-svc/5835613358439999935",
    "status": "PRIMARY",
    "taskDefinition": "arn:aws:ecs:us-east-1:[redacted]:task-definition/fail_fast_testing:2",
    "desiredCount": 2,
    "pendingCount": 1,
    "runningCount": 1,
    "failedTasks": 0,
    "createdAt": "2022-04-14T11:54:54.873000-04:00",
    "updatedAt": "2022-04-14T11:56:01.402000-04:00",

Can we get a status on this? Catching tasks that fail to run is a core use-case of this feature.

kszarlej commented 2 years ago

I'd like to give a +1 to this topic. As a long-term user of ECS, I was actually very confused that the current Deployment Circuit Breaker doesn't count tasks that fail with exit code 1 during deployment, and only works for situations where the task cannot be placed on the cluster at all. That happens quite infrequently: once the execution policy is properly crafted for a service, the MAJORITY of the cases where a rollback is needed are when the new version of the application fails quickly, e.g. due to a missing ENV variable.

Ideally, tasks that were scheduled on the cluster, started, and then failed while the deployment was still running should be counted as failedTasks as well.

The deployment structure returned by describe-services contains a rolloutState field that is set to IN_PROGRESS while the deployment is still running. The failedTasks value should never be reset to 0 while the rolloutState of a deployment is IN_PROGRESS. When the deployment finishes, rolloutState is set to COMPLETED, and from that moment the circuit breaker should be deactivated.

I think that everybody expects the Deployment Circuit Breaker to work pretty much as I described above.
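
For reference, the fields Krzysztof mentions can be inspected like this (cluster and service names are placeholders):

    aws ecs describe-services --cluster my-cluster --services my-service \
      --query 'services[0].deployments[].{id: id, status: status, rolloutState: rolloutState, failedTasks: failedTasks}' \
      --output table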

Best Regards, Krzysztof

jishi commented 2 years ago

I just stumbled upon this problem as well. A container image with basically CMD exit 2 does eventually fail the deployment, but failedTasks does not count every failed task, and it takes about 30 minutes for ECS to realize that it is indeed broken. So it "semi-works".

Now, adjusting this to a more real-life scenario with CMD sleep 10 && exit 2 on the container (meaning it starts to wire up whatever implementation it has, then eventually fails), the failedTasks count stays at 0.

This makes the whole feature useful only for obvious errors, missing dependencies, or similar.
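
A sketch of that second repro (the tag is arbitrary): the container comes up, idles briefly as if wiring things up, then fails.

    # Build an image that "starts", sleeps, then exits non-zero.
    cat > Dockerfile <<'EOF'
    FROM alpine:latest
    CMD ["sh", "-c", "sleep 10 && exit 2"]
    EOF
    docker build -t fast-fail-repro:latest .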

vibhav-ag commented 2 years ago

Hi All, thanks for flagging the issue- we are triaging this and will circle back with an update here.

kszarlej commented 2 years ago

Hello @vibhav-ag any update on that?

vibhav-ag commented 2 years ago

@kszarlej thanks for following-up. We did identify some issues here and are working on making some changes- I will share an update on the thread once changes are rolled out.

ninerealms commented 2 years ago

Hi, do we know when this will be resolved? It is costing us thousands of dollars per month, because we have hundreds of ECS services and AWS Config recording configuration changes.

kszarlej commented 2 years ago

@vibhav-ag Could you shed some more light on what you found out and when we can expect the fixes to be live? If we don't get reliable automated rollbacks I might unfortunately be forced to migrate maaaany ECS services to K8s :(

vibhav-ag commented 2 years ago

@kszarlej thanks for following up.

Sharing some additional context here: ECS Circuit Breaker distinguishes between Tasks that fail to launch and Tasks that fail shortly after starting (i.e. fast-failing containers). Because of this, in scenarios where some Tasks in a service fail to launch while others fast-fail, the failedTasks count (the max of these two scenarios) can keep getting reset. To get to a more consistent experience, we are fixing this issue so that the failedTasks count (across both scenarios) is only reset when a Task passes health checks (or runs for a minimum period of time in case no health checks are configured). I can't share a concrete timeline, but I can say that we are actively working on rolling out this change and I will share an update once it is available.

Separately, for automated rollbacks, another capability we are looking to add is integration with CloudWatch alarms so that if an alarm is triggered, ECS would rollback the deployment. This is further out, but we would love to hear if this would be valuable for your use case as well.

vibhav-ag commented 2 years ago

@ninerealms Could you please share more context about the additional charges incurred for AWS Config?

ninerealms commented 2 years ago

@ninerealms Could you please share more context about the additional charges incurred for AWS Config?

Of course.

So we have an AWS Config recorder detecting changes to all services. When an ECS service fails to start successfully, either due to a failing health check or a fast exit, we encounter a DeleteNetworkInterface API call. When ECS attempts to restart the service, we encounter a CreateNetworkInterface API call. AWS Config costs $0.003 per configuration item recorded. If ECS attempts to restart a task more than 2,000 times per day because Circuit Breaker does not work for a fast exit or health check timeout, that's 4,000 additional configuration items recorded, or $12 per task. Multiply that by 70 ECS services and that's $840 per day. Having gone back and performed some analysis, Circuit Breaker failing to work has cost us somewhere in the region of $12,000 in the last two weeks. My organisation would greatly appreciate two free tickets to re:Invent this year by way of compensation. ;-)

kwn commented 2 years ago

We also experienced extra costs due to the broken circuit breaker feature.

Our ECR repositories and ECS clusters are located in separate AWS accounts. We provisioned a bunch of services that got stuck in an infinite deployment loop and left them unmanaged for a few days (circuit breaker was enabled). ECS was constantly pulling container images from the ECR in the other account, so we were charged for terabytes of data going through the VPC/NAT. Luckily we realised quickly enough that something was screwed up.

I'm massively disappointed that AWS releases such an unreliable feature. It would be nice to see different strategies for the circuit breaker (e.g. fail if the deployment has been failing for X minutes).

thule0 commented 2 years ago

I just had circuit breaker abort a rollout because containers exited. Thank you, seems like it works.

mrcrowl commented 2 years ago

@Roman… Did it roll back to the previous version with no downtime?

RichiCoder1 commented 2 years ago

Separately, for automated rollbacks, another capability we are looking to add is integration with CloudWatch alarms so that if an alarm is triggered, ECS would rollback the deployment. This is further out, but we would love to hear if this would be valuable for your use case as well.

Wanted to respond to this: I'd personally love it. I'd take some inspiration from how it's configured in CodeDeploy: https://docs.aws.amazon.com/codedeploy/latest/userguide/deployment-groups-configure-advanced-options.html. (I'd love not to have to use CodeDeploy for this purpose.)

thule0 commented 2 years ago

@mrcrowl Well, yes. It's a blue-green rollout with 200% max capacity, so the old tasks (task, honestly) never went down. It tried the new deployment for a while and aborted after 10 failed tasks, as advertised.

Now if only we were able to change that value (10), it would be great.
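
(The capacity side of that setup is configurable, while the failure threshold was not at the time; the docs describe the threshold as derived from the desired count, with a floor of 10 and a ceiling of 200. A sketch, names being placeholders:)

    aws ecs update-service \
      --cluster my-cluster \
      --service my-service \
      --deployment-configuration "maximumPercent=200,minimumHealthyPercent=100,deploymentCircuitBreaker={enable=true,rollback=true}"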

kszarlej commented 2 years ago

@vibhav-ag Can you confirm that you released the changes :)?

vibhav-ag commented 2 years ago

@kszarlej, yes we have. Please let us know if you face any issues.

agent0 commented 2 years ago

Hi, unfortunately I cannot confirm that the circuit breaker works as expected. Last night I had my ECS service trying for 24 hours to deploy an inherently broken container. The setup is pretty much the same as described above: the app is a Spring Boot application that runs behind an ALB with health checks enabled. Due to the general startup time of a Spring application, it takes a few seconds for the container to fail; during that time, however, no valid health checks are possible. From the explanations above I conclude that the circuit breaker considers a task healthy if it runs for a certain amount of time. Is this correct, and could it be that the containers are considered healthy because of the longer startup time? And if so, is there anything we can do about it? We are also facing the issue that our ECR and ECS are in different accounts, so the image pulling is not free...

vibhav-ag commented 2 years ago

@agent0 my apologies for not responding to this sooner. This is surprising: circuit breaker should have triggered in this case. Could you please send me an email at agvibhav@amazon.com? I would really like to dig into this.

whereisaaron commented 1 year ago

We reported this issue to support last month. We have containers that, when something is wrong, fail within 0-2 seconds. Some tasks get marked as RUNNING at ~1 second but never pass a single health check. For some reason, Circuit Breaker doesn't count this; in fact, Circuit Breaker seems to treat this 1 second of RUNNING as a reason to never kick in. An infinite deployment carries on after that. Just as @SkySails describes:

https://github.com/aws/containers-roadmap/issues/1206#issuecomment-1075349834

"Tasks that successfully reaches a RUNNING state, be it for a few seconds, has to fail the associated health checks. If the task exits before it has any chance to answer to any of the health checks, an infinite loop begins."

AWS Support said:

I have forwarded the issue to the internal team, who confirmed that it is a known edge case issue that happens when some tasks started by the deployment fail immediately after reaching a running state (as in your case there were few tasks which came to running state for ~1-2 seconds and then failed).
I apologize for any inconvenience this may have caused you. The service team is working on a fix for this issue. However, I will not be able to provide any ETA for when the fix would be deployed.

So I came looking and found this. Great, it has been identified and is being worked on. But I would have hoped a solution would arrive before 2 years of working on it 😅

Is there any workaround? Can we adjust our container image so ECS doesn't think it is "RUNNING" for a couple of seconds so we can fail first? Could AWS adjust the Circuit Breaker enabled state to cause ECS to inject a wait for 2 seconds before activating the "RUNNING" status, so we can fail first?

skaylink-stefan-heitmueller commented 1 year ago

Encountering the same: a service that fails in its entrypoint script, no container health checks, no ALB health checks (running behind traefik).

The failedTasks count increases properly, but the deployment never fails.

Example after one hour:

{
    "id": "ecs-svc/...",
    "status": "PRIMARY",
    "taskDefinition": "arn:aws:ecs:eu-central-1:...:task-definition/...-broken:1",
    "desiredCount": 1,
    "pendingCount": 1,
    "runningCount": 0,
    "failedTasks": 73,
    "createdAt": "2023-01-06T08:54:51.585000+00:00",
    "updatedAt": "2023-01-06T10:10:51.782000+00:00",
    "capacityProviderStrategy": [
      {
        "capacityProvider": "FARGATE_SPOT",
        "weight": 1,
        "base": 0
      }
    ],
    "platformVersion": "1.4.0",
    "networkConfiguration": {
      "awsvpcConfiguration": {
        "subnets": [
          "...",
          "..."
        ],
        "securityGroups": [
          "..."
        ],
        "assignPublicIp": "DISABLED"
      }
    },
    "rolloutState": "IN_PROGRESS",
    "rolloutStateReason": "ECS deployment ecs-svc/... in progress."
  }

According to the docs, this all happens in stage 1, and the threshold should be 10.

Edit: Just tried with a container health check command and the circuit breaker kicks in 👍

rafaelsales commented 1 year ago

Just spent a day troubleshooting this. In my case I don't have a container healthcheck, but to my understanding that's not a prerequisite for the circuit breaker to work. The task fails with "Essential container in task exited" because the container entrypoint failed to execute; a couple of retries should be enough to trigger the circuit breaker.

paul-lupu commented 1 year ago

Plus one for this; the circuit breaker should trigger if anything prevents the task with the new task definition from starting. It's kind of useless otherwise.

jumpinjan commented 9 months ago

Any updates on this work?

vibhav-ag commented 9 months ago

We've enhanced circuit breaker to be more responsive by default.

https://aws.amazon.com/about-aws/whats-new/2024/01/amazon-ecs-deployment-monitoring-responsiveness-services/

gshpychka commented 9 months ago

We've enhanced circuit breaker to be more responsive by default.

https://aws.amazon.com/about-aws/whats-new/2024/01/amazon-ecs-deployment-monitoring-responsiveness-services/

Does an essential container exiting count as a failure now?