Open soukicz opened 5 years ago
Hi @soukicz Thanks for your valuable feedback, it helps us make Amazon ECS better. I wanted to clarify a couple of things to better understand the problem you are facing. Here is my understanding of the problem: when an ECS task gets overloaded during a traffic spike (too many requests routed to the container), it starts failing the ALB health checks and is marked unhealthy, so ECS deregisters the task from the ALB target group and then terminates it (to be replaced by a new one based on the desired count and scaling policies).
Also, are you scaling your ECS service based on input metrics like the number of load balancer requests, or on output metrics like CPU/memory utilization?
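For reference, scaling on an input metric usually means a target tracking policy on requests per target. Here is a minimal boto3 sketch of that; the cluster name, service name, capacities, target value, and the load balancer resource label are placeholders, not the reporter's actual setup:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the service's DesiredCount as a scalable target (placeholder names).
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

# Target tracking on requests per target (an "input" metric) reacts to traffic
# before CPU/memory (an "output" metric) has had time to climb.
autoscaling.put_scaling_policy(
    PolicyName="requests-per-target",
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 500.0,  # average requests per task; tune per workload
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            # Format: <load-balancer-part>/targetgroup/<target-group-part>
            "ResourceLabel": "app/my-alb/1234567890abcdef/targetgroup/my-tg/0987654321fedcba",
        },
    },
)
```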
We have shipped an improvement to the ECS scheduler that prioritizes starting new healthy tasks before killing tasks that were marked unhealthy. You can read more in this What's New post: https://aws.amazon.com/about-aws/whats-new/2023/10/amazon-ecs-applications-resiliency-unpredictable-load-spikes/ Blog post with a deep dive: https://aws.amazon.com/blogs/containers/a-deep-dive-into-amazon-ecs-task-health-and-task-replacement/
Tell us about your request Add a "grace period" setting to configure a delay between detecting that a task is unhealthy and stopping it.
Which service(s) is this request for? ECS/Fargate with ALB
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? If there is a problem with a task and it becomes unhealthy, the load balancer stops sending traffic to it and resumes once the task becomes healthy again. This works as expected.
The problem is that ECS will immediately stop a task once it is marked as unhealthy. I can slow this down by configuring a draining period, but once the task begins draining it won't come back.
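For context, the draining period mentioned here is presumably the target group's deregistration delay. A minimal boto3 sketch of tuning it (the target group ARN is a placeholder) would look like this:

```python
import boto3

elbv2 = boto3.client("elbv2")

elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/my-tg/0987654321fedcba",
    Attributes=[
        # How long the ALB lets in-flight requests drain after ECS deregisters
        # the task. This only delays the stop; once draining starts, the task
        # is not returned to service.
        {"Key": "deregistration_delay.timeout_seconds", "Value": "120"},
    ],
)
```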
This is especially problematic with traffic spikes. A few tasks become overloaded and are marked as unhealthy – that is correct (I don't want to route traffic to overloaded tasks). They would recover after a few seconds, but the problem is that they are stopped, which makes the traffic spike worse because even more requests are routed to the remaining containers and they fall like dominoes.
Are you currently working around this issue? It could be solved by not connecting the ECS service to the load balancer and handling target registration and task stopping from Lambda, but that seems overengineered.
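A rough sketch of that Lambda-based workaround, assuming the function is invoked with the task's IP, port, and an overload signal (the ARN, event fields, and names are all hypothetical):

```python
import boto3

elbv2 = boto3.client("elbv2")
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/my-tg/0987654321fedcba"


def handler(event, context):
    # The event is assumed to carry the task's ENI IP, its port, and whether
    # the task currently reports itself as overloaded.
    target = {"Id": event["task_ip"], "Port": event["port"]}

    if event["overloaded"]:
        # Take the task out of rotation without stopping it.
        elbv2.deregister_targets(TargetGroupArn=TARGET_GROUP_ARN, Targets=[target])
    else:
        # Put the recovered task back into rotation.
        elbv2.register_targets(TargetGroupArn=TARGET_GROUP_ARN, Targets=[target])
```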
We are currently rate limiting requests in the containers and dropping those over the limit. It would be far more practical to mark the container as unhealthy to stop it from receiving more traffic.
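To illustrate the pattern being asked for, here is a toy health endpoint that reports 503 while the task is overloaded and 200 once load drops (the threshold, port, and simulated work are arbitrary). Today the 503 would cause ECS to stop the task, which is exactly what the requested grace period would avoid:

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MAX_IN_FLIGHT = 50  # hypothetical capacity limit
in_flight = 0
lock = threading.Lock()


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        global in_flight
        if self.path == "/health":
            with lock:
                overloaded = in_flight >= MAX_IN_FLIGHT
            # 503 tells the ALB to stop routing traffic here; with a grace
            # period, ECS would wait for recovery instead of stopping the task.
            self.send_response(503 if overloaded else 200)
            self.end_headers()
            return

        with lock:
            in_flight += 1
        try:
            time.sleep(0.05)  # stand-in for real request work
            self.send_response(200)
            self.end_headers()
        finally:
            with lock:
                in_flight -= 1


if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```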
Additional context Might be relevant to #289 and #251