aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS/Fargate+ALB] [request]: grace period configuration for unhealthy tasks #480

Open soukicz opened 5 years ago

soukicz commented 5 years ago

Tell us about your request Add a "grace period" setting that configures the delay between detecting an unhealthy task and stopping it.

Which service(s) is this request for? ECS/Fargate with ALB

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? If a task has a problem and becomes unhealthy, the load balancer stops sending traffic to it. Traffic resumes once the task becomes healthy again. This works as expected.

The problem is that ECS immediately stops a task once it is marked as unhealthy. I can slow this down by configuring a draining period, but once the task begins draining, it won't come back.
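To make the requested behavior concrete, here is a minimal sketch of the decision logic a grace period implies. It is not an ECS API: the `TaskHealth` class, the `grace_period_s` parameter, and the `route` / `hold` / `stop` action names are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskHealth:
    """Hypothetical per-task health tracker (illustration only, not an ECS API)."""

    grace_period_s: float
    unhealthy_since: Optional[float] = None

    def observe(self, healthy: bool, now: float) -> str:
        """Record one health check result and return the action to take."""
        if healthy:
            # Recovered inside the grace period: forget the earlier failure.
            self.unhealthy_since = None
            return "route"
        if self.unhealthy_since is None:
            self.unhealthy_since = now
        if now - self.unhealthy_since >= self.grace_period_s:
            # Unhealthy for the whole grace period: only now stop the task.
            return "stop"
        # Unhealthy but within the grace period: withhold traffic, keep the task.
        return "hold"


task = TaskHealth(grace_period_s=30.0)
actions = [
    task.observe(healthy=False, now=0.0),
    task.observe(healthy=False, now=10.0),
    task.observe(healthy=True, now=20.0),
    task.observe(healthy=False, now=25.0),
    task.observe(healthy=False, now=60.0),
]
print(actions)  # ['hold', 'hold', 'route', 'hold', 'stop']
```

The task that recovers at the 20-second mark goes back into rotation instead of being replaced; only a task that stays unhealthy for the full grace period is stopped.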

This is especially problematic during traffic spikes. A few tasks become overloaded and are marked as unhealthy, which is correct (I don't want to route traffic to overloaded tasks). They would recover after a few seconds, but instead they are stopped. That makes the traffic spike worse, because even more requests are now routed to the remaining containers, and they fall like dominoes.
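The domino effect can be illustrated with a toy calculation. This is a rough sketch under simplifying assumptions that are mine, not part of the report: load is split evenly across tasks, every task has the same capacity, and replacement tasks are ignored because they take time to start. The function name and all numbers are hypothetical.

```python
def tasks_after_spike(tasks: int, capacity_rps: float,
                      baseline_rps: float, stopped_during_spike: int) -> int:
    """Return how many tasks survive once a short spike has passed.

    Toy model: traffic is split evenly, and any time the per-task load
    exceeds capacity, one more task is marked unhealthy and stopped.
    """
    tasks -= stopped_during_spike
    while tasks > 0 and baseline_rps / tasks > capacity_rps:
        tasks -= 1  # each stopped task pushes more load onto the rest
    return tasks


# 10 tasks at 100 req/s each, serving a 900 req/s baseline (90 req/s per task).
print(tasks_after_spike(10, 100.0, 900.0, stopped_during_spike=0))  # 10
print(tasks_after_spike(10, 100.0, 900.0, stopped_during_spike=2))  # 0
```

In this model, stopping just two tasks during a brief spike leaves eight tasks with 112.5 req/s each, which is already above capacity, so the remaining tasks fail one after another even though the spike is over. If the two tasks had been allowed to recover, all ten would still be serving the baseline.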

Are you currently working around this issue? It could be solved by not connecting the ECS service to the load balancer and handling target registration and task stopping from a Lambda function, but that seems overengineered.

We are currently rate limiting requests in the containers and dropping those over the limit. It would be far more practical to mark a container as unhealthy to stop it from receiving more traffic.
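For readers unfamiliar with in-container rate limiting, a token bucket is one common way to do it. The sketch below is not the implementation used here (the report does not say which algorithm is in use); the `TokenBucket` class, its parameters, and the injected clock are assumptions made for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Minimal token-bucket limiter: drop requests once the bucket is empty."""

    def __init__(self, rate_per_s: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate_per_s = rate_per_s
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)
        self.last = clock()

    def allow(self) -> bool:
        """Return True if the request may proceed, False if it should be dropped."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate_per_s)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Deterministic demo with a fake clock instead of real time.
fake_now = [0.0]
bucket = TokenBucket(rate_per_s=2.0, burst=2, clock=lambda: fake_now[0])
results = [bucket.allow() for _ in range(3)]  # two pass, the third is dropped
fake_now[0] = 0.5  # half a second at 2 tokens/s refills one token
results.append(bucket.allow())
print(results)  # [True, True, False, True]
```

The drawback described above is visible here: the dropped request has already reached the container, whereas a failing health check would keep the load balancer from sending it in the first place.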

Additional context Might be relevant to #289 and #251

pavneeta commented 3 years ago

Hi @soukicz, thanks for your valuable feedback; it helps us make Amazon ECS better. I wanted to clarify a couple of things to better understand the problem you are facing. Here is my understanding of the problem: when an ECS task gets overloaded due to a traffic spike (too many requests being routed to the container), it starts failing the ALB health checks and becomes unhealthy. ECS then deregisters the task from the ALB target group and terminates it (to be replaced by a new one based on the desired count and scaling policies).

Also, are you scaling your ECS service based on input metrics like the number of load balancer requests, or on output metrics like CPU/memory utilization?

genbit commented 12 months ago

We have shipped an improvement to the ECS scheduler that prioritizes starting new healthy tasks before killing tasks that were marked unhealthy. You can read more in this What's New post: https://aws.amazon.com/about-aws/whats-new/2023/10/amazon-ecs-applications-resiliency-unpredictable-load-spikes/ Blog post with a deep dive: https://aws.amazon.com/blogs/containers/a-deep-dive-into-amazon-ecs-task-health-and-task-replacement/