Open soukicz opened 5 years ago
Hi @soukicz Thanks for your valuable feedback, it helps us make Amazon ECS better. I wanted to clarify a couple of things to better understand the problem you are facing. Here is my understanding of the problem: when an ECS task gets overloaded during a traffic spike (too many requests routed to the container), it starts failing the ALB health checks and is marked unhealthy, so ECS deregisters the task from the ALB target group and then terminates it (to be replaced by a new one based on the desired count and scaling policies).
Also, are you scaling your ECS service based on input metrics like the number of load balancer requests, or on output metrics like CPU/memory utilization?
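For reference, scaling on an input metric usually means a target tracking policy on requests per target. Here is a minimal boto3 sketch of that; the cluster name, service name, capacities, target value, and the load balancer resource label are placeholders, not the reporter's actual setup:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the service's DesiredCount as a scalable target (placeholder names).
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

# Target tracking on requests per target (an "input" metric) reacts to traffic
# before CPU/memory (an "output" metric) has had time to climb.
autoscaling.put_scaling_policy(
    PolicyName="requests-per-target",
    ServiceNamespace="ecs",
    ResourceId="service/my-cluster/my-service",
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 500.0,  # average requests per task; tune per workload
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            # Format: <load-balancer-part>/targetgroup/<target-group-part>
            "ResourceLabel": "app/my-alb/1234567890abcdef/targetgroup/my-tg/0987654321fedcba",
        },
    },
)
```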
We have shipped an improvement to the ECS scheduler that prioritizes starting new healthy tasks before killing tasks that were marked unhealthy. You can read more in this What's New post: https://aws.amazon.com/about-aws/whats-new/2023/10/amazon-ecs-applications-resiliency-unpredictable-load-spikes/ Blog post with a deep dive: https://aws.amazon.com/blogs/containers/a-deep-dive-into-amazon-ecs-task-health-and-task-replacement/
Tell us about your request Add a "grace period" setting to configure a delay between detecting that a task is unhealthy and stopping it.
Which service(s) is this request for? ECS/Fargate with ALB
Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? If there is a problem with a task and it becomes unhealthy, the load balancer stops sending traffic to it and resumes once the task becomes healthy again. This works as expected.
The problem is that ECS will immediately stop a task once it is marked as unhealthy. I can slow this down by configuring a draining period, but once the task begins draining it won't come back.
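For context, the draining period mentioned here is presumably the target group's deregistration delay. A minimal boto3 sketch of tuning it (the target group ARN is a placeholder) would look like this:

```python
import boto3

elbv2 = boto3.client("elbv2")

elbv2.modify_target_group_attributes(
    TargetGroupArn="arn:aws:elasticloadbalancing:...:targetgroup/my-tg/0987654321fedcba",
    Attributes=[
        # How long the ALB lets in-flight requests drain after ECS deregisters
        # the task. This only delays the stop; once draining starts, the task
        # is not returned to service.
        {"Key": "deregistration_delay.timeout_seconds", "Value": "120"},
    ],
)
```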
This is especially problematic with traffic spikes. A few tasks become overloaded and are marked as unhealthy – that is correct (I don't want to route traffic to overloaded tasks). They would recover after a few seconds, but the problem is that they are stopped, which makes the traffic spike worse because even more requests are routed to the remaining containers and they fall like dominoes.
Are you currently working around this issue? It could be solved by not connecting the ECS service to the load balancer and handling target registration and task stopping from Lambda, but that seems overengineered.
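A rough sketch of that Lambda-based workaround, assuming the function is invoked with the task's IP, port, and an overload signal (the ARN, event fields, and names are all hypothetical):

```python
import boto3

elbv2 = boto3.client("elbv2")
TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:...:targetgroup/my-tg/0987654321fedcba"


def handler(event, context):
    # The event is assumed to carry the task's ENI IP, its port, and whether
    # the task currently reports itself as overloaded.
    target = {"Id": event["task_ip"], "Port": event["port"]}

    if event["overloaded"]:
        # Take the task out of rotation without stopping it.
        elbv2.deregister_targets(TargetGroupArn=TARGET_GROUP_ARN, Targets=[target])
    else:
        # Put the recovered task back into rotation.
        elbv2.register_targets(TargetGroupArn=TARGET_GROUP_ARN, Targets=[target])
```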
We are currently rate limiting requests in the containers and dropping those over the limit. It would be far more practical to mark the container as unhealthy to stop it from receiving more traffic.
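To illustrate the pattern being asked for, here is a toy health endpoint that reports 503 while the task is overloaded and 200 once load drops (the threshold, port, and simulated work are arbitrary). Today the 503 would cause ECS to stop the task, which is exactly what the requested grace period would avoid:

```python
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

MAX_IN_FLIGHT = 50  # hypothetical capacity limit
in_flight = 0
lock = threading.Lock()


class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        global in_flight
        if self.path == "/health":
            with lock:
                overloaded = in_flight >= MAX_IN_FLIGHT
            # 503 tells the ALB to stop routing traffic here; with a grace
            # period, ECS would wait for recovery instead of stopping the task.
            self.send_response(503 if overloaded else 200)
            self.end_headers()
            return

        with lock:
            in_flight += 1
        try:
            time.sleep(0.05)  # stand-in for real request work
            self.send_response(200)
            self.end_headers()
        finally:
            with lock:
                in_flight -= 1


if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```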
Additional context Might be relevant to #289 and #251