aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS] [request]: Faster Target tracking scaling policies scale out #2113

Open jeroenhabets opened 1 year ago

jeroenhabets commented 1 year ago


Tell us about your request
Currently, ECS waits 15+ minutes (with default CloudWatch metrics) or 5-6 minutes (with detailed CloudWatch metrics) before triggering a scale-out, which is far too slow to prevent problems for end users. Scale-out should be much faster, so that increased load is handled without affecting users.

One way would be to add an "evaluation-periods" option to target tracking scaling policies that configures how many evaluation periods to wait before triggering a scale-out. A more drastic, and more effective, solution would be to let the agents on the EC2 instances proactively push high-CPU alerts to ECS/Auto Scaling instead of waiting for CloudWatch.

Ideally, there would be two "evaluation-periods" options: one for scale-out (where speed is key) and another for scale-in (where cost has to be balanced against stability).

Which service(s) is this request for?
ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?
As soon as our ECS cluster comes under increased load, it should be able to trigger a scale-out. This is currently not possible with target tracking scaling policies and their fixed 5 evaluation periods. It is only possible with a delay of 1+ minute using step scaling, or faster using a custom orchestrator with agents running inside the instances/containers.

With two "evaluation-periods" options, we could set the scale-out periods to 1 and keep the scale-in periods at 5, or even 10 or 20.
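A minimal sketch of the decision logic the two proposed options imply, assuming one metric value per evaluation period (for example, one average CPU datapoint per minute) and a target value. `ProposedTargetTracking`, `decide`, and the 10% scale-in hysteresis band are hypothetical illustrations, not fields or behaviour of the real Application Auto Scaling API; target tracking does not expose separate evaluation periods today, which is the point of the request.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ProposedTargetTracking:
    """Hypothetical policy with separate scale-out / scale-in evaluation periods."""
    target: float                  # target metric value, e.g. 70.0 (% CPU)
    scale_out_periods: int = 1     # consecutive periods above target before scaling out
    scale_in_periods: int = 5      # consecutive periods below the low mark before scaling in
    scale_in_margin: float = 0.9   # assumed hysteresis: scale in only below target * margin

    def __post_init__(self) -> None:
        if self.scale_out_periods < 1 or self.scale_in_periods < 1:
            raise ValueError("evaluation periods must be >= 1")


def decide(policy: ProposedTargetTracking, datapoints: Sequence[float]) -> str:
    """Return "scale_out", "scale_in", or "hold".

    `datapoints` holds one metric value per evaluation period, oldest first.
    """
    # Fast path: only the last `scale_out_periods` values must exceed the target.
    recent_out = datapoints[-policy.scale_out_periods:]
    if len(recent_out) == policy.scale_out_periods and all(
        v > policy.target for v in recent_out
    ):
        return "scale_out"

    # Slow path: a longer run of clearly-low values is required before scaling in.
    low_mark = policy.target * policy.scale_in_margin
    recent_in = datapoints[-policy.scale_in_periods:]
    if len(recent_in) == policy.scale_in_periods and all(v < low_mark for v in recent_in):
        return "scale_in"

    return "hold"
```

With the defaults (1 period out, 5 periods in), `decide(ProposedTargetTracking(target=70.0), [40, 45, 50, 92])` returns `"scale_out"` after a single hot minute, while scale-in still waits for five quiet minutes. The asymmetry is the design choice the request asks for: react fast to load, react slowly to its absence.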

Are you currently working around this issue?
Yes, using step scaling, though it hurts to see ECS wait 1+ minute after the load has increased.
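The step-scaling workaround can be sketched as two request payloads, assuming an ECS service that is already registered as a scalable target with Application Auto Scaling. The keys follow the parameters of boto3's `application-autoscaling` `put_scaling_policy` and `cloudwatch` `put_metric_alarm` calls; the cluster and service names, the 70% threshold, the step sizes, and the policy/alarm names are illustrative assumptions. The alarm fires after a single 60-second breaching datapoint, which is where this workaround gains its speed.

```python
def step_policy_request(cluster: str, service: str) -> dict:
    """Payload for application-autoscaling put_scaling_policy (step scaling)."""
    return {
        "PolicyName": f"{service}-cpu-step-out",  # illustrative name
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "StepScaling",
        "StepScalingPolicyConfiguration": {
            "AdjustmentType": "ChangeInCapacity",
            "StepAdjustments": [
                # Bounds are relative to the alarm threshold:
                # 0-20 points above it adds 1 task, 20+ points above it adds 3.
                {"MetricIntervalLowerBound": 0.0, "MetricIntervalUpperBound": 20.0,
                 "ScalingAdjustment": 1},
                {"MetricIntervalLowerBound": 20.0, "ScalingAdjustment": 3},
            ],
            "Cooldown": 60,
            "MetricAggregationType": "Average",
        },
    }


def cpu_alarm_request(cluster: str, service: str, policy_arn: str,
                      threshold: float = 70.0) -> dict:
    """Payload for cloudwatch put_metric_alarm: alarm after one breaching minute."""
    return {
        "AlarmName": f"{service}-cpu-high",  # illustrative name
        "Namespace": "AWS/ECS",
        "MetricName": "CPUUtilization",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "Statistic": "Average",
        "Period": 60,
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [policy_arn],
    }


# Usage (requires boto3 and AWS credentials):
#   import boto3
#   resp = boto3.client("application-autoscaling").put_scaling_policy(
#       **step_policy_request("my-cluster", "web"))
#   boto3.client("cloudwatch").put_metric_alarm(
#       **cpu_alarm_request("my-cluster", "web", resp["PolicyARN"]))
```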

lucastonelli commented 7 months ago

I've just started using ECS and I have to say that having the compulsory scaling policy is more bothersome than I anticipated 🥲

janeglovergmsl commented 7 months ago

I don't think it's unreasonable to need 3 datapoints to scale, but having them a mandatory 1 minute apart is really slow. One of the advantages of containers is that they can spin up so quickly, and this delay means you cannot respond fast enough to bursty loads. You almost want a "burst" capability in ECS, similar to the burst capabilities we have on some of the EC2 instance types.
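The arithmetic behind this comment can be made explicit with a small helper, assuming an alarm that needs N consecutive breaching datapoints of a fixed period. It gives a lower bound only; metric publishing delay and task launch time come on top. CloudWatch alarm periods must be 10 or 30 seconds (for high-resolution metrics) or a multiple of 60 seconds, and the helper rejects anything else. The function name is illustrative.

```python
def min_alarm_latency_seconds(datapoints_to_alarm: int, period_seconds: int) -> int:
    """Lower bound on seconds of sustained load before an alarm can fire.

    Ignores metric publishing/ingestion delay and task launch time.
    """
    if datapoints_to_alarm < 1:
        raise ValueError("need at least one datapoint")
    if period_seconds not in (10, 30) and (period_seconds <= 0 or period_seconds % 60 != 0):
        raise ValueError("CloudWatch alarm periods are 10, 30, or a multiple of 60 seconds")
    return datapoints_to_alarm * period_seconds


# 3 datapoints, 1 minute apart (standard-resolution metric): at least 3 minutes.
print(min_alarm_latency_seconds(3, 60))  # 180
# 3 datapoints on a 10-second high-resolution alarm: at least 30 seconds.
print(min_alarm_latency_seconds(3, 10))  # 30
```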

ponkio-o commented 5 months ago

I have the same problem. Furthermore, I find that the ASG is also slow to launch instances. It would be nice if ECS could scale out faster, like Karpenter.

joekeilty-oub commented 2 months ago

I would be willing to pay more to have certain ECS metrics pushed into CloudWatch at a higher frequency (more often than once per minute).
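One way to get sub-minute data today, at extra cost, is to publish a custom high-resolution metric and alarm on it with a 10- or 30-second period. A minimal sketch of the `put_metric_data` payload, assuming an agent inside the task or on the instance that measures CPU itself; the `Custom/ECS` namespace and `ServiceCPUUtilization` metric name are hypothetical. `StorageResolution: 1` is the CloudWatch field that marks a datapoint as high resolution.

```python
from datetime import datetime, timezone
from typing import Optional


def high_res_cpu_metric_request(
    cluster: str, service: str, cpu_percent: float, when: Optional[datetime] = None
) -> dict:
    """Payload for cloudwatch put_metric_data with one high-resolution datapoint."""
    return {
        "Namespace": "Custom/ECS",  # hypothetical custom namespace
        "MetricData": [
            {
                "MetricName": "ServiceCPUUtilization",  # hypothetical metric name
                "Dimensions": [
                    {"Name": "ClusterName", "Value": cluster},
                    {"Name": "ServiceName", "Value": service},
                ],
                "Timestamp": when or datetime.now(timezone.utc),
                "Value": float(cpu_percent),
                "Unit": "Percent",
                "StorageResolution": 1,  # 1 = high resolution (sub-minute storage)
            }
        ],
    }


# Usage (requires boto3 and AWS credentials), e.g. every 10 seconds from the agent,
# where measured_cpu is whatever the (hypothetical) agent measured:
#   boto3.client("cloudwatch").put_metric_data(
#       **high_res_cpu_metric_request("my-cluster", "web", measured_cpu))
# An alarm on this metric can then use Period=10 with EvaluationPeriods=1.
```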