aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

[ECS] [request]: Editable CapacityProviderReservation AlarmLow duration (is always 15min) #1220

Open monsieurgustav opened 3 years ago

monsieurgustav commented 3 years ago

Tell us about your request EC2 scale-in via ECS Capacity Provider is too slow due to the hard-coded 15 min alarm-low. I wish I could change it to 5 or 10 min.

Which service(s) is this request for? ECS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard? I run tasks that execute jobs stored in an SQS queue. A job takes a few minutes to execute. Most of the time, there are no jobs at all; occasionally a single job is created at a random time; sometimes N jobs are created at once. When all jobs are finished, I want to scale in quite fast because it is likely there won't be a new job in the next few minutes.

CAS automatically creates a "CapacityProviderReservation AlarmLow" alarm that only fires after 15 min. Scale-in is then very slow compared to a job that takes a few minutes.

Fargate would be an option, but a GPU is required.

Are you currently working around this issue? I don't, I pay for idling EC2 instances.

matt-theguyw1cat commented 3 years ago

I have exactly the same issue as Guillaume. I'm consuming an SQS queue that receives very sporadic and bursty requests, which are sometimes handled very quickly and sometimes take several minutes. I'm contemplating a CloudFormation custom resource to automatically go in there and hack that 15 min on that alarm down to something more like 5 or even less. My tasks all self-destruct pretty much (1 + visibility) mins after zero messages are found, so I want the container instances to pretty much die a minute or 3 after that.
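A minimal sketch of what such a hack might look like, assuming the alarm is fetched with boto3's `describe_alarms` and written back with `put_metric_alarm`. The helper `shortened_alarm_params` and every value in `example_alarm` are hypothetical, made up for illustration. This is not a supported configuration: the alarm belongs to the managed target tracking policy, so the service may overwrite or recreate it at any time.

```python
# Keys that describe_alarms returns and put_metric_alarm also accepts.
# Read-only fields (AlarmArn, StateValue, ...) must be dropped before the call.
PUT_METRIC_ALARM_KEYS = (
    "AlarmName", "AlarmDescription", "ActionsEnabled", "OKActions",
    "AlarmActions", "InsufficientDataActions", "MetricName", "Namespace",
    "Statistic", "ExtendedStatistic", "Dimensions", "Period", "Unit",
    "EvaluationPeriods", "DatapointsToAlarm", "Threshold",
    "ComparisonOperator", "TreatMissingData",
    "EvaluateLowSampleCountPercentile", "Metrics", "ThresholdMetricId",
)


def shortened_alarm_params(alarm, datapoints):
    """Build put_metric_alarm kwargs that copy `alarm` but require only
    `datapoints` consecutive breaching periods before it fires."""
    if datapoints < 1:
        raise ValueError("datapoints must be at least 1")
    params = {k: alarm[k] for k in PUT_METRIC_ALARM_KEYS if k in alarm}
    params["EvaluationPeriods"] = datapoints
    params["DatapointsToAlarm"] = datapoints
    return params


# Hypothetical alarm record, shaped like one entry of
# describe_alarms()["MetricAlarms"]. All names and values are assumptions.
example_alarm = {
    "AlarmName": "TargetTracking-my-asg-AlarmLow-example",
    "StateValue": "OK",  # read-only field, dropped by the helper
    "MetricName": "CapacityProviderReservation",
    "Namespace": "AWS/ECS/ManagedScaling",
    "Statistic": "Average",
    "Dimensions": [
        {"Name": "CapacityProviderName", "Value": "my-capacity-provider"},
        {"Name": "ClusterName", "Value": "my-cluster"},
    ],
    "Period": 60,
    "EvaluationPeriods": 15,
    "DatapointsToAlarm": 15,
    "Threshold": 90.0,
    "ComparisonOperator": "LessThanThreshold",
    "AlarmActions": [],
}

params = shortened_alarm_params(example_alarm, 5)

# Against a real account (requires boto3 and credentials), roughly:
#   import boto3
#   cw = boto3.client("cloudwatch")
#   alarm = cw.describe_alarms(
#       AlarmNamePrefix="TargetTracking-my-asg-AlarmLow")["MetricAlarms"][0]
#   cw.put_metric_alarm(**shortened_alarm_params(alarm, 5))
```

Keeping the parameter rewrite separate from the AWS calls makes it easy to check offline; a custom resource or scheduled Lambda would likely need to reapply it whenever the managed policy resets the alarm.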

nshi commented 1 year ago

Similar use case here. We use EC2-backed ECS because our jobs tend to run for a couple of minutes (more than 2 minutes) and they require GPUs. We have very predictable hourly spikes in traffic. For example, traffic increases by 100x from hh:50 to hh:15 every hour. We want to be able to scale the cluster in, along with all the underlying EC2 instances, immediately after the traffic goes down.

Due to the 15-minute delay in the CloudWatch alarm for the underlying ASG, we have to pay for 15 idle minutes after every spike, which is 25% of each hour.
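The 25% figure follows from one scale-in per hourly cycle. As a rough sketch, assuming the idle time equals the alarm delay and ignoring instance termination time (the function name `idle_fraction` is made up for illustration):

```python
def idle_fraction(scale_in_delay_min, cycle_min=60):
    """Share of each traffic cycle spent paying for idle instances,
    assuming exactly one scale-in per cycle."""
    return scale_in_delay_min / cycle_min


# Compare the fixed 15-minute delay with the shorter delays requested here.
for delay in (15, 10, 5):
    print(f"{delay} min delay: {idle_fraction(delay):.1%} of each hour idle")
```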

Ideally we would like to use a different scaling policy than the ECS managed target tracking policy, but that's not allowed. Alternatively, we'd like to be able to modify the samples needed to trigger the alarm so it's not always 15.

SauravCR7 commented 11 months ago

Hi, we have a similar use case too: we are using ECS with GPU-capable EC2 ASGs. We have a CloudWatch alarm that is triggered when messages are received in SQS, and this alarm is used to scale our tasks in ECS. Messages can arrive in bursts or very sporadically, and waiting 15 minutes for instances to scale in seems too much to ask and adds to our costs.

We would like the number of datapoints required for scale-in to be customizable, so that we don't have to wait 15 minutes for instances to scale in.