aws / amazon-ecs-agent

Amazon Elastic Container Service Agent
http://aws.amazon.com/ecs/
Apache License 2.0

Draining instances should deregister their available MEM/CPU. #718

Closed: maartenvanderhoef closed this issue 7 years ago

maartenvanderhoef commented 7 years ago

An ECS instance in the DRAINING state should deregister its available memory and CPU from the cluster. In an autoscaling scenario this would cause CPUReservation and MemoryReservation to cross the alarm threshold, causing fresh nodes to be added automatically.
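
For context, this is roughly the kind of scale-out alarm I have in mind, sketched with boto3 (the cluster name, threshold, and scaling-policy ARN are placeholders, not anything specific to my setup):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Scale out once the cluster's reserved memory crosses 75%.
# "my-cluster" and the AlarmActions entry are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="ecs-memory-reservation-high",
    Namespace="AWS/ECS",
    MetricName="MemoryReservation",
    Dimensions=[{"Name": "ClusterName", "Value": "my-cluster"}],
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=75.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["<scale-out-policy-arn>"],  # placeholder scaling policy ARN
)
```

If draining instances keep contributing their capacity, this alarm never fires even though no new tasks can be placed on them.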

liwenwu-amazon commented 7 years ago

Hi,

ECS already deregisters the reservation metrics for a draining instance. However, the instance status only updates server-side once every 10-15 minutes, which means the instance's CPU and memory may still be counted as available for up to 15 minutes.

I understand that it would be better for this process to be more timely. Could you describe your use case?
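
In the meantime, you can see what ECS itself reports for the draining instances with something like the following (a boto3 sketch; the cluster name is a placeholder):

```python
import boto3

ecs = boto3.client("ecs")
CLUSTER = "my-cluster"  # placeholder

# Container instances currently in the DRAINING state, as ECS sees them.
arns = ecs.list_container_instances(cluster=CLUSTER, status="DRAINING")["containerInstanceArns"]
if arns:
    described = ecs.describe_container_instances(cluster=CLUSTER, containerInstances=arns)
    for ci in described["containerInstances"]:
        registered = {r["name"]: r.get("integerValue") for r in ci["registeredResources"]}
        remaining = {r["name"]: r.get("integerValue") for r in ci["remainingResources"]}
        print(ci["ec2InstanceId"], ci["status"],
              "registered MEM:", registered.get("MEMORY"),
              "remaining MEM:", remaining.get("MEMORY"))
```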

aaithal commented 7 years ago

Hi @maartenvanderhoef, thank you for reporting this issue. An instance in the DRAINING state does get discounted from CPUReservation and MemoryReservation. However, this does not take effect immediately; it takes some time for the reconciliation to show up in the metrics. Could you please answer the following questions to help us better address this issue?

  1. Are you seeing DRAINING instances always being counted toward these metrics and never discounted from the calculation?
  2. If not, is the delay in reconciliation what is actually causing you problems, and is that your main concern here?

Thanks, Anirudh

samuelkarp commented 7 years ago

@maartenvanderhoef We haven't heard from you in a while, so I'm going to close this issue. If you get a chance to respond to @aaithal's questions we'd be happy to reopen it.

dmulter commented 6 years ago

I have the same issue. My use case: I'm draining the instances in an old ASG so that their tasks migrate to a new ASG.

The problem is that the new ASG won't scale out to make room for the migrating tasks, because the drained old ASG's memory still counts as available and keeps the scale-out threshold from ever being reached.
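
For reference, the old instances are put into DRAINING roughly like this (a boto3 sketch; the cluster name and instance ARN are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

# Drain the old ASG's container instances so ECS reschedules their
# tasks elsewhere. Cluster name and instance ARN are placeholders.
ecs.update_container_instances_state(
    cluster="my-cluster",
    containerInstances=["<container-instance-arn>"],
    status="DRAINING",
)
```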

apottere commented 6 years ago

I'm running into this myself. We had an issue where our private Docker registry auth was invalidated, so we had to restart the whole cluster to pick up new credentials. I started by marking every instance in the cluster as DRAINING (since they can't start new tasks anyway), but the autoscaling group only spins up new EC2 instances after I terminate one. It's been rolling through the whole cluster like this for about an hour, so CloudWatch clearly doesn't take the draining state into account when calculating reserved CPU/memory, which is what we trigger on to autoscale the cluster.
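
As a stopgap, I'm considering publishing my own reservation metric that only counts ACTIVE instances and alarming on that instead. A rough boto3 sketch (the cluster name and metric namespace are placeholders):

```python
import boto3

ecs = boto3.client("ecs")
cloudwatch = boto3.client("cloudwatch")

CLUSTER = "my-cluster"      # placeholder
NAMESPACE = "Custom/ECS"    # placeholder namespace for the custom metric


def resource_total(instances, key, name):
    """Sum a named resource (e.g. MEMORY) across container instances."""
    return sum(
        r["integerValue"]
        for ci in instances
        for r in ci[key]
        if r["name"] == name
    )


# Only ACTIVE instances count toward capacity; DRAINING instances are excluded.
arns = ecs.list_container_instances(cluster=CLUSTER, status="ACTIVE")["containerInstanceArns"]
if arns:
    instances = ecs.describe_container_instances(
        cluster=CLUSTER, containerInstances=arns
    )["containerInstances"]

    registered = resource_total(instances, "registeredResources", "MEMORY")
    remaining = resource_total(instances, "remainingResources", "MEMORY")
    reservation = 100.0 * (registered - remaining) / registered if registered else 0.0

    # Alarm on this metric instead of the built-in AWS/ECS MemoryReservation.
    cloudwatch.put_metric_data(
        Namespace=NAMESPACE,
        MetricData=[{
            "MetricName": "ActiveMemoryReservation",
            "Dimensions": [{"Name": "ClusterName", "Value": CLUSTER}],
            "Value": reservation,
            "Unit": "Percent",
        }],
    )
```

The drained instances' capacity simply drops out of the calculation, so the alarm only tracks instances that can actually accept new tasks.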