aws / containers-roadmap

This is the public roadmap for AWS container services (ECS, ECR, Fargate, and EKS).
https://aws.amazon.com/about-aws/whats-new/containers/

ECS capacity provider for an empty ECS cluster. #653

Closed rubashanti closed 1 year ago

rubashanti commented 4 years ago

Hello

I have noticed that when I create a capacity provider for an empty ECS cluster with a target capacity of any value less than 100 (I tested with a 90% target capacity), the ASG keeps creating 2 instances and deleting them, stuck in this loop. However, I expect no instances to be launched, since no tasks have started yet.

But when I changed the target capacity to 100, the ASG terminated the instances and didn't start new ones after that. When no tasks are running and no ECS instances have started, the capacity provider metric is 100%.

oleg-z commented 4 years ago

I've seen similar behavior. Moreover, I can't place any tasks in the ECS cluster because the cluster state is 'in provisioning', even though more than enough resources are available. It's completely unclear (at least to me) how the value of the capacity metric is calculated.

tekaeren commented 4 years ago

@coultn I've encountered the same situation as described above. The ECS cluster is stable, with all tasks started and deployed to dedicated ECS instances. There is no capacity provider assigned to the ECS cluster. Adding a new capacity provider results in a capacity metric that jumps from 0 to 100 every ten-something minutes, then drops to 0 when the connected ASG starts its instances. After 15 minutes, when the metric alarm scales out, the metric jumps to 100 again.

So in a balanced and stable cluster, we've got ECS instances scaling in and out in a 15-minute cycle, burning money.

coultn commented 4 years ago

If you set a target capacity to less than 100%, you are requesting to have spare capacity in your cluster. When there are no tasks running, the scaling policy will still periodically add instances in an attempt to maintain spare capacity. If you want your instances to scale completely down to zero when no tasks are running, you should use a target capacity of 100%.
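As a rough numeric illustration of the spare-capacity effect (a simplified sketch, not ECS's exact algorithm): the steady-state instance count implied by a target capacity is roughly M × 100 / target, rounded up.

```python
import math

# Simplified sketch (assumed, not ECS's production logic): the scaling policy
# tries to keep CapacityProviderReservation = M / N * 100 at the target, so
# the steady-state instance count is roughly N = ceil(M * 100 / target).
def instances_for_target(m: int, target: float) -> int:
    return math.ceil(m * 100 / target)

print(instances_for_target(m=1, target=90))   # 2: one busy instance plus a spare
print(instances_for_target(m=1, target=100))  # 1: no spare capacity
print(instances_for_target(m=0, target=100))  # 0: scales all the way to zero
# With m=0 and a target below 100 there is no stable N, which matches the
# oscillation reported above: an empty cluster's metric flips between 100 and 0.
```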

tekaeren commented 4 years ago

@coultn I agree, setting the target capacity to 100[%] will let instances scale down to zero. But it seems that since the CapacityProviderReservation metric is a percentage, it is capped at 100. Adding a task to such an empty cluster (no running tasks and zero instances, meaning 100% CapacityProviderReservation) will not raise the CapacityProviderReservation over 100, and thus the scale-out alarm on the connected ASG is not triggered.

So adding a task to such a cluster fails with the message service [ecs_service] was unable to place a task because no container instance met all of its requirements. Reason: No Container Instances were found in your cluster.

It seems to me that the CapacityProviderReservation metric must allow values over 100 to represent over-provisioned task capacity in the cluster.

It would also be nice to know how the CapacityProviderReservation metric is calculated - that would help with utilizing capacity to the full extent.

It seems that the CapacityProviderReservation metric only takes into account deployed/started tasks (used cluster capacity), and not the ones waiting in the queue for free resources.

tekaeren commented 4 years ago

@coultn To bypass the 100% cap on CapacityProviderReservation when no instances are running in the ASG, I've attached a scale-up scaling policy to the ASG, triggered by the same alarm that scales the ECS tasks running on the cluster. That works - the tasks start and ECS instances spawn. But this also breaks the capacity provider entirely: with the ASG having another policy attached, the capacity provider wizard creates only the CloudWatch metric, but none of the alarms or scaling policies for the ASG to do the scaling (yes, I create the capacity provider with the managed option turned on).

It seems like the Capacity Provider has not been tested, and it is not ready for usage at all.

coultn commented 4 years ago

@tekaeren The capacity provider reservation can and will go above 100 if you have tasks in the provisioning state, and it does already function that way. It can and will scale your ASG out from zero instances. The error message you are seeing will happen if you have not configured managed scaling for the capacity provider, or if you are using launch type instead of a capacity provider strategy for running your tasks and you have no instances in your cluster. Can you share the details of your specific configuration? Feel free to email me ncoult AT amazon.com.

We will be publishing a deep dive blog that covers how the metric is calculated, but the simple version of it is that CapacityProviderReservation = M/N x 100, where N = the number of instances already in your ASG, and M = the estimated number of instances required to run all existing and provisioning tasks. (A provisioning task is a task that was run using a capacity provider strategy and was assigned to a capacity provider that did not have sufficient capacity to run the task immediately. Tasks run using launch type will not reach this state).

If you have provisioning tasks assigned to that capacity provider then M>=1. If you have no provisioning tasks, then M=the number of instances running at least one non-daemon service task.

Special cases: If N=0 and M>0, then CapacityProviderReservation = 200. If N=0 and M=0, then CapacityProviderReservation = 100.
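The formula and special cases above can be sketched as follows (a minimal illustration, not the production implementation):

```python
def capacity_provider_reservation(m: int, n: int) -> float:
    """CapacityProviderReservation = M / N x 100, as described above:
    n = number of instances already in the ASG,
    m = estimated instances needed for all running and provisioning tasks."""
    if n == 0:
        # Special cases from the comment above: empty ASG.
        return 200.0 if m > 0 else 100.0
    return m / n * 100.0

print(capacity_provider_reservation(m=2, n=3))  # ~66.7: spare capacity, scale in
print(capacity_provider_reservation(m=4, n=3))  # ~133.3: provisioning tasks, scale out
print(capacity_provider_reservation(m=1, n=0))  # 200.0: empty ASG with pending tasks
print(capacity_provider_reservation(m=0, n=0))  # 100.0: empty ASG, nothing to run
```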

coultn commented 4 years ago

@tekaeren You can watch the Re:invent session with a demo that covers scaling out from zero: https://youtu.be/v9xuKAdShFw

tekaeren commented 4 years ago

@coultn Thanks for the explanation. It is very useful to compare against what I see. With the ECS cluster's default provider set to the same one as the service's capacity provider strategy (weight 100) and the target capacity set to 100, what I see is:

While there was capacity available, I tried to force a task deployment, and even added another service associated with the capacity provider; neither action produced a working container.

tekaeren commented 4 years ago

As strange as it can be with technology, the capacity provider metric started to work as advertised overnight. No changes to infrastructure, configuration, etc. Did some new feature release hit the part of the backend I'm using? Good that it works now.

krzysztof-magosa commented 4 years ago

I also had similar problems and they magically self-resolved ;)

kgyovai commented 4 years ago

The way in which the capacity provider manages autoscaling when the number of ECS instances registered to the cluster is zero seems problematic to me. Take a look at the CapacityProviderReservation graph shown below.

[CapacityProviderReservation graph, points A-E]

| Point | ECS Instances | Desired Tasks | Running Tasks | Pending Tasks | Provisioning Tasks |
|-------|---------------|---------------|---------------|---------------|--------------------|
| A     | 0             | 2             | 0             | 2             | 0                  |
| B     | 0             | 2             | 0             | 0             | 0                  |
| C     | 2             | 2             | 0             | 2             | 0                  |
| D     | 2             | 2             | 1             | 1             | 0                  |
| E     | 2             | 2             | 2             | 0             | 0                  |

Notes:

  1. Between A and B, ECS seems to give up trying to place tasks. As @tekaeren described, ECS provides the message service [service_name] was unable to place a task because no container instance met all of its requirements. Reason: No Container Instances were found in your capacity provider. and subsequently sets the number of pending tasks to zero.
  2. At B, I manually increased the desired capacity of the autoscaling group from 0 to 2.
  3. At no point were any of the tasks marked as "provisioning". Which, based on my understanding of how the capacity provider works, is a requirement for (or indication of - depending on the implementation) the capacity provider to initiate a scaling event (i.e. scale out so the 2 desired tasks can be placed).

It seems that the capacity provider will only mark a task as "provisioning" if there are candidate ECS instances in the cluster. If there are no instances in the cluster, the tasks do not appear to be marked as "provisioning", which seems odd because having zero instances in the cluster and 2 desired tasks warrants scaling out.

Additional Details:

coultn commented 4 years ago

@kgyovai Are you using a capacity provider strategy with the service? If you use launch type, the tasks will NOT go to provisioning and you will see the error you are seeing. From the CLI, using a capacity provider strategy for a service looks like this (a few unrelated fields left out for the sake of clarity):

    aws ecs create-service --cluster <cluster name> --service-name <your service name> --capacity-provider-strategy capacityProvider=<provider name>

What does your service show as the capacity provider strategy when you use the describe-service API? If it does not show a capacity provider strategy, then you are not using one; you are just using launch type, which is the older way to create services and run tasks. This means your tasks will not go to provisioning and will not trigger scaling. We provided this option for backwards-compatibility purposes.
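One way to make that check concrete (a sketch assuming the DescribeServices response shape; the helper name is mine, not an AWS API):

```python
# Sketch: inspect the output of `aws ecs describe-services` to see whether a
# service was created with a capacity provider strategy or plain launch type.
def uses_capacity_provider_strategy(service: dict) -> bool:
    """True if the service description carries a non-empty
    capacityProviderStrategy (tasks can reach PROVISIONING); False means the
    service uses launch type and tasks will fail placement on an empty cluster."""
    return bool(service.get("capacityProviderStrategy"))

# Trimmed example service descriptions:
cp_service = {"serviceName": "web",
              "capacityProviderStrategy": [{"capacityProvider": "my-cp", "weight": 100}]}
launch_type_service = {"serviceName": "web", "launchType": "EC2"}

print(uses_capacity_provider_strategy(cp_service))           # True
print(uses_capacity_provider_strategy(launch_type_service))  # False
```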

kgyovai commented 4 years ago

@coultn - Thanks for that clarification. My service was in fact configured to use a launch type rather than the capacity provider strategy. I have applied the required change for that.

Does the capacity provider implicitly manage the desired capacity of the ASG? The ASG desired capacity doesn't reflect the number of actual ECS instances that the capacity provider is managing. What is the relationship between the desired capacity of the ASG and the "desired size" of the capacity provider? Should they match?

coultn commented 4 years ago

When you enable managed scaling with a capacity provider, ECS creates a scaling policy that uses the capacity provider reservation metric. The scaling policy scales the ASG, so the capacity provider indirectly manages the desired capacity of the ASG. Usually the ASG will scale to the size that the scaling policy is requesting, but not always.

You can read more about the scaling policy and the metric here, in a blog post we published earlier this month:

https://aws.amazon.com/blogs/containers/deep-dive-on-amazon-ecs-cluster-auto-scaling/
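A simplified sketch of that indirection (an assumed shape, not the exact ASG algorithm; real target tracking also applies cooldowns, instance warmup, and step-size limits):

```python
import math

def next_desired_capacity(n: int, m: int, target: float = 100.0) -> int:
    """One target-tracking step: scale the ASG's desired count in proportion
    to how far CapacityProviderReservation (M / N * 100) is from the target."""
    if n == 0:
        return 1 if m > 0 else 0  # scale out from zero when tasks are waiting
    reservation = m / n * 100.0
    return math.ceil(n * reservation / target)

# With target 100, the ASG converges to exactly M instances:
print(next_desired_capacity(n=3, m=2))             # 2: scale in one instance
# With target 50, it holds spare capacity (N such that M / N * 100 ~ 50):
print(next_desired_capacity(n=3, m=2, target=50))  # 4
```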

kgyovai commented 4 years ago

@coultn - As stated in the deep-dive documentation that you provided,

"Target values less than 100 enable spare capacity in the ASG. For example, if you set the target value to 50, the scaling policy will try to adjust N so that the equation M / N X 100 = 50 is true."

The target capacity for the capacity provider assigned to my cluster is set to 100 (this is a percentage; not a number of instances - which is not made clear in the API docs) as shown below.

    "managedScaling": {
        "status": "ENABLED",
        "targetCapacity": 100,
        "minimumScalingStepSize": 1,
        "maximumScalingStepSize": 2
    },
    "managedTerminationProtection": "ENABLED"

The ASG managed by the capacity provider is currently showing the following values: [screenshot]

The ECS service definition is requesting tasks as shown: [screenshot]

Given the equation from the "deep-dive" documentation, CapacityProviderReservation = M / N x 100,

Where:

M = The "right" number of running EC2 instances in the ECS cluster
N = The current number of running EC2 instances in the ECS cluster

My cluster has been stuck at a CapacityProviderReservation value of 66.67 for several days. I can see the CloudWatch alarm that has been active during that period.


Can you explain why a scale-in event hasn't occurred in order to make M = N? By setting the target capacity to 100, I have not requested extra capacity.

The only thing that I can think of is this:

"What if your ASG uses multiple instance types or isn’t confined to a single-AZ? In that case, the algorithm described above isn’t necessarily a lower bound, so CAS falls back to a much simpler approach: M = N + minimumScalingStepSize."

Since my ASG is multi-AZ and the value of minimumScalingStepSize is set to 1, does that mean that my cluster will always have excess capacity?

Does setting minimumScalingStepSize to 0 even make sense?

coultn commented 4 years ago

@kgyovai Thanks for the detailed feedback. From what I can tell, your ASG should be scaling in, but it is likely being prevented from doing so because the termination protection flags on the instances are not being removed. Can you see if the ASG has any tags? If so what are they?

kgyovai commented 4 years ago

@coultn See below for the tags that are applied to the ASG.

[screenshot of ASG tags]

Is the AmazonECSManaged tag applied by the capacity provider? That is the only tag of the 4 that I did not personally create.

The EC2 instances themselves have a similar tag with a key of AmazonECSManaged and an empty value.

[screenshot of EC2 instance tags]

mustanggb commented 4 years ago

I'm having this issue as well, running 2 services/tasks.

With...

managedScaling.targetCapacity: 90
managedScaling.status: ENABLED
managedTerminationProtection: DISABLED

...it works as expected and I have 3 instances, 1 for each service/task, and 1 spare ready to go.

However with...

managedScaling.targetCapacity: 90
managedScaling.status: ENABLED
managedTerminationProtection: ENABLED

...it does not work as expected and I have 4 instances, 1 for each service/task, and 2 spare. My AlarmLow is firing, but it's not removing the scale in protection from either of the unused instances and scaling down to 3.

gbataille commented 4 years ago

FYI, another nasty possible error I found. I had the same issue for a while and so did some digging around.

I was using Terraform. The docs say that in the capacity_provider field you can put either the name or the ARN. That does not seem to be the case: you have to put the name. Using the ARN does not work.

I'm raising it on the terraform side too: https://github.com/terraform-providers/terraform-provider-aws/issues/11817

emdotem commented 4 years ago

Hello, I've two bugs:

Did you solve this issue?

mailjunze commented 4 years ago

@emdotem

Have you enabled scale-in protection on the ASG? Is there any instance running in the ASG with scale-in protection disabled? When you create a CP, the ASG should have:

scale-in protection enabled, and no running instance with scale-in protection disabled.

The capacity provider takes around ~15 minutes to scale in an instance when no tasks (except daemons) are running.

ctrongminh commented 4 years ago

Hi @emdotem

iam-j commented 4 years ago

Hi, I have a similar issue: if the target is set to 100, the cluster does not scale out and tasks fail. If the target is set to 90, I am always left with spare capacity, causing 1 or 2 machines to spin up.

A working solution would be great

MatteoInfi commented 4 years ago

I think I should read more on this topic, but right now my capacity provider shows 100% all the time, even though my task is taking a small fraction of the available resources...

PS: Actually, according to the formula it makes sense, as I have 1 instance required for my task (M) and 1 instance in the ASG (N), therefore M/N x 100 = 100. But the formula itself does not make sense: why am I forced to have another instance running while my task is consuming close to nothing? Why not scale according to the resources actually used? This capacity provider seems a bit flawed...

MikeKroell commented 4 years ago

I have two issues related to this. When a new task tries to start, it fails with not having enough CPU reservation. Also, when using ECS Capacity Provider, the warm-up time is 300 seconds, even though the instance and next task are launched within one minute. This value does not appear to be editable.

iam-j commented 4 years ago

> I have two issues related to this. When a new task tries to start, it fails with not having enough CPU reservation. Also, when using ECS Capacity Provider, the warm-up time is 300 seconds, even though the instance and next task are launched within one minute. This value does not appear to be editable.

As mentioned here, the service should use a capacity provider strategy in order to show provisioning instead of insufficient capacity:

https://github.com/aws/containers-roadmap/issues/653#issuecomment-575215795

MikeKroell commented 4 years ago

> I have two issues related to this. When a new task tries to start, it fails with not having enough CPU reservation. Also, when using ECS Capacity Provider, the warm-up time is 300 seconds, even though the instance and next task are launched within one minute. This value does not appear to be editable.

> As mentioned here the service should use Capacity provider strategy in order to show provisioning instead of insufficient capacity

> #653 (comment)

We have this setup, but our application launches singular tasks and does not use a service. The tasks fail without going into provisioning.

kivan-mih commented 4 years ago

> I have two issues related to this. When a new task tries to start, it fails with not having enough CPU reservation. Also, when using ECS Capacity Provider, the warm-up time is 300 seconds, even though the instance and next task are launched within one minute. This value does not appear to be editable.

Hi! I have the same concern as Mike. The uneditable value of 300 seconds for instance warmup with the capacity-provider-managed scaling policy is strange not only in itself; we also see strange behavior on a real cluster. As you can see in the picture, when the cluster tries to scale, it adds more and more instances due to the 300-second warmup and then needs to delete them to enter the steady state. Kinda weird. [screenshot]

MatteoInfi commented 4 years ago

@kivan-mih My suggestion is to not use Capacity provider, they are not production ready and the current design is flawed

kivan-mih commented 4 years ago

> @kivan-mih My suggestion is to not use Capacity provider, they are not production ready and the current design is flawed

Actually we had much worse behavior without capacity provider, so we tried it as a last hope and ... well it works quite ok, and the only concern we have are those useless spikes due to uneditable 300 seconds warmup.

venu-ibex-9 commented 4 years ago

> @coultn See below for the tags that are applied to the ASG. Is the AmazonECSManaged tag applied by the capacity provider? That is the only tag of the 4 that I did not personally create. The EC2 instances themselves have a similar tag with a key of AmazonECSManaged and an empty value.

hi @kgyovai those tags are causing an issue?

kgyovai commented 4 years ago

> @coultn See below for the tags that are applied to the ASG. Is the AmazonECSManaged tag applied by the capacity provider? That is the only tag of the 4 that I did not personally create. The EC2 instances themselves have a similar tag with a key of AmazonECSManaged and an empty value.

> hi @kgyovai those tags are causing an issue?

@venu-ibex-9 - I didn't receive any feedback from @coultn as to whether those tags are an issue or not.

anoopkapoor commented 3 years ago

> @coultn See below for the tags that are applied to the ASG. Is the AmazonECSManaged tag applied by the capacity provider? That is the only tag of the 4 that I did not personally create. The EC2 instances themselves have a similar tag with a key of AmazonECSManaged and an empty value.

> hi @kgyovai those tags are causing an issue?

> @venu-ibex-9 - I didn't receive any feedback from @coultn as to whether those tags are an issue or not.

Hi @kgyovai and @venu-ibex-9 From a tags perspective, everything looks fine as the AmazonECSManaged tag is indeed applied by the capacity provider. Can you confirm that if you enable scale-in protection at the time of ASG creation, scale-in of the instances through capacity provider/cluster auto scaling works fine in that case?

anoopkapoor commented 3 years ago

> @kivan-mih My suggestion is to not use Capacity provider, they are not production ready and the current design is flawed

> Actually we had much worse behavior without capacity provider, so we tried it as a last hope and ... well it works quite ok, and the only concern we have are those useless spikes due to uneditable 300 seconds warmup.

the ability to edit the warm-up time should be coming soon as part of the ability to update capacity provider parameters. https://github.com/aws/containers-roadmap/issues/633

anoopkapoor commented 3 years ago

> I have two issues related to this. When a new task tries to start, it fails with not having enough CPU reservation. Also, when using ECS Capacity Provider, the warm-up time is 300 seconds, even though the instance and next task are launched within one minute. This value does not appear to be editable.

> As mentioned here the service should use Capacity provider strategy in order to show provisioning instead of insufficient capacity #653 (comment)

> We have this setup, but our application launches singular tasks and does not use a service. The tasks fail without going into provisioning.

Hi @MikeKroell can you confirm that you are indeed using a capacity provider strategy when using the runTask API? If yes, can you share the steps to reproduce the issue?

michalsw commented 3 years ago

I am experiencing the following issue. I have target_capacity = 100, so when there are 0 tasks the instance count is also 0. Task CPU and memory are equal or almost equal to the instance's maximum values, so no more than 1 task can be placed on an instance. I decreased the instanceWarmupPeriod.

    "managedTerminationProtection": "ENABLED",
    "managedScaling": {
        "status": "ENABLED",
        "targetCapacity": 100,
        "maximumScalingStepSize": 10000,
        "instanceWarmupPeriod": 15,
        "minimumScalingStepSize": 1
    }

When I place 99 tasks at once, all tasks go into the provisioning state. The first two EC2 instances start, but after that, the scale-out is super slow. It adds one instance at a time instead of 97.

Lots of tasks fail to start because they exceed the 30-minute limit in the provisioning state.

The following is the CapacityProviderReservation metric, which never goes above 200% once the initial 2 instances start: [screenshot]

The scale-in is also slow, one instance at a time.

anoopkapoor commented 3 years ago

Hi @michalsw Do you have a CloudFormation or Terraform template that includes the task definition, service definition, ASG config etc that I can take a look at to reproduce this behavior?

anoopkapoor commented 3 years ago

> @kivan-mih My suggestion is to not use Capacity provider, they are not production ready and the current design is flawed

> Actually we had much worse behavior without capacity provider, so we tried it as a last hope and ... well it works quite ok, and the only concern we have are those useless spikes due to uneditable 300 seconds warmup.

> the ability to edit the warm-up time should be coming soon as part of the ability to update capacity provider parameters. #633

fyi, #633 was launched and closed.

michalsw commented 3 years ago

> Hi @michalsw Do you have a CloudFormation or Terraform template that includes the task definition, service definition, ASG config etc that I can take a look at to reproduce this behavior?

Hi @anoopkapoor Not at this point as the templates are part of a bigger project. I will have to create a simpler terraform template from scratch.

AbhishekNautiyal commented 1 year ago

We have fixed the Capacity Provider scaling logic to resolve the previous behavior of scaling instances out and in (sometimes repeatedly) when the target capacity is set to less than 100%. For the empty-cluster case (no tasks) when a Capacity Provider's target capacity is less than 100%, ECS will keep one spare instance, without cyclically scaling EC2 instances in and out. Note that ECS may initially launch 2 instances but will stabilize and remain at 1 instance, unless more tasks are launched that require a larger number of instances; this is the existing documented behavior, and we further plan to improve it to launch 1 instance in this case. Please let us know if you encounter any other issues with ECS cluster auto scaling behavior.