I've seen similar behavior. Moreover, I can't place any tasks in the ECS cluster because the cluster state is 'in provisioning' even though more than enough resources are available. It's completely unclear (at least to me) how the value of the capacity metric is calculated.
@coultn I've encountered the same situation as described above. The ECS cluster is stable, with all tasks started and deployed to dedicated ECS instances. There is no capacity provider assigned to the ECS cluster. Now, adding a new capacity provider results in a capacity metric that jumps from 0 to 100 every ten-something minutes and then drops to 0 when the connected ASG starts its instances. After 15 minutes, when the metric alarm scales out, the metric jumps to 100 again.
So in a balanced and stable cluster, we've got ECS instances scaling in and out in a 15-minute cycle, burning money.
If you set a target capacity to less than 100%, you are requesting to have spare capacity in your cluster. When there are no tasks running, the scaling policy will still periodically add instances in an attempt to maintain spare capacity. If you want your instances to scale completely down to zero when no tasks are running, you should use a target capacity of 100%.
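For reference, a minimal CLI sketch of creating a capacity provider with a 100% target capacity (the provider name and ASG ARN below are placeholders, not from this thread):

```
# Create a capacity provider whose managed scaling targets 100% utilization,
# so the ASG can scale all the way down to zero when no tasks are running.
aws ecs create-capacity-provider \
  --name my-capacity-provider \
  --auto-scaling-group-provider "autoScalingGroupArn=<ASG ARN>,managedScaling={status=ENABLED,targetCapacity=100,minimumScalingStepSize=1,maximumScalingStepSize=10},managedTerminationProtection=ENABLED"
```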
@coultn I agree, setting the target capacity to 100[%] will let instances scale down to zero. But it seems that since the CapacityProviderReservation metric is a percentage, it is capped at 100. Adding a task to such an empty cluster (no running tasks and zero instances, meaning 100% CapacityProviderReservation) will not raise the CapacityProviderReservation over 100, and thus the scale-out alarm on the connected ASG does not trigger.
So adding a task to such a cluster fails with the message: service [ecs_service] was unable to place a task because no container instance met all of its requirements. Reason: No Container Instances were found in your cluster.
It seems to me that the CapacityProviderReservation metric must allow values over 100 when task demand exceeds the provisioned capacity in the cluster.
It would also be nice to know how the CapacityProviderReservation metric is calculated - that would help with utilizing capacity to its full extent.
It seems that the CapacityProviderReservation metric takes into account only deployed/started tasks (used cluster capacity) and not the ones waiting in the queue for free resources.
@coultn To bypass the 100% value of CapacityProviderReservation when no instances are running in the ASG, I've attached to the ASG a scale-up scaling policy triggered by the same alarm that triggers scaling of the ECS tasks on the cluster. That works - the tasks start and ECS instances spawn. But this also breaks the capacity provider entirely: with the ASG having another policy attached, the Capacity Provider wizard creates only the CloudWatch metric, but none of the alarms or scaling policies for the ASG to do the scaling (yes, I create the capacity provider with the managed option turned on).
It seems like the Capacity Provider has not been tested, and it is not ready for usage at all.
@tekaeren The capacity provider reservation can and will go above 100 if you have tasks in the provisioning state, and it does already function that way. It can and will scale your ASG out from zero instances. The error message you are seeing will happen if you have not configured managed scaling for the capacity provider, or if you are using launch type instead of a capacity provider strategy for running your tasks and you have no instances in your cluster. Can you share the details of your specific configuration? Feel free to email me ncoult AT amazon.com.
We will be publishing a deep dive blog that covers how the metric is calculated, but the simple version of it is that CapacityProviderReservation = M/N x 100, where N = the number of instances already in your ASG, and M = the estimated number of instances required to run all existing and provisioning tasks. (A provisioning task is a task that was run using a capacity provider strategy and was assigned to a capacity provider that did not have sufficient capacity to run the task immediately. Tasks run using launch type will not reach this state).
If you have provisioning tasks assigned to that capacity provider then M>=1. If you have no provisioning tasks, then M=the number of instances running at least one non-daemon service task.
Special cases: If N=0 and M>0, then CapacityProviderReservation = 200. If N=0 and M=0, then CapacityProviderReservation = 100.
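To make the formula concrete with a hypothetical example (the numbers are illustrative, not taken from this thread): with a target capacity of 100, if 2 instances are running (N = 2) but 3 instances are needed to place all existing and provisioning tasks (M = 3), then CapacityProviderReservation = 3/2 x 100 = 150; since 150 is above the target, the policy scales the ASG out. Conversely, with N = 2 and M = 1, the metric is 1/2 x 100 = 50, which is below the target, so the ASG scales in.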
@tekaeren You can watch the Re:invent session with a demo that covers scaling out from zero: https://youtu.be/v9xuKAdShFw
@coultn Thanks for the explanation. It is very useful to compare against what I see. With an ECS cluster whose default capacity provider is set to the same one used in the service's capacity provider strategy (weight 100), and with the target capacity set to 100, what I see is:
During the time when there was capacity available, I tried to force a task deployment and even added another service associated with the capacity provider; neither action produced a working container.
As strange as it can be with technology, the capacity provider metric started working as advertised overnight. No changes to infrastructure, configuration, etc. Perhaps some new feature release hit the part of the backend I'm using? Good that it works now.
I also had similar problems and they magically self-resolved ;)
The way in which the capacity provider manages autoscaling when the number of ECS instances registered to the cluster is zero seems problematic to me. Take a look at the CapacityProviderReservation graph shown below.
Point | ECS Instances | Desired Tasks | Running Tasks | Pending Tasks | Provisioning Tasks |
---|---|---|---|---|---|
A | 0 | 2 | 0 | 2 | 0 |
B | 0 | 2 | 0 | 0 | 0 |
C | 2 | 2 | 0 | 2 | 0 |
D | 2 | 2 | 1 | 1 | 0 |
E | 2 | 2 | 2 | 0 | 0 |
Notes:
ECS reports the error `service [service_name] was unable to place a task because no container instance met all of its requirements. Reason: No Container Instances were found in your capacity provider.` and subsequently sets the number of pending tasks to zero. It seems that the capacity provider will only mark a task as "provisioning" if there are candidate ECS instances in the cluster. If there are no instances in the cluster, the tasks do not appear to be marked as "provisioning", which seems odd because having zero instances in the cluster and 2 desired tasks warrants scaling out.
Additional Details:
maximum_scaling_step_size = 2
minimum_scaling_step_size = 1
target_capacity = 100
max_size = 4
min_size = 0
desired_capacity = 0
@kgyovai Are you using a capacity provider strategy with the service? If you use launch type, the tasks will NOT go to provisioning and you will see the error you are seeing. From the CLI, using a capacity provider strategy for a service looks like this (a few unrelated fields left out for the sake of clarity): `aws ecs create-service --cluster <cluster name> --service-name <your service name> --capacity-provider-strategy capacityProvider=<provider name>`
What does your service show as the capacity provider strategy when you use the describe-services API? If it does not show a capacity provider strategy, then you are not using one; you are just using launch type, which is the older way to create services and run tasks. This means your tasks will not go to provisioning and will not trigger scaling. We provided this option for backwards-compatibility purposes.
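For anyone checking this on their own service, a minimal sketch of the corresponding CLI call (cluster and service names are placeholders):

```
# Show the capacity provider strategy attached to the service.
# An empty/null result means the service was created with a launch type instead.
aws ecs describe-services \
  --cluster <cluster name> \
  --services <service name> \
  --query 'services[0].capacityProviderStrategy'
```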
@coultn - Thanks for that clarification. My service was in fact configured to use a launch type rather than the capacity provider strategy. I have applied the required change for that.
Does the capacity provider implicitly manage the desired capacity of the ASG? The ASG desired capacity doesn't reflect the number of actual ECS instances that the capacity provider is managing. What is the relationship between the desired capacity of the ASG and the "desired size" of the capacity provider? Should they match?
When you enable managed scaling with a capacity provider, ECS creates a scaling policy that uses the capacity provider reservation metric. The scaling policy scales the ASG. So, the capacity provider indirectly manages the desired capacity of the ASG. Usually the ASG will scale to the size that the scaling policy is requesting, but not always:
You can read more about the scaling policy and the metric here, in a blog post we published earlier this month:
https://aws.amazon.com/blogs/containers/deep-dive-on-amazon-ecs-cluster-auto-scaling/
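If you want to see the target tracking policy that managed scaling attaches to the ASG, something like the following should list it (a sketch; the ASG name is a placeholder and the generated policy name will differ per account):

```
# List the scaling policies on the ASG; the ECS-managed policy tracks the
# CapacityProviderReservation metric described in the blog post above.
aws autoscaling describe-policies \
  --auto-scaling-group-name <ASG name>
```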
@coultn - As stated in the deep-dive documentation that you provided,
"Target values less than 100 enable spare capacity in the ASG. For example, if you set the target value to 50, the scaling policy will try to adjust N so that the equation M / N X 100 = 50 is true."
The target capacity for the capacity provider assigned to my cluster is set to 100 (this is a percentage; not a number of instances - which is not made clear in the API docs) as shown below.
"managedScaling": {
"status": "ENABLED",
"targetCapacity": 100,
"minimumScalingStepSize": 1,
"maximumScalingStepSize": 2
},
"managedTerminationProtection": "ENABLED"
The ASG managed by the capacity provider is currently showing the following values:
The ECS Service definition is requesting tasks as shown:
Given the equation from the "deep-dive" documentation, CapacityProviderReservation = M / N x 100, where:
M = the "right" number of running EC2 instances in the ECS cluster
N = the current number of running EC2 instances in the ECS cluster
My cluster has been stuck at a `CapacityProviderReservation` value of 66.67 for several days. I can see the CloudWatch alarm that has been active during that period.
Can you explain why a scale-in event hasn't occurred in order to make M = N? By setting the target capacity to 100, I have not requested extra capacity.
The only thing that I can think of is this:
"What if your ASG uses multiple instance types or isn’t confined to a single-AZ? In that case, the algorithm described above isn’t necessarily a lower bound, so CAS falls back to a much simpler approach: M = N + minimumScalingStepSize."
Since my ASG is multi-AZ and the value of `minimumScalingStepSize` is set to 1, does that mean that my cluster will always have excess capacity? Does setting `minimumScalingStepSize` to 0 even make sense?
@kgyovai Thanks for the detailed feedback. From what I can tell, your ASG should be scaling in, but it is likely being prevented from doing so because the termination protection flags on the instances are not being removed. Can you see if the ASG has any tags? If so what are they?
@coultn See below for the tags that are applied to the ASG.
Is the `AmazonECSManaged` tag applied by the capacity provider? That is the only tag of the 4 that I did not personally create.
The EC2 instances themselves have a similar tag with a key of `AmazonECSManaged` and an empty value.
I'm having this issue as well, running 2 services/tasks.
With...
managedScaling.targetCapacity: 90
managedScaling.status: ENABLED
managedTerminationProtection: DISABLED
...it works as expected and I have 3 instances, 1 for each service/task, and 1 spare ready to go.
However with...
managedScaling.targetCapacity: 90
managedScaling.status: ENABLED
managedTerminationProtection: ENABLED
...it does not work as expected and I have 4 instances, 1 for each service/task, and 2 spare. My `AlarmLow` alarm is firing, but it's not removing the scale-in protection from either of the unused instances and scaling down to 3.
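As a stopgap while this is broken, the scale-in protection can be cleared manually on the idle instances (a hedged sketch; the instance ID and ASG name are placeholders):

```
# Remove scale-in protection from a specific instance so the ASG is allowed
# to terminate it on the next scale-in.
aws autoscaling set-instance-protection \
  --auto-scaling-group-name <ASG name> \
  --instance-ids i-0123456789abcdef0 \
  --no-protected-from-scale-in
```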
FYI, another nasty possible error I found. I had the same issue for a while and so did some digging around.
I was using Terraform. The doc says that in the `capacity_provider` field, you can put either the name or the ARN. It seems that's not the case and that you have to put the name; using the ARN does not work.
It seems that
I'm raising it on the terraform side too: https://github.com/terraform-providers/terraform-provider-aws/issues/11817
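For comparison, here is roughly what the equivalent association looks like with the AWS CLI, referencing the capacity provider by its short name (all names below are placeholders):

```
# Associate the capacity provider with the cluster and make it the default strategy.
aws ecs put-cluster-capacity-providers \
  --cluster <cluster name> \
  --capacity-providers my-capacity-provider \
  --default-capacity-provider-strategy capacityProvider=my-capacity-provider,weight=1
```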
Hello, I've two bugs:
The managed termination protection setting for the capacity provider is invalid. To enable managed termination protection for a capacity provider, the Auto Scaling group must have instance protection from scale in enabled.
However, I've set the Termination protection on the ASG properly.
Did you solve this issue?
@emdotem
Have you enabled scale-in protection on the ASG? Is there any instance in the ASG running with scale-in protection disabled? When you create a CP, the ASG should have:
scale-in protection enabled, and no instances running with scale-in protection disabled.
The capacity provider takes around ~15 min to scale in an instance when no tasks (except daemon tasks) are running.
Hi @emdotem
Problem 1: you need to have "Protect from Scale in" enabled on the ASG before associating the ASG with an ECS capacity provider (a CLI sketch follows below).
Problem 2: As per https://github.com/aws/containers-roadmap/issues/633, an ECS capacity provider is currently an immutable object. Once created, it cannot be modified or deleted. You will need to create another ASG to create another capacity provider; you can't create a new capacity provider and associate it with an ASG that is already associated with one.
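For Problem 1, a minimal sketch of turning the protection on for an existing ASG before creating the capacity provider (the ASG name is a placeholder):

```
# Enable scale-in protection for instances the ASG launches from now on.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name <ASG name> \
  --new-instances-protected-from-scale-in
```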
Hi, I have a similar issue: if the target is set to 100, the cluster does not scale out and tasks fail. If the target is set to 90, I am always left with spare capacity, causing 1 or 2 extra machines to spin up.
A working solution would be great
I think I should read more on this topic, but right now my capacity provider shows 100% all the time, even though my task is taking a small fraction of the resources available...
PS: Actually, according to the formula it makes sense: I have 1 instance (N = 1) and 1 instance required for my task (M = 1), therefore M/N x 100 = 100. But the formula itself does not make sense to me - why am I forced to have another instance running while my task is consuming close to nothing?
Why not scale according to the resources actually used?
This Capacity provider seems a bit flawed...
I have two issues related to this. When a new task tries to start, it fails with not having enough CPU reservation. Also, when using ECS Capacity Provider, the warm-up time is 300 seconds, even though the instance and next task are launched within one minute. This value does not appear to be editable.
> I have two issues related to this. When a new task tries to start, it fails with not having enough CPU reservation. Also, when using ECS Capacity Provider, the warm-up time is 300 seconds, even though the instance and next task are launched within one minute. This value does not appear to be editable.
As mentioned here the service should use Capacity provider strategy in order to show provisioning instead of insufficient capacity
https://github.com/aws/containers-roadmap/issues/653#issuecomment-575215795
> I have two issues related to this. When a new task tries to start, it fails with not having enough CPU reservation. Also, when using ECS Capacity Provider, the warm-up time is 300 seconds, even though the instance and next task are launched within one minute. This value does not appear to be editable.
>
> As mentioned here the service should use Capacity provider strategy in order to show provisioning instead of insufficient capacity
We have this setup, but our application launches singular tasks and does not use a service. The tasks fail without going into provisioning.
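For standalone tasks the same rule applies as for services: the RunTask call has to specify a capacity provider strategy rather than a launch type, otherwise the task never enters the provisioning state. A minimal sketch (all names are placeholders):

```
# Run a standalone task via the capacity provider so it can go into PROVISIONING
# and trigger scaling instead of failing immediately when no instances exist.
aws ecs run-task \
  --cluster <cluster name> \
  --task-definition <task definition> \
  --count 1 \
  --capacity-provider-strategy capacityProvider=my-capacity-provider,weight=1
```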
> I have two issues related to this. When a new task tries to start, it fails with not having enough CPU reservation. Also, when using ECS Capacity Provider, the warm-up time is 300 seconds, even though the instance and next task are launched within one minute. This value does not appear to be editable.
Hi! I have the same concern as Mike. The uneditable value of 300 seconds for instance warmup with the Capacity Provider managed scaling policy is strange not only by itself; we also see strange behavior on the real cluster. As you can see in the picture, when the cluster tries to scale, it adds more and more instances due to the 300-second warmup and then needs to delete them to reach a steady state. Kinda weird.
@kivan-mih My suggestion is to not use Capacity provider, they are not production ready and the current design is flawed
> @kivan-mih My suggestion is to not use Capacity provider, they are not production ready and the current design is flawed
Actually we had much worse behavior without capacity provider, so we tried it as a last hope and ... well it works quite ok, and the only concern we have are those useless spikes due to uneditable 300 seconds warmup.
> @coultn See below for the tags that are applied to the ASG. Is the `AmazonECSManaged` tag applied by the capacity provider? That is the only tag of the 4 that I did not personally create. The EC2 instances themselves have a similar tag with a key of `AmazonECSManaged` and an empty value.
hi @kgyovai those tags are causing an issue?
> @coultn See below for the tags that are applied to the ASG. Is the `AmazonECSManaged` tag applied by the capacity provider? That is the only tag of the 4 that I did not personally create. The EC2 instances themselves have a similar tag with a key of `AmazonECSManaged` and an empty value.
>
> hi @kgyovai those tags are causing an issue?
@venu-ibex-9 - I didn't receive any feedback from @coultn as to whether those tags are an issue or not.
> @coultn See below for the tags that are applied to the ASG. Is the `AmazonECSManaged` tag applied by the capacity provider? That is the only tag of the 4 that I did not personally create. The EC2 instances themselves have a similar tag with a key of `AmazonECSManaged` and an empty value.
>
> hi @kgyovai those tags are causing an issue?
>
> @venu-ibex-9 - I didn't receive any feedback from @coultn as to whether those tags are an issue or not.
Hi @kgyovai and @venu-ibex-9. From a tags perspective, everything looks fine, as the AmazonECSManaged tag is indeed applied by the capacity provider. Can you confirm whether scale-in of the instances through the capacity provider/cluster auto scaling works fine if you enable scale-in protection at the time of ASG creation?
> @kivan-mih My suggestion is to not use Capacity provider, they are not production ready and the current design is flawed
>
> Actually we had much worse behavior without capacity provider, so we tried it as a last hope and ... well it works quite ok, and the only concern we have are those useless spikes due to uneditable 300 seconds warmup.
the ability to edit the warm-up time should be coming soon as part of the ability to update capacity provider parameters. https://github.com/aws/containers-roadmap/issues/633
> I have two issues related to this. When a new task tries to start, it fails with not having enough CPU reservation. Also, when using ECS Capacity Provider, the warm-up time is 300 seconds, even though the instance and next task are launched within one minute. This value does not appear to be editable.
>
> As mentioned here the service should use Capacity provider strategy in order to show provisioning instead of insufficient capacity #653 (comment)
>
> We have this setup, but our application launches singular tasks and does not use a service. The tasks fail without going into provisioning.
Hi @MikeKroell can you confirm that you are indeed using a capacity provider strategy when using the runTask API? If yes, can you share the steps to reproduce the issue?
I am experiencing the following issue. I have `target_capacity = 100`, so when there are 0 tasks the instance count is also 0. Task CPU and memory are almost equal or equal to the instance's maximum values, so no more than 1 task can be placed on an instance. I decreased the `instanceWarmupPeriod`.
"managedTerminationProtection": "ENABLED", "managedScaling": { "status": "ENABLED", "targetCapacity": 100, "maximumScalingStepSize": 10000, "instanceWarmupPeriod": 15, "minimumScalingStepSize": 1 }
When I place 99 tasks at once, all tasks go into the provisioning state. The first two EC2 instances start, but after that, the scale-out is super slow. It adds one instance at a time instead of 97.
Lots of tasks fail to start because they exceed the 30-minute limit in the provisioning state.
The following is the `CapacityProviderReservation` metric, which never goes above 200% when the initial 2 instances start.
The scale-in is also slow, one instance at a time.
Hi @michalsw Do you have a CloudFormation or Terraform template that includes the task definition, service definition, ASG config etc that I can take a look at to reproduce this behavior?
> @kivan-mih My suggestion is to not use Capacity provider, they are not production ready and the current design is flawed
>
> Actually we had much worse behavior without capacity provider, so we tried it as a last hope and ... well it works quite ok, and the only concern we have are those useless spikes due to uneditable 300 seconds warmup.
>
> the ability to edit the warm-up time should be coming soon as part of the ability to update capacity provider parameters. #633
fyi, #633 was launched and closed.
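Now that #633 has shipped, updating the capacity provider should allow changing the warm-up period on an existing provider; a minimal sketch (the provider name and value are placeholders):

```
# Update the managed scaling settings, including the instance warm-up period.
aws ecs update-capacity-provider \
  --name my-capacity-provider \
  --auto-scaling-group-provider "managedScaling={status=ENABLED,targetCapacity=100,instanceWarmupPeriod=60}"
```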
> Hi @michalsw Do you have a CloudFormation or Terraform template that includes the task definition, service definition, ASG config etc that I can take a look at to reproduce this behavior?
Hi @anoopkapoor Not at this point as the templates are part of a bigger project. I will have to create a simpler terraform template from scratch.
We have fixed the Capacity Provider scaling logic to resolve the previous behavior of scaling out and scaling in instances (sometimes repeatedly) when the target capacity is set to less than 100%. For the empty-cluster case (no tasks) when a Capacity Provider's target capacity is less than 100%, ECS will keep one spare instance, without cyclically scaling EC2 instances in and out. Note that ECS may initially launch 2 instances but will stabilize and remain at 1 instance, unless more tasks are launched that require a larger number of instances; this is the existing documented behavior, and we further plan to improve it to launch only 1 instance in this case. Please let us know if you encounter any other issues with ECS cluster auto scaling behavior.
Hello
I have noticed that when I create a capacity provider for an empty ECS cluster with the target capacity set to any value less than 100 (I tested using a 90% target capacity), the ASG keeps creating 2 instances and deleting them, stuck in this loop. However, I expect no instances to be launched, as no tasks have been started yet.
But when I changed the target capacity to 100, the ASG terminated the instances and didn't start new instances after that. When there are no tasks running and no ECS instances started, the capacity provider metric is 100%.