[EKS] [request]: Managed Nodes scale to 0

mikestef9 commented 4 years ago

Currently, managed node groups have a required minimum of 1 node in a node group. This request is to update that behavior to support node groups of size 0, to unlock batch and ML use cases.

https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-scale-a-node-group-to-0

mathewpower commented 4 years ago

This feature would be great for me. I'm looking to run GitLab workers on my EKS cluster to run ML training workloads. Typically, these jobs only run for a couple of hours a day (on big instances), so being able to scale down would make things much more cost effective for us.

Any ideas when this feature might land?

jzjones-lc commented 4 years ago

@mathewpower you might want to use a vanilla autoscaling group instead of EKS managed.

This issue pretty much makes EKS managed nodes a nonstarter for any ML project, since one node in each group is always on.

jcampbell05 commented 4 years ago

There are tasks now - perhaps that's the solution for this.

jzjones-lc commented 4 years ago

@jcampbell05 can you elaborate? What tasks are you referring to?

yann-soubeyrand commented 4 years ago

I guess that node taints will have to be managed like node labels already are in order for the necessary node template to be set: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/cloudprovider/aws/README.md#scaling-a-node-group-to-0.

mikestef9 commented 4 years ago

Hey @yann-soubeyrand, that is correct. Looking for some feedback on that: would you want all labels and taints to automatically propagate to the ASG in the required format for scale to 0, or have selective control over which ones propagate?

dcherman commented 4 years ago

@mikestef9 If AWS has enough information to propagate the labels/taints to the ASG, then I think it'd be preferable to have it "just work" as much as possible.

I think there will still be scenarios where manual intervention is needed by the consumer, such as setting region/AZ labels for single-AZ nodegroups so that cluster-autoscaler can make intelligent decisions if a specific AZ is needed. However, we should probably try to minimize that work as much as possible.

yann-soubeyrand commented 4 years ago

@mikestef9 in my understanding, all the labels and taints should be propagated to the ASG in the k8s.io/cluster-autoscaler/node-template/[label|taint]/<key> format since the cluster autoscaler takes its decisions based on it. If some taints or labels are missing, this could mislead the cluster autoscaler. Also, I'm not aware of any good reason not to propagate certain labels or taints.
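A minimal sketch of that tag format, for illustration only. The function name and the sample label and taint are hypothetical, and the `value:effect` encoding of a taint tag's value is an assumption that should be checked against the cluster-autoscaler AWS README linked above.

```python
from typing import Dict, Iterable, Tuple

# Tag prefix taken from the format quoted in this thread.
NODE_TEMPLATE_PREFIX = "k8s.io/cluster-autoscaler/node-template"


def node_template_tags(
    labels: Dict[str, str],
    taints: Iterable[Tuple[str, str, str]],
) -> Dict[str, str]:
    """Build ASG tags describing the labels and taints future nodes will carry.

    `taints` is an iterable of (key, value, effect) tuples. Encoding the tag
    value as "value:effect" is an assumption, not confirmed in this thread.
    """
    tags: Dict[str, str] = {}
    for key, value in labels.items():
        tags[f"{NODE_TEMPLATE_PREFIX}/label/{key}"] = value
    for key, value, effect in taints:
        tags[f"{NODE_TEMPLATE_PREFIX}/taint/{key}"] = f"{value}:{effect}"
    return tags


# Hypothetical node group: one label and one taint.
tags = node_template_tags(
    labels={"workload": "gpu"},
    taints=[("dedicated", "gpu", "NoSchedule")],
)
for tag_key, tag_value in sorted(tags.items()):
    print(f"{tag_key} = {tag_value}")
```

With every label and taint mirrored this way, the autoscaler can tell whether a pending pod would fit on a node from a group that currently has no nodes to inspect.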

A feature which could be useful, though, is the ability to disable cluster autoscaler for specific node groups (that is, not setting the k8s.io/cluster-autoscaler/enabled tag on these node groups).

@dcherman isn't the AZ case already managed by cluster autoscaler without specifying label templates?

dcherman commented 4 years ago

@yann-soubeyrand I think you're right! Just read through the cluster-autoscaler code, and it looks like it discovers what AZs the ASG creates nodes in from the ASG itself; I always thought it had discovered those from the nodes initially created by the ASG.

In that case, we can disregard my earlier comment.

Ghazgkull commented 4 years ago

I would like to be able to forcibly scale a managed node group to 0 via the CLI, by setting something like desired or maximum number of nodes to 0. Ignoring things like pod disruption budgets, etc.

I would like this in order for developers to have their own clusters which get scaled to 0 outside of working hours. I would like to use a simple cron to force clusters to size 0 at night, then give them 1 node in the morning and let cluster-autoscaler scale them back up.
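A rough sketch of what such a cron job could compute, assuming the managed node group API accepted a size of 0 (which is exactly what this issue asks for). The function names, the 08:00 to 18:00 working window, and the cluster and node group names are hypothetical; `update_nodegroup_config` with a `scalingConfig` argument is the boto3 EKS client call for changing node group sizes.

```python
from datetime import datetime
from typing import Dict


def scaling_config_for(
    now: datetime, max_size: int, work_start: int = 8, work_end: int = 18
) -> Dict[str, int]:
    """Return the scaling config a cron job would request at time `now`.

    Outside working hours the desired size is 0. During working hours it is 1,
    so that cluster-autoscaler has a node to start from and can scale up.
    """
    working = work_start <= now.hour < work_end
    return {
        "minSize": 0,
        "maxSize": max_size,
        "desiredSize": 1 if working else 0,
    }


def apply_scaling(cluster: str, nodegroup: str, config: Dict[str, int]) -> None:
    """Send the config to EKS. Needs boto3 and AWS credentials at call time."""
    import boto3  # imported lazily so the pure function above has no dependencies

    eks = boto3.client("eks")
    eks.update_nodegroup_config(
        clusterName=cluster, nodegroupName=nodegroup, scalingConfig=config
    )


night = scaling_config_for(datetime(2021, 1, 25, 23, 0), max_size=5)
morning = scaling_config_for(datetime(2021, 1, 25, 9, 0), max_size=5)
print(night, morning)
# A cron entry would then call, with hypothetical names:
# apply_scaling("dev-cluster", "dev-nodes", scaling_config_for(datetime.now(), 5))
```

This only changes the requested sizes; it does not bypass pod disruption budgets or force-terminate anything.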

sibendu commented 4 years ago

Hi all, is this feature already available for AWS EKS? From the following documentation it appears EKS supports it: "From CA 0.6 for GCE/GKE and CA 0.6.1 for AWS, it is possible to scale a node group to 0" https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#how-can-i-scale-a-node-group-to-0 Can someone please confirm?

yann-soubeyrand commented 4 years ago

@sibendu it's not supported with managed node groups yet (that is the subject of this issue), but you can achieve it with unmanaged node groups (following the documentation you linked).

cfarrend commented 4 years ago

Would be great to have this. We use cluster autoscaling to request GPU nodes on demand on GKE and scale down when there are no requests. Having one node idle is definitely not cost effective for us if we want to use managed nodes on EKS.

antonosmond commented 4 years ago

Putting use cases aside (although I have many), autoscaling groups already support min, max & desired size being 0. A node group is ultimately just an autoscaling group (and therefore already supports size 0). You can go into the AWS web console, find the ASG created for a node group and set the size to 0 and it's fine, so it doesn't make sense that node groups don't support a size of zero. As a loyal AWS customer it's frustrating to see things like this - there appears to be no good technical reason for preventing a size of zero, but forcing customers to have at least 1 instance makes AWS more £££. Hmmm... was the decision to prevent a zero size about making it better for the customer, or is Jeff a bit short of cash?

yann-soubeyrand commented 4 years ago

@antonosmond there are good technical reasons why you cannot scale from 0 with the current configuration: for the autoscaler to be able to scale from 0, one has to put tags on the ASG indicating the labels and taints the nodes will have. These tags are missing as of now. This is the purpose of this issue.

antonosmond commented 4 years ago

@yann-soubeyrand The cluster autoscaler is just one use case but this issue shouldn't relate specifically to the cluster autoscaler. The issue should be that you can't set a size of zero and regardless of use case or whether or not you run the cluster autoscaler, you should be able to set a size of zero as this is supported in autoscaling groups.

In addition to the use cases above, other use cases for 0 size include:

PoCs and testing (I may want 0 nodes so I can test my config without incurring instance charges)
having different node groups for different instance types where I don't necessarily need all instance types running at all times
cost saving e.g. scaling to zero overnight / at weekends

yann-soubeyrand commented 4 years ago

@antonosmond if you're not using cluster autoscaler, you're scaling the ASG manually, right? What prevents you from setting a min and desired count to 0? It seems to work as intended.

antonosmond commented 4 years ago

@yann-soubeyrand I got to this issue from here. It's nothing to do with the cluster autoscaler, I simply want to create a node group with an initial size of 0. I have some terraform to create a node group, but if I set the size to 0 it fails because the AWS API behind the resource creation validates that size is greater than zero. Update - and yes I can create a node group with a size of 1 and then manually scale it to zero but I shouldn't need to. The API should allow me to create a node group with a zero size.

yann-soubeyrand commented 4 years ago

> The API should allow me to create a node group with a zero size.

I think we all agree with this ;-)

MatteoMori commented 4 years ago

Hey guys,

is there any update on this one?

thanks!

tomaspinho commented 4 years ago

Not having this makes it exceptionally hard to migrate from one Node Group to another (we are in the process of moving to our own launch templates) without fear of breaking everything, since there is no good rollback procedure.

tlh2857 commented 4 years ago

I agree, this would be a great feature. Having to drain + change ASG desiredInstanceCount is tedious. I have an infrequently accessed application running on EKS that I spin up when needed, and I don't need it to sit idle at 1 instance when it's not being used. Any update on timeline?

abatilo commented 3 years ago

Looking to see if there's any update here?

I believe this is preventing me from having multiple different instance types across multiple different node groups. If I want to have a node group for each size of m5s, now I have to have at least 1 running for each as well even if it's unlikely that I need the 2xl or 4xl.

brettmorien commented 3 years ago

Adding some noise here. Spot instances were one of our hurdles (thanks for delivering!), but we are holding off on moving to managed node groups until we can be assured we won't have large, idle nodes for the sporadic bigger workloads. Any updates here would be helpful.

calebschoepp commented 3 years ago

Yep, this is a bummer and certainly makes migrating from ASGs to managed node groups much less appealing. +1 for this feature.

> Update - and yes I can create a node group with a size of 1 and then manually scale it to zero but I shouldn't need to. The API should allow me to create a node group with a zero size.

When you say manually scale it to zero, do you mean literally change the desired value of the ASG after the fact? Is that permanent - at least until you re-deploy the infra? @antonosmond

dcherman commented 3 years ago

@calebschoepp It seems to be so far, including after running upgrades on the nodegroups. I actually do this using a local-exec provisioner in Terraform after creation of the nodegroup for the ones that I want to scale to 0.

acesir commented 3 years ago

We make heavy use of GPU nodes in EKS with Jupyter Notebooks that will autoscale based on requests and prune notebooks after inactivity. It is impossible for us to migrate to Managed Nodes, as GPU instances are so expensive and we would need one always on. Hoping this gets released sooner rather than later 👍

aSapien commented 3 years ago

I have infrequent bursts of heavy workloads requiring many resources. It doesn't make sense to keep a machine running at all times. Please make this happen!

ma3mool commented 3 years ago

It has been almost a year since this issue was brought up. Any update or timeline on when we might have it?

Thank you!

HTMLGuyLLC commented 3 years ago

I want to downsize dev node groups when no work is actively being done. Please add this feature!

mikestef9 commented 3 years ago

One question we have here as we are working on this feature - we see two options when you create a node group

1. Allow desired size and min size to be 0.
2. Min size can be 0, but desired size still has a minimum of 1 (and can be scaled to 0 desired size after initial creation).

We are leaning towards option 2, as we feel it's better for any node group misconfiguration issue that may cause a node not to join the cluster to be identified up front, but please let us know if you have a use case for desired size to also be set to 0 as part of the node group creation.
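To make the difference concrete, here is a small sketch of what each option would accept in a create request. It is illustrative only: the function name and the rule that desired size may not be below min size are assumptions, not the actual EKS validation logic.

```python
def accepted_at_creation(min_size: int, desired_size: int, option: int) -> bool:
    """Return True if a create request would pass under the given option.

    Option 1: min and desired may both be 0.
    Option 2: min may be 0, but desired must be at least 1 at creation time.
    """
    if option not in (1, 2):
        raise ValueError("option must be 1 or 2")
    if min_size < 0 or desired_size < min_size:
        return False  # assumed sanity rule, shared by both options
    if option == 2:
        return desired_size >= 1
    return True


# Compare both options for a few (min, desired) requests.
for min_size, desired_size in [(0, 0), (0, 1), (1, 1)]:
    print(
        (min_size, desired_size),
        accepted_at_creation(min_size, desired_size, option=1),
        accepted_at_creation(min_size, desired_size, option=2),
    )
```

The only request the two options disagree on is min 0 with desired 0, which is the case the question above is about.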

b2cbre commented 3 years ago

Option 2 is logical and helpful.

ahoward-conga commented 3 years ago

@mikestef9 I think it would be better to go with Option 1.

Option 2 might necessitate doing two updates via IaC tools, one to set the initial state and one to then immediately overwrite that state by setting the desired back to 0.

I think letting the users be always explicit about the minimum and desired size allows for more flexibility in configuration.

I welcome any other thoughts/use cases!

dcherman commented 3 years ago

I'm fine with either of those decisions. For nodegroups where you want to scale to 0, it's highly likely that you're using cluster-autoscaler or another autoscaler to manage the desired size, so the 10-15min that a node would exist before being destroyed is not a dealbreaker imo if it makes identifying misconfigured/unhealthy nodegroups easier.

kreempuff commented 3 years ago

Echoing @HTMLGuyLLC and @acesir

The GPU use case is something I'm currently using and the dev workloads is something I'm planning.

In both of these cases, having a desired count of zero to allow AutoScaling to control the desired count would be ideal.

acesir commented 3 years ago

My personal preference would be option 1 as it doesn't force additional calls in order to downscale after cluster creation. Having said that, either option would work at this point for a much needed feature like this.

tomaspinho commented 3 years ago

+1 for option 1, Terraform users will be much happier. It would be ok to make option 2 the default for dashboard interactions.

ma3mool commented 3 years ago

Either option would work, but also have a preference for option 1.

tlh2857 commented 3 years ago

Prefer option 1 :)


HTMLGuyLLC commented 3 years ago

Honestly, I tried option 1 before reading and seeing that wasn't possible lol. I vote for option 1 as well.

Tried to scale to 0 and it errored about minSize so I tried to set that to zero as well (option didn't exist). Whoops.

Ghazgkull commented 3 years ago

I also lean toward Option 1, because of IAC concerns. But realistically, using Terraform and the EKS TF provider, we only set min and max sizes at creation time; we're not setting a static desired cluster size in our IAC. We just leave sizing up to cluster autoscaler and would only set desired size to 0 when we want to scale a deployed cluster down (e.g. in a nightly cron). So option 2 might also be fine?

matti commented 3 years ago

Option 2 would mean that you unnecessarily start an instance like c5a.24xlarge before it scales down.

GKE made the unfortunate decision to start the minimum number of nodes with the first node pool. So did Scaleway.

Please don't do it again.

groodt commented 3 years ago

Thanks for working on this. I have a strong preference for Option 2, but would accept either option.

My opinion to justify Option 2:

As a cluster operator myself, it is much better to address misconfiguration issues early (fail fast). A person creating a nodegroup (of any size) will always be creating a nodegroup under the assumption that it will eventually have a size > 0. Better to test that this assumption is true, rather than be caught out later when something should scale up but doesn't due to misconfiguration.
In terms of IAC or Terraform picking up changes or requiring extra API calls, I don't think this is an issue. It is common to have this behaviour already with ECS Services (Terraform aws_ecs_service) or ASGs (Terraform aws_autoscaling_group), where the Terraform lifecycle is used to ignore changes to desired counts. Essentially as follows:
```
lifecycle {
  ignore_changes = [desired_count]
}
```
In terms of extra API calls to scale down after creation, I think almost anyone using this feature will be using Kubernetes cluster-autoscaler. I believe that AWS is working to make this very common add-on one of the addons available with EKS so I believe that they will have good support for this out of the box.
For any use-case where a nodegroup size of 0 is desired, it would likely be an environment where the operator is managing adhoc / ephemeral / expensive workloads that are not required to have capacity or be responsive 24/7 (e.g. notebook environments, batch jobs, gpu jobs, ML training, developer environments etc). In these scenarios, I would set up monitoring to ensure that the nodegroup is indeed scaling down to zero at some stage in the day, or use a similar metric to ensure that the nodegroup is not sitting idle for any length of time.

Saying all of the above, perhaps there is an Option 3?

What if users were allowed to specify all 3 values (min, max, desired)? Where: min <= desired <= max.

A user wishing to ensure that no instances are created at all until requested (through cluster-autoscaler or other) could set the values:

min = 0
desired = 0
max = N
lifecycle {
  ignore_changes = [desired]
}

A user wishing to ensure that the nodegroup can join a cluster and can correctly scale up beyond 0 and then scale back down (through cluster-autoscaler or other) could set the values:

min = 0
desired = 1
max = N
lifecycle {
  ignore_changes = [desired]
}

So, if the question is around whether or not the default value for desired should be 0 or 1, I think the default value should be 1 (to avoid the misconfiguration issues mentioned above and fail-fast).

However, this still provides the option for operators to set the desired count to 0 at creation time for those who are certain they do not have a misconfiguration and who are absolutely certain they do not want a node to boot until requested. To me, that gives defaults that are operator "safe" but still gives power-users the option to opt-out of the safety if they desire.

Of course, I can understand how people might perceive this as "costing money" and would prefer the default to be 0.

I can accept that as long as it's possible to specify all 3 values, then I can opt-in to a "safe" desired count of 1 when a new nodegroup is created and the cluster-autoscaler can scale it down for me.
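The Option 3 described in this comment could be sketched roughly as follows: all three values may be given, desired defaults to the "safe" value of 1 when omitted, and the constraint min <= desired <= max is enforced. The function name and the clamping of the default into the [min, max] range are assumptions made for illustration.

```python
from typing import Optional, Tuple


def resolve_sizes(
    min_size: int, max_size: int, desired: Optional[int] = None
) -> Tuple[int, int, int]:
    """Return (min, desired, max), applying a fail-fast default for desired.

    If `desired` is omitted it defaults to 1, clamped into [min_size, max_size]
    (the clamping is an assumption). An explicit 0 opts out of the default.
    """
    if desired is None:
        desired = min(max(min_size, 1), max_size)
    if not 0 <= min_size <= desired <= max_size:
        raise ValueError(
            f"expected 0 <= min <= desired <= max, got "
            f"min={min_size}, desired={desired}, max={max_size}"
        )
    return min_size, desired, max_size


print(resolve_sizes(0, 5))             # default: one node boots to prove the group works
print(resolve_sizes(0, 5, desired=0))  # explicit opt-out: nothing boots until requested
```

Under this sketch the operator who wants the safety check gets it by default, and the operator who wants no instance at all states that explicitly.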

brettmorien commented 3 years ago

Our scenario is to create EKS clusters for other teams using Terraform, allowing them to choose instance types available to their workloads and managed by cluster autoscalers. There can be a significant number of groups.

A desired count of 0 is a valid scenario, and an initial count of 0 is as well. It strikes me as odd that I'd have to work around something for my own protection, using a totally different mechanism, for something that's not dangerous, only possibly perplexing. I'm not sure what that mechanism is... iterating through values making curl calls against the API once the Terraform is done, and then making sure the values are ignored?

The code should do what I ask it and the interface should communicate the rules. Option 1.

armsnyder commented 3 years ago

I prefer option 1 since my team creates ASGs for every possible instance type that might be required by a workload in the cluster, and we let the cluster-autoscaler scale up the more expensive types only when they are needed. We use Terraform from automation for provisioning, and our pipeline expects that we can have our desired state using a single terraform apply operation.

If there is concern about users shooting themselves in the foot, you might add an additional flag like --force-desired-zero to require them to acknowledge what they're doing.

Option 2 might be alright if the terraform-provider-aws decides to add support for option 1, though it is unusual for a Terraform provider to have a different API from the upstream API.

TBBle commented 3 years ago

I prefer Option 1. In our rollout, we have many sets of identical node groups spread across AZs, e.g., due to EBS-CSI and Cluster Autoscaler interactions, and even if we are going to run an instance to validate, we would be validating one of those identical node groups in each set, not all of them.

And we'd validate it by throwing load at it, since we're validating the scale-from-zero case, ensuring that we have the AWS tags for correct Cluster Autoscaler operation, and that won't be tested if we started with a node already running.

talonx commented 3 years ago

+1 for option 1. It's more flexible.

HueponiK commented 3 years ago

What a strange logic for number two.

Would you apply logic like this to normal autoscaling groups? It doesn't make much sense here either.

pre commented 3 years ago

If an operator wishes to go with Option 2, they can do it if Option 1 was the behaviour.

If Option 1 is not the behaviour, users are always locked into Option 2.

Option 1 allows more flexibility, as it should be up to the cluster operator to decide which way to go. With large or specialized instance types, the concern about unnecessary extra costs is a real issue which costs real money.

matti commented 3 years ago

Money is not an issue in this economy. Let's go with the option 2.

aws / containers-roadmap

[EKS] [request]: Managed Nodes scale to 0 #724