gardener / gardener

Homogeneous Kubernetes clusters at scale on any infrastructure using hosted control planes.
https://gardener.cloud
Apache License 2.0

Allow all worker groups with minimum 0 #7857

Closed: himanshu-kun closed this issue 9 months ago

himanshu-kun commented 1 year ago

How to categorize this issue?

/area auto-scaling /kind enhancement

What would you like to be added:

Allow customers to set min: 0 for all worker groups. Currently, at least one worker group must have min > 0 so that the system components can run.
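
For illustration, here is a minimal Go sketch of the kind of check this currently amounts to (hypothetical types and function names, not the actual Gardener validation code):

```go
// Hypothetical sketch of the check implied above; this is NOT the actual
// Gardener validation code, just an illustration of the current rule that
// at least one worker pool must have minimum > 0.
package validation

import "fmt"

// WorkerPool is a simplified stand-in for a worker entry in the Shoot spec.
type WorkerPool struct {
	Name    string
	Minimum int32
	Maximum int32
}

// validateAtLeastOneActivePool fails when every pool has Minimum == 0,
// which is exactly the restriction this issue asks to lift.
func validateAtLeastOneActivePool(pools []WorkerPool) error {
	for _, p := range pools {
		if p.Minimum > 0 {
			return nil
		}
	}
	return fmt.Errorf("at least one worker pool must have minimum > 0 so that system components can be scheduled")
}
```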

Why is this needed:

A customer creates hundreds of shoots, each with a single worker pool (let's say worker-1) that spans three zones: zone-1, zone-2, zone-3.

Our current distribution logic for min over zones tries to keep maxSkew <= 1 between the zones, so zone-1 is always selected to place the node when the cluster specifies min = 1.

At some point zone-1 runs out of instances of that particular type, and the clusters created from then on never become Ready. This is problematic for the customer.
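
A minimal sketch of what such a static distribution looks like, assuming made-up names; this is not the actual gardenlet code:

```go
// Hypothetical sketch of a static min-over-zones distribution that keeps
// maxSkew <= 1; NOT the actual gardenlet logic, only an illustration of why
// min = 1 always lands in the first listed zone.
package main

import "fmt"

// distributeMin spreads min across the zones in the order they are listed,
// one node at a time, so per-zone counts never differ by more than 1.
func distributeMin(min int, zones []string) map[string]int {
	counts := make(map[string]int, len(zones))
	for i := 0; i < min; i++ {
		counts[zones[i%len(zones)]]++
	}
	return counts
}

func main() {
	zones := []string{"zone-1", "zone-2", "zone-3"}
	// Every shoot created from the same stencil lists the zones in the same
	// order, so with min = 1 the node is always placed in zone-1.
	fmt.Println(distributeMin(1, zones)) // map[zone-1:1]
}
```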

This problem can be solved by delegating the responsibility for node distribution (after gardenlet has distributed min/max over the node groups) to the cluster-autoscaler, as it is smart enough to back off from out-of-quota node groups and try other ones. Earlier this was not possible because the CA was deployed much later in the flow, but after this PR the CA is deployed much sooner, soon enough that it can even handle node scale-up for system pods.
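
A rough sketch of the back-off behaviour described above, with hypothetical types; the real cluster-autoscaler logic is considerably more involved:

```go
// Rough, hypothetical sketch of the cluster-autoscaler behaviour described
// above: skip node groups that are currently backed off (e.g. a scale-up
// failed due to quota) and try the next candidate instead.
package main

import "fmt"

type nodeGroup struct {
	name      string
	backedOff bool // set after a failed scale-up, e.g. out of quota
}

// pickNodeGroup returns the first candidate that is not backed off.
func pickNodeGroup(candidates []nodeGroup) (string, bool) {
	for _, ng := range candidates {
		if !ng.backedOff {
			return ng.name, true
		}
	}
	return "", false
}

func main() {
	groups := []nodeGroup{
		{name: "worker-1-zone-1", backedOff: true}, // quota exhausted
		{name: "worker-1-zone-2"},
		{name: "worker-1-zone-3"},
	}
	if name, ok := pickNodeGroup(groups); ok {
		fmt.Println("scaling up", name) // scaling up worker-1-zone-2
	}
}
```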

After a combined discussion with @vlerenc, @unmarshall, @kon-angelo, and the affected customers, the following combination should solve the problem stated above:

For more context refer:

canary issue 3546

Dependency:

unmarshall commented 1 year ago

In general, the current distribution logic only statically distributes the min (defined at the worker pool) amongst the machine deployments. In the specific case where lots of small clusters (with min = 1) are created across 3 zones, the first zone picked depends on the order in which the zones are specified when defining a shoot, thereby increasing the chances of quota exhaustion in the first zone. The customer in this case uses the same stencil to create all of their clusters, which results in the same zone order every time.

Other alternatives that we discussed:

Alternative 1: We also discussed whether the customer could randomise the order in which the zones are specified, but we soon found that this is also not optimal, for the following reasons:

The issue is further exacerbated by the fact that the cluster is not marked as ready until all min machines are up and ready. So if the customer waits for this condition before deploying their workloads, the condition will never be met if one of the zones selected for bringing up VMs is already out of quota. Once the PR is released, customers will no longer have to wait for all machines to be up and ready, as the CA will see the unscheduled pods, back off from one zone, and scale another zone instead. However, this would additionally require a way for the customer to know whether at least some nodes are available in the cluster.
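
To make the rejected alternative concrete, here is a tiny sketch of per-shoot zone shuffling (zone names are assumptions):

```go
// Minimal sketch of Alternative 1, just to make the rejected idea concrete:
// shuffle the zone order per shoot so the first listed zone is not always
// the same one. Zone names are assumptions.
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	zones := []string{"zone-1", "zone-2", "zone-3"}
	rand.Shuffle(len(zones), func(i, j int) {
		zones[i], zones[j] = zones[j], zones[i]
	})
	// The randomly chosen first zone still receives the min = 1 node, so an
	// out-of-quota zone can still be hit and the cluster never becomes ready.
	fmt.Println("zone order for this shoot:", zones)
}
```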

Alternative 2: To ensure quota availability, customers could create a capacity reservation at the zone or region level. Then, even with the current distribution logic, their VMs would come up. However, capacity reservations are not free, and the customers did not consider them a viable, cost-effective option.

As mentioned above by @himanshu-kun, allowing min = 0 lets the CA take up the responsibility of choosing the zone in which a VM is launched. If you have any better suggestions or alternatives, we would like to discuss them.

himanshu-kun commented 1 year ago

/assign @elankath

elankath commented 1 year ago

This backlog will be transformed into an epic with 3 stages:

  1. Adjust shoot validation to support min-0 worker pools: This is a very simple adjustment to the validation so that worker pools with Autoscaler Minimum specified as zero are no longer rejected.
  2. Support dynamic worker node group distribution: The node group distribution across zones is currently computed statically for the autoscaler. A design proposal to support dynamic node group distribution will be presented and incorporated. This will allow adjusting min/max for worker pools across zones. It requires changes and enhancements in gardener, the gardener extension providers, and the autoscaler cloud provider implementation. (The generated deployment spec of the autoscaler will also change to avoid the use of the --nodes flag; see the sketch after this list.) The shoot spec will also receive some enhancements.
  3. Support explicit worker node group specification: This is where the operator has a very clear idea of the worker pool configuration per zone. This is covered in https://github.com/gardener/gardener/issues/8142
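
For context on the `--nodes` flag mentioned in stage 2, here is a hypothetical sketch of how per-zone machine deployments are registered with the autoscaler today (names and numbers are made up):

```go
// Hypothetical helper illustrating the --nodes flags mentioned in stage 2:
// with the current static distribution, each per-zone machine deployment is
// registered with the autoscaler as --nodes=<min>:<max>:<node group name>.
// The names and numbers below are made up for illustration only.
package main

import "fmt"

type machineDeployment struct {
	name     string
	min, max int32
}

func nodesFlags(deployments []machineDeployment) []string {
	flags := make([]string, 0, len(deployments))
	for _, md := range deployments {
		flags = append(flags, fmt.Sprintf("--nodes=%d:%d:%s", md.min, md.max, md.name))
	}
	return flags
}

func main() {
	fmt.Println(nodesFlags([]machineDeployment{
		{name: "shoot--proj--cluster-worker-1-z1", min: 1, max: 3},
		{name: "shoot--proj--cluster-worker-1-z2", min: 0, max: 3},
		{name: "shoot--proj--cluster-worker-1-z3", min: 0, max: 3},
	}))
}
```

Stage 2 would instead let the autoscaler decide the per-zone split dynamically, so such static per-zone minimums would no longer need to be baked into the generated deployment spec.
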
gardener-ci-robot commented 11 months ago

The Gardener project currently lacks enough active contributors to adequately respond to all issues. This bot triages issues according to the following rules:

You can:

/lifecycle stale

gardener-ci-robot commented 10 months ago

The Gardener project currently lacks enough active contributors to adequately respond to all issues. This bot triages issues according to the following rules:

You can:

/lifecycle rotten

gardener-ci-robot commented 9 months ago

The Gardener project currently lacks enough active contributors to adequately respond to all issues. This bot triages issues according to the following rules:

You can:

/close

gardener-prow[bot] commented 9 months ago

@gardener-ci-robot: Closing this issue.

In response to [this](https://github.com/gardener/gardener/issues/7857#issuecomment-1938586287):

> The Gardener project currently lacks enough active contributors to adequately respond to all issues.
> This bot triages issues according to the following rules:
>
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
>
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
elankath commented 9 months ago

Addendum: Stage 1 was satisfied simply by https://github.com/gardener/gardener/pull/8490