[Extensions] Improve maxSurge/maxUnavailable handling for multi-zone worker pools

timuthy commented 3 years ago

How to categorize this issue?

/area usability /kind enhancement /priority 3

What would you like to be added: Today, maxSurge and maxUnavailable values are configured at the worker pool level (ref). Provider extensions usually distribute the configured values if multiple multiple zones are configured (ref).

Although distributing these numbers is generally acceptable, it seems unclear to end-users and thus can end in an unacceptable and unexpected cluster upgrade behavior. This is especially true when maxSurge < len(zones) and maxSurge < len(zones) && maxUnavailable < maxSurge

Example:

    workers:
        name: worker
        machine:
          type: n1-standard-4
          image:
            name: gardenlinux
            version: 318.8.0
        maximum: 5
        minimum: 3
        maxSurge: 1
        maxUnavailable: 0
        zones:
        - europe-west1-a
        - europe-west1-b
        - europe-west1-c

This will result in 3 MachineDeployments:

MachineDeployment	Zone	maxSurge
worker-z1	europe-west1-a	1
worker-z2	europe-west1-b	0
worker-z3	europe-west1-b	0

While the workers in europe-west1-a are upgraded in a rolling fashion, the ones in europe-west1-b and europe-west1-c are just replaced. During the upgrade procedure, the cluster will have less Nodes then configured in workers[*].minimum.

We see the following options to improve this user experience (only when maxSurge < len(zones)):

Change API validation so that maxSurge >= len(zones) --> incompatible and will probably many automation functionalities around Gardener.
Automatically set maxSurge: 1 for each zone (suggested by @AxiomSamarth @himanshu-kun) --> solves many "standard" cases in which maxUnavailable is not used.
The worker actuator sets the configured values zone by zone when an upgrade is performed --> comes close to what is expected by end-users but implies long running worker reconciliations.
Other thoughts?

Why is this needed: Needed for better user experience to avoid unexpected outages.

vlerenc commented 3 years ago

Other thoughts?

@timuthy maxSurge = ceil(maxSurge / numZones) and maxUnavailable = ceil(maxUnavailable / numZones), i.e. a maxSurge of 1 with 3 zones, turns into 1/3, turns into 1, but it also generally works for larger numbers, though that may not be as important, if at all (but it is more reasonable and less magical).

The main point is that we neglected this problem from the start. It's not only about maxSurge/maxUnavailable. It's also about minimum/maximum. Also those should better use ceil() or, even better yet, the configuration should be "the same per zone". Just like with GKE, for instance. There, if you add a new pool, you specify how man nodes per zone are added (one number times the number of zones results in the overall number of required nodes, which is therefore always an integer multiple of the number per zone - while we on the other hand interpret the number for all zones and then divide by the number of zones and by that have to deal with the division remainder/leave the realm of integer numbers ;-)).

So the next thought is to give up on "hey, it's logical, everything in the configuration per worker pool is meant for the entirety of the machines" (the instance type, the OS, the volume, the container runtime, the taints/labels/annotations, the kubelet configuration). And to be honest, it isn't that "logical" either (which all adds to the end user confusion). For instance dataVolumes is not divided by the number of nodes/zones, but per node. So (unless we want to allow people to tweak the configuration even per zone; probably no use case for that unless in very narrow corner cases where some instance types are "out of stock" in particular zones), we could also have new dedicated properties, e.g. minMachinesPerZone, maxMachinesPerZone, maxSurgeMachinesPerZone, maxUnavailableMachinesPerZone which would finally end the ambiguity.

gardener-robot commented 2 years ago

@dguendisch, @mvladev, @ialidzhikov, @plkokanov, @timuthy, @vpnachev, @schrodit, @danielfoehrkn, @beckermax, @rfranzke, @timebertt, @hendrikkahl, @kris94, @voelzmo, @stoyanr This issue was referenced by @vlerenc in duplicate issue gardener/autoscaler#104.

gardener-ci-robot commented 2 years ago

The Gardener project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed You can:
Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close

/lifecycle stale

gardener-ci-robot commented 2 years ago

The Gardener project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close

/lifecycle rotten

gardener-ci-robot commented 2 years ago

The Gardener project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Reopen this issue or PR with /reopen
Mark this issue or PR as fresh with /remove-lifecycle rotten

/close

gardener-prow[bot] commented 2 years ago

@gardener-ci-robot: Closing this issue.

In response to [this](https://github.com/gardener/machine-controller-manager/issues/798): >The Gardener project currently lacks enough active contributors to adequately respond to all issues and PRs. >This bot triages issues and PRs according to the following rules: >- After 90d of inactivity, `lifecycle/stale` is applied >- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied >- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed > >You can: >- Reopen this issue or PR with `/reopen` >- Mark this issue or PR as fresh with `/remove-lifecycle rotten` > >/close Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

unmarshall commented 2 years ago

/remove-lifecycle rotten

unmarshall commented 2 years ago

/reopen

gardener-prow[bot] commented 2 years ago

@unmarshall: Reopened this issue.

In response to [this](https://github.com/gardener/machine-controller-manager/issues/798): >/reopen Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

gardener-ci-robot commented 2 years ago

The Gardener project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close

/lifecycle stale

unmarshall commented 2 years ago

/remove-lifecycle stale

unmarshall commented 2 years ago

/assign

gardener-ci-robot commented 1 year ago

The Gardener project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close

/lifecycle stale

himanshu-kun commented 1 year ago

GKE upgrade process -> https://cloud.google.com/kubernetes-engine/docs/concepts/node-pool-upgrade-strategies

They support blue-green upgrades, but we don't. They upgrade one zone at a time, with their parameters working on only zone, while our upgrade works in parallel over all zones, but our parameters are split over zones , and current issue was opened for one such case in splitting. Both approaches have their pros and cons. Also they default the configuration to maxSurge=1, maxUnavailable=0 for their worker pool , if not provided. One solution recommended (see issue description) suggested similar.

himanshu-kun commented 1 year ago

Pros of one zone at a time:

we ensure a failing rolling update doesn't impact all the zones together (staged zone rollout)
a pod de-scheduled in one zone due to rolling update won't go to another zone, which sometimes happen in parallel rolling update
there is a clear demarcation of when to move to a new zone and roll it, either on completion of the last one or if last one didn't progress for progressDeadlineSeconds
There is no need to distribute the maxSurge/maxUnavailable amongst all zones as it will apply only to the zone for which an update is in progress.

BeckerMax commented 1 year ago

The main point is that we neglected this problem from the start. It's not only about maxSurge/maxUnavailable. It's also about minimum/maximum. Also those should better use ceil() or, even better yet, the configuration should be "the same per zone".

How maximum is currently calculated can cause unexpected node termination through adding zones. For example if you have 12 workers in zone A and a maximum of 30 then adding two more zones means that the new maximum of zone A will be 10 workers. This will lead to node termination in zone A. This is something an operator/shoot owner can easily be surprised by and can impact workload. Therefore, I vote to change that. Ideas would be to specify the maximum per zone or let the maximum be the upper limit for the sum of the nodes in all zones.

timuthy commented 1 year ago

We see some potential to improve the API wrt to the worker configuration. I opened https://github.com/gardener/gardener/issues/8142 for further discussion.

ashwani2k commented 1 year ago

:warning: I feel the https://github.com/gardener/machine-controller-manager/issues/798#issuecomment-1599149933 from @BeckerMax needs a more urgent attention and should be tracked in a separate issue to expedite its handling vs the perfect design to achieve the ask of allowing per zone configuration.

I did a simple experiment to check and found that this will lead to rolling of machines/pods in the current setup and may go unnoticed if the current max can accommodate the additional machines required/zone.

Expand to the see the experiment and the findings.

1. Created a shoot with worker pool of machine type m5.large(2cpu/8gb) with min=2, max=3, maxSurge=1 2. Create a nginx-deployment with replicas=2 and cpuRequest=1000m and deployed on the shoot. 3. Checking the running instance we can see 3 nodes and 2 pods running. ```yaml ❯ kgno NAME STATUS ROLES AGE VERSION ip-10-180-10-43.eu-west-1.compute.internal Ready 12m v1.26.5 ip-10-180-14-176.eu-west-1.compute.internal Ready 10m v1.26.5 ip-10-180-23-35.eu-west-1.compute.internal Ready 12m v1.26.5 ❯ kgp -o wide NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES nginx-deployment-85996f8dbd-5bnt8 1/1 Running 0 89s 100.64.1.9 ip-10-180-10-43.eu-west-1.compute.internal nginx-deployment-85996f8dbd-nwwb2 1/1 Running 0 89s 100.64.2.8 ip-10-180-14-176.eu-west-1.compute.internal ``` 4. Edited the worker-pool by just adding 2 additional zones without changing the min/max/surge values. We can see that old running nodes are now replaced with new nodes are created in the 2 new zones and it also leads to the pods being moved to the new nodes. ```yaml ❯ kgno NAME STATUS ROLES AGE VERSION ip-10-180-131-9.eu-west-1.compute.internal Ready 49s v1.26.5 ip-10-180-14-176.eu-west-1.compute.internal Ready 32m v1.26.5 ip-10-180-88-220.eu-west-1.compute.internal Ready 89s v1.26.5 ❯ kgp -o wide #done after I started writing this comments so its running for sometime now. NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES nginx-deployment-85996f8dbd-4m2nv 1/1 Running 0 35m 100.64.4.2 ip-10-180-131-9.eu-west-1.compute.internal nginx-deployment-85996f8dbd-g8qrt 1/1 Running 0 37m 100.64.3.2 ip-10-180-88-220.eu-west-1.compute.internal ``` 5. In this case we had capacity with max machines able to handle the changes. However, before new zones were added if we had lesser capacity we could also end up with these nodes in pending state which is bad for the end user as their running workloads are now in pending status with no scheduling possible unless they reach out to operators for help. 6. The risk becomes more pertinent when running stateful workload as that might lead to downtime for the service.

I would propose:

If possible to detect the current active machines on a singleZone cluster
Validate that when making the worker-pool multi-zonal the new min value of perZone machineDeployment for the existing zone is not < then the current running machines.
If found false then if the max allows MCM should ensure that the min of the new per zone machineDeployments is >= the running machines in the existing zones.
otherwise if it can't be achieved in an automated way then we should either fail the shoot.yaml validation or fail the shoot reconciliation to ensure this change is not applied.

We can enhance the current dashboard warning by either proposing the min values or direct him of the consequences.

wdyt?

vlerenc commented 3 months ago

FYI: Proposal https://gist.github.tools.sap/D043832/f5d0ac0e0bb138eea07c84ac79e10ce9#step-1-enhancing-zone-distribution-and-surge-management (only step 1 relevant here)

gardener / machine-controller-manager

[Extensions] Improve maxSurge/maxUnavailable handling for multi-zone worker pools #798