Open · hardikdr opened this issue 4 years ago
cc @vlerenc @amshuman-kr @rfranzke @prashanth26
Using pause/unpause is one option. There could be others (e.g. slowing down the machine rollout by dynamically tweaking the various timeout configurations).
But I think the more important part is to have a decoupling between the component that detects the issue and the component that takes corrective action. Having the same component both detect and take action has two limitations.
I think it is better to decouple the issue detection and action taking into separate components (as I mentioned in the yet-to-be-reopened https://github.com/gardener/gardener/pull/1797).
We might want to slow down/pause machine rollouts for more reasons than just evicted pods taking too long to come up elsewhere. Also, someone other than MCM (gardenlet for starters) might be interested in the fact that there are infra issues in a shoot/seed.
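For illustration, a minimal sketch of what such a decoupling could look like, with the detector and the actor as separate components that only share a small "finding" type. All names below are illustrative and not existing Gardener/MCM APIs.

```go
package rolloutcontrol // hypothetical package, for illustration only

import "context"

// InfraIssue is an illustrative "finding" type shared between the two
// components; it is not an existing Gardener/MCM API.
type InfraIssue struct {
	Kind    string // e.g. "SlowVolumeDetach", "EvictedPodsNotComingUp"
	Message string
}

// Detector only observes and reports issues; it never takes action.
type Detector interface {
	Detect(ctx context.Context) ([]InfraIssue, error)
}

// Actor decides what to do about reported issues (pause a rollout, tweak
// timeouts, raise an alert) and could live in a different component than
// the detector (MCM, gardenlet, an extension, ...).
type Actor interface {
	Act(ctx context.Context, issues []InfraIssue) error
}
```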
Thanks @hardikdr @amshuman-kr. As a quick remedy, would it make sense to already change the behaviour (from a fixed timeout) and check that at least the volume was detached before going on? Then again, @amshuman-kr haven't you reported (somewhere else) that you saw erratic behaviour, so it wasn't possible to even do that properly, right?
as I mentioned in the yet-to-be-reopened #1797
@amshuman-kr the link is not clickable, can you please point to the correct one?
Also, overall it sounds good to have separate components for detection and for taking action. Can you please elaborate a little more on where you think they could be hosted later?
@hardikdr I fixed the link above.
As a quick remedy, would it make sense to already change the behaviour (from a fixed timeout) and check that at least the volume was detached before going on? Then again, @amshuman-kr haven't you reported (somewhere else), that you saw erratic behaviour, so it wasn't possible to even do that properly, right?
@vlerenc Please bear with me if I have already said this before. But there are two parts to the eviction and waiting:
1. Waiting for the evicted pods to terminate, respecting the terminationGracePeriodSeconds as specified in the pod itself.
2. Waiting for the volumes to be detached, based on node.status.volumesAttached.
The erratic behaviour was that entries in node.status.volumesAttached re-appear a few times before going away for good. Perhaps this is a race condition in the KCM/kubelet. If so, we could think about contributing a fix. Or worse, it might be the behaviour of the infra API itself. Then there is no point in contributing, and we can think of doing a few retries at fixed intervals to account for this behaviour. But for a small number of disks per node (say 10-15?), I would expect drain to happen normally during a machine rollout already today, since there is a check on node.status.volumesAttached already.
@hardikdr @prashanth26 Apart from this, are there any conflicts in timeout values for drain and machine rollout?
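For reference, a minimal sketch (not MCM's actual drain code) of what such a check on node.status.volumesAttached boils down to, assuming a client-go clientset and the node name are at hand:

```go
package drainutil // hypothetical package, for illustration only

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// volumesStillAttached reports whether the node still lists any volumes in
// node.status.volumesAttached. This is only a simplified illustration of the
// kind of check referred to above, not MCM's drain implementation.
func volumesStillAttached(ctx context.Context, client kubernetes.Interface, nodeName string) (bool, error) {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, fmt.Errorf("getting node %s: %w", nodeName, err)
	}
	return len(node.Status.VolumesAttached) > 0, nil
}
```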
@amshuman-kr All good, that's what I meant. I wrote my question before the out-of-band sync, I believe. As for the reason for this erratic behaviour: I tend to believe it's not an IaaS issue (that would surprise me), but an update problem/race condition in K8s. Anyhow, during the out-of-band sync we discussed whether we can consider a volume detached only if the detached status was observed for a given length of time (e.g. 5s or whatever you have observed as "stable").
Would it then make sense to include this check earlier? As for a low number of volumes, we have seen that Azure does not cope well/is very slow even with just a few volumes. We run into this practically whenever we touch an Azure cluster. Then again, it depends on how often/when we do that.
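To make the "observed as detached for a given length of time" idea concrete, here is a rough sketch building on the volumesStillAttached helper above. The window, interval and timeout values are illustrative, and this is not how MCM currently implements the check.

```go
// waitForStableDetach polls node.status.volumesAttached and only reports
// success once the node has shown zero attached volumes continuously for the
// whole stability window (e.g. 5s), to guard against entries that flap back
// in before going away for good.
// Requires "time" in addition to the imports of the previous sketch.
func waitForStableDetach(ctx context.Context, client kubernetes.Interface, nodeName string, window, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	var detachedSince time.Time // zero value means "currently observed as attached"

	for time.Now().Before(deadline) {
		attached, err := volumesStillAttached(ctx, client, nodeName)
		if err != nil {
			return err
		}
		switch {
		case attached:
			detachedSince = time.Time{} // volumes re-appeared: reset the window
		case detachedSince.IsZero():
			detachedSince = time.Now() // first observation of "detached"
		case time.Since(detachedSince) >= window:
			return nil // detached and stable for the whole window
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("volumes on node %s not stably detached within %s", nodeName, timeout)
}
```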
/priority critical
See also https://github.com/gardener/gardener/issues/87 for a similar problem on Gardener level.
We should develop something like a scoring function that tells us whether we have destabilised a landscape or shoot with an update or not (beyond a certain threshold). Threshold, because we cannot expect all shoots/pods to come up again (what this means is not even clear), so there needs to be some fuzziness. For example: "the number of pods that are not running (pending, crash-looping, container-creating, etc.) hasn't gone up by more than 30% of the number of pods that were drained from the affected nodes". Only then do we continue; otherwise we brace ourselves (pause) and watch whether the situation improves.
It's more complicated than that (pods are not guaranteed to come up in the same number or may fail for many other reasons), but I believe we need to develop this notion/formula. It's also what we as human operators do. We don't expect all shoots/pods to be running again after an update, but we watch the tendency and, in case of issues, we then intervene, pause, and analyse what's going on. Automation can't analyse, but it can pause and alert.
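To make the idea a bit more tangible, a rough sketch of such a scoring check; the 30% threshold and the way the counts would be obtained are purely illustrative, not an existing MCM/gardenlet API:

```go
// rolloutDestabilised encodes the fuzzy rule described above: the rollout is
// considered destabilised if the number of not-running pods (pending,
// crash-looping, container-creating, ...) has grown by more than the given
// fraction of the pods that were drained from the affected nodes.
func rolloutDestabilised(notRunningBefore, notRunningAfter, podsDrained int, threshold float64) bool {
	if podsDrained == 0 {
		return false // nothing was drained, so there is nothing to judge
	}
	increase := notRunningAfter - notRunningBefore
	return float64(increase) > threshold*float64(podsDrained)
}
```

With threshold = 0.3, the rollout would be paused (and an alert raised) as soon as the increase in not-running pods exceeds 30% of the drained pods; otherwise it continues.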
Just cross-linking https://github.com/gardener/gardener/issues/87#issuecomment-728741920 here. It is just a limited point about how to propagate the information calculated by MCM because it might be relevant at higher levels (extensions, gardenlet).
Portions of this have been addressed through https://github.com/gardener/machine-controller-manager/issues/561, but we still need to deal with scenarios where we end up with a lot of unschedulable pods during a rollout and would wish to dynamically slow down the drain when we notice such a case.
What would you like to be added:
During rolling updates, we decrease the replicas of the old machine-set and gradually increase the ones in the new machine-set, respecting maxSurge/maxUnavailable. Currently, when the newer machines join, we don't wait till either of the following happens:
This behavior can pose certain issues for infrastructures with slow volume-detachments, where in-flight workload may pile up with time. Considerable issues could be:
Probable solution:
A probable solution to tackle such a situation could be to control the flow of machines being drained. MCM already supports a pause field, which allows us to pause an on-going rolling update. We could think of a small sub-controller in MCM which does the following:
- Pause the machine-deployment if the count goes beyond a certain configurable number.
- Unpause the machine-deployment when the count goes below the threshold number.
Open for further discussions around possible ideas.
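As a starting point for that discussion, a minimal sketch of such a sub-controller loop. The rolloutPauser interface, its method names, and the use of unschedulable pods as "the count" are assumptions for illustration, not existing MCM APIs; only the pause field on the MachineDeployment is taken from the description above.

```go
package rolloutcontrol // hypothetical package, for illustration only

import (
	"context"
	"time"
)

// rolloutPauser abstracts the two operations the proposed sub-controller
// needs; a real implementation would count unschedulable pods via the
// Kubernetes client and toggle the MachineDeployment's pause field via the
// MCM client. Both method names are hypothetical.
type rolloutPauser interface {
	UnschedulablePodCount(ctx context.Context) (int, error)
	SetPaused(ctx context.Context, machineDeployment string, paused bool) error
}

// reconcilePause pauses the machine-deployment while the count stays above
// the configurable threshold and unpauses it once the count drops back below.
func reconcilePause(ctx context.Context, p rolloutPauser, machineDeployment string, threshold int, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			count, err := p.UnschedulablePodCount(ctx)
			if err != nil {
				return err
			}
			if err := p.SetPaused(ctx, machineDeployment, count > threshold); err != nil {
				return err
			}
		}
	}
}
```

Whether "the count" should be unschedulable pods, pods evicted but not yet running, or something derived from the scoring function discussed above is exactly the kind of detail that remains open for discussion.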