Open · hardikdr opened this issue 4 years ago
cc @vlerenc @amshuman-kr @rfranzke @prashanth26
Using pause/unpause is one option. There could be others (e.g. slowing down the machine rollout by dynamically tweaking the various timeout configurations).
But I think the more important part is to have a decoupling between the component that detects the issue and the component that takes corrective action. Having the same component both detect and take action has two limitations.
I think it is better to decouple the issue detection and action taking into separate components (as I mentioned in the yet-to-be-reopened https://github.com/gardener/gardener/pull/1797).
We might want to slow down/pause machine rollouts for more reasons than just evicted pods taking too long to come up elsewhere. Also, someone other than MCM (gardenlet for starters) might be interested in the fact that there are infra issues in a shoot/seed.
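For illustration, a minimal sketch of what such a decoupling could look like, with the detector and the actor as separate components that only share a small "finding" type. All names below are illustrative and not existing Gardener/MCM APIs.

```go
package rolloutcontrol // hypothetical package, for illustration only

import "context"

// InfraIssue is an illustrative "finding" type shared between the two
// components; it is not an existing Gardener/MCM API.
type InfraIssue struct {
	Kind    string // e.g. "SlowVolumeDetach", "EvictedPodsNotComingUp"
	Message string
}

// Detector only observes and reports issues; it never takes action.
type Detector interface {
	Detect(ctx context.Context) ([]InfraIssue, error)
}

// Actor decides what to do about reported issues (pause a rollout, tweak
// timeouts, raise an alert) and could live in a different component than
// the detector (MCM, gardenlet, an extension, ...).
type Actor interface {
	Act(ctx context.Context, issues []InfraIssue) error
}
```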
Thanks @hardikdr @amshuman-kr. As a quick remedy, would it make sense to already change the behaviour (from a fixed timeout) and check that at least the volume was detached before going on? Then again, @amshuman-kr haven't you reported (somewhere else) that you saw erratic behaviour, so it wasn't possible to even do that properly, right?
as I mentioned in the yet-to-be-reopened #1797
@amshuman-kr the link is not clickable, can you please point to the correct one?
Also, overall it sounds good to have separate components for detection and for taking action. Can you please elaborate a little more on where you think they could be hosted later?
@hardikdr I fixed the link above.
As a quick remedy, would it make sense to already change the behaviour (from a fixed timeout) and check that at least the volume was detached before going on? Then again, @amshuman-kr haven't you reported (somewhere else), that you saw erratic behaviour, so it wasn't possible to even do that properly, right?
@vlerenc Please bear with me if I have already said this before. But there are two parts to the eviction and waiting:
1. Waiting for the evicted pods to terminate, respecting the terminationGracePeriodSeconds as specified in the pod itself.
2. Waiting for the volumes to be detached, based on node.status.volumesAttached.
The erratic behaviour was that entries in node.status.volumesAttached re-appear a few times before going away for good. Perhaps this is a race condition in the KCM/kubelet. If so, we could think about contributing a fix. Or worse, it might be the behaviour of the infra API itself. Then there is no point in contributing, and we can think of doing a few retries at fixed intervals to account for this behaviour. But for a small number of disks per node (say 10-15?), I would expect drain to happen normally during a machine rollout already today, since there is a check on node.status.volumesAttached already.
@hardikdr @prashanth26 Apart from this, are there any conflicts in timeout values for drain and machine rollout?
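For reference, a minimal sketch (not MCM's actual drain code) of what such a check on node.status.volumesAttached boils down to, assuming a client-go clientset and the node name are at hand:

```go
package drainutil // hypothetical package, for illustration only

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// volumesStillAttached reports whether the node still lists any volumes in
// node.status.volumesAttached. This is only a simplified illustration of the
// kind of check referred to above, not MCM's drain implementation.
func volumesStillAttached(ctx context.Context, client kubernetes.Interface, nodeName string) (bool, error) {
	node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return false, fmt.Errorf("getting node %s: %w", nodeName, err)
	}
	return len(node.Status.VolumesAttached) > 0, nil
}
```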
@amshuman-kr All good, that's what I meant. I wrote my question before the out-of-band sync, I believe. As for the reason for this erratic behaviour: I tend to believe it's not an IaaS issue (that would surprise me), but an update problem/race condition in K8s. Anyhow, during the out-of-band sync we discussed whether we can consider a volume detached only if the detached status was observed for a given length of time (e.g. 5s or whatever you have observed as "stable").
Would it then make sense to include this check earlier? As for a low number of volumes, we have seen that Azure does not cope well/is very slow even with just a few volumes. We run into this practically whenever we touch an Azure cluster. Then again, it depends on how often/when we do that.
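To make the "observed as detached for a given length of time" idea concrete, here is a rough sketch building on the volumesStillAttached helper above. The window, interval and timeout values are illustrative, and this is not how MCM currently implements the check.

```go
// waitForStableDetach polls node.status.volumesAttached and only reports
// success once the node has shown zero attached volumes continuously for the
// whole stability window (e.g. 5s), to guard against entries that flap back
// in before going away for good.
// Requires "time" in addition to the imports of the previous sketch.
func waitForStableDetach(ctx context.Context, client kubernetes.Interface, nodeName string, window, interval, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	var detachedSince time.Time // zero value means "currently observed as attached"

	for time.Now().Before(deadline) {
		attached, err := volumesStillAttached(ctx, client, nodeName)
		if err != nil {
			return err
		}
		switch {
		case attached:
			detachedSince = time.Time{} // volumes re-appeared: reset the window
		case detachedSince.IsZero():
			detachedSince = time.Now() // first observation of "detached"
		case time.Since(detachedSince) >= window:
			return nil // detached and stable for the whole window
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("volumes on node %s not stably detached within %s", nodeName, timeout)
}
```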
/priority critical
See also https://github.com/gardener/gardener/issues/87 for a similar problem on Gardener level.
We should develop something like a scoring function that tells us whether we have destabilised a landscape or shoot with an update or not (beyond a certain threshold). Threshold, because we cannot expect all shoots/pods to come up again (what this means is not even clear), so there needs to be some fuzziness. For example: "the number of pods that are not running (pending, crash-looping, container-creating, etc.) hasn't gone up by more than 30% of the number of pods that were drained from the affected nodes". Only then do we continue; otherwise we brace ourselves (pause) and watch whether the situation improves.
It's more complicated than that (pods are not guaranteed to come up in the same number or may fail for many other reasons), but I believe we need to develop this notion/formula. It's also what we as human operators do. We don't expect all shoots/pods to be running again after an update, but we watch the tendency and, in case of issues, we then intervene, pause, and analyse what's going on. Automation can't analyse, but it can pause and alert.
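To make the idea a bit more tangible, a rough sketch of such a scoring check; the 30% threshold and the way the counts would be obtained are purely illustrative, not an existing MCM/gardenlet API:

```go
// rolloutDestabilised encodes the fuzzy rule described above: the rollout is
// considered destabilised if the number of not-running pods (pending,
// crash-looping, container-creating, ...) has grown by more than the given
// fraction of the pods that were drained from the affected nodes.
func rolloutDestabilised(notRunningBefore, notRunningAfter, podsDrained int, threshold float64) bool {
	if podsDrained == 0 {
		return false // nothing was drained, so there is nothing to judge
	}
	increase := notRunningAfter - notRunningBefore
	return float64(increase) > threshold*float64(podsDrained)
}
```

With threshold = 0.3, the rollout would be paused (and an alert raised) as soon as the increase in not-running pods exceeds 30% of the drained pods; otherwise it continues.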
Just cross-linking https://github.com/gardener/gardener/issues/87#issuecomment-728741920 here. It is just a limited point about how to propagate the information calculated by MCM because it might be relevant at higher levels (extensions, gardenlet).
Portions of this have been addressed through https://github.com/gardener/machine-controller-manager/issues/561, but we still need to deal with scenarios where we end up with a lot of unschedulable pods during a rollout and would wish to dynamically slow down the drain when we notice such a case.
What would you like to be added:
During rolling updates, we decrease the replicas of the old machine-set and gradually increase the ones in the new machine-set, respecting maxSurge/maxUnavailable. Currently, when the newer machines join, we don't wait till either of the following happens:
This behavior can pose certain issues for infrastructures with slow volume-detachments, where in-flight workload may pile up with time. Considerable issues could be:
Probable solution:
A probable solution to tackle such a situation could be to control the flow of machines being drained. MCM already supports a pause field, which allows us to pause an on-going rolling update. We could think of a small sub-controller in MCM which does the following:
- Pause the machine-deployment if the count goes beyond a certain configurable number.
- Unpause the machine-deployment when the count goes below the threshold number.
Open for further discussions around possible ideas.
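As a starting point for that discussion, a minimal sketch of such a sub-controller loop. The rolloutPauser interface, its method names, and the use of unschedulable pods as "the count" are assumptions for illustration, not existing MCM APIs; only the pause field on the MachineDeployment is taken from the description above.

```go
package rolloutcontrol // hypothetical package, for illustration only

import (
	"context"
	"time"
)

// rolloutPauser abstracts the two operations the proposed sub-controller
// needs; a real implementation would count unschedulable pods via the
// Kubernetes client and toggle the MachineDeployment's pause field via the
// MCM client. Both method names are hypothetical.
type rolloutPauser interface {
	UnschedulablePodCount(ctx context.Context) (int, error)
	SetPaused(ctx context.Context, machineDeployment string, paused bool) error
}

// reconcilePause pauses the machine-deployment while the count stays above
// the configurable threshold and unpauses it once the count drops back below.
func reconcilePause(ctx context.Context, p rolloutPauser, machineDeployment string, threshold int, interval time.Duration) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			count, err := p.UnschedulablePodCount(ctx)
			if err != nil {
				return err
			}
			if err := p.SetPaused(ctx, machineDeployment, count > threshold); err != nil {
				return err
			}
		}
	}
}
```

Whether "the count" should be unschedulable pods, pods evicted but not yet running, or something derived from the scoring function discussed above is exactly the kind of detail that remains open for discussion.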