Open timuthy opened 1 year ago
Any opinion @gardener/mcm-maintainers?
Hi Tim, we can support this though I am doubtful whether we should make it configurable in the shoot YAML as it can possibly lead to severe degradation if operator configures a high value and then a fair amount of effort diagnosing/trouble-shooting such issues.
However, I think we can introduce a fixed degree of parallelism in evicting Pods with PVs after relevant testing of the behaviour on problematic providers like Azure. Now that we have implemented https://github.com/gardener/machine-controller-manager/issues/781, today we wait for all volumes to be detached from the Node before proceeding to VM deletion. Hence those edge cases where still attached volumes cause the attach/detach controller to go into timeouts, is ameliorated.
Thanks for the feedback @elankath. It wasn't meant to be an option for shoot owners. The degree of parallelism can also be configured by Gardenlet via its config.
That fine. A "hidden knob" like a CLI option parallel-eviction-limit=X
should be OK.
How to categorize this issue?
/area performance /kind enhancement /priority 3
What would you like to be added: MCM should provide a knob to configure the degree of parallel evictions for pods with volumes.
Why is this needed:
262 established a serial eviction of pods with volumes to make the overall node drain process faster, esp. for cloud providers where many parallel detach/attach operations lead to rate limits and huge back-offs.
On some infrastructures and to some degree, a parallel eviction for pods with volumes may lead to a beneficial performance boost. Today, shoot clusters with many nodes often need a considerable amount of time to perform rolling updates. We see this aspect being one of the root causes that can be improved.