Feature (What you would like to be added):
For the time being, we collect our findings about unstable control planes and what we need to do about them here (only a few cover HVPA directly):
- [ ] VPA has many shortcomings, e.g. it cannot deal properly with spikes and sometimes recommends requests below current usage, which is one of the main reasons we cannot scale to larger clusters, see https://github.com/gardener/autoscaler/issues/47 (a mitigation sketch follows this list)
- [ ] Once a single OOMKilled pod triggers new VPA recommendations, HVPA rolls them out to all replicas, recreating all of them and thereby terminating all connections and putting stress on the system to reinitialise (e.g. taking down ETCD)
- [ ] We use HVPA to mitigate glaring issues with VPA and also for horizontal and vertical pod autoscaling on the same metric, but possibly we should switch to request-based horizontal autoscaling (see the HPA sketch after this list) once we have improved VPA and can drop HVPA completely
- [ ] Once a large cluster control plane fails, it cannot recover by itself anymore: the components restart in a vicious cycle, and nodes need to be onboarded in a controlled way (batched/staged node onboarding, so as not to overload the recovering control plane again and again), for which standard Kubernetes provides no solution yet
- [ ] Clustered ETCD is required to make the cluster more resilient and to keep it from dying in a downward spiral when we update ETCD or something happens to the single instance we run
- [ ] CoreDNS is not stable; we see unbalanced load patterns that we must address by means of node-local DNS or better vertical pod autoscaling (as horizontal pod autoscaling is pretty much pointless here)
- [ ] Calico Typha is recommended to be used together with the cluster-proportional-autoscaler, but that's more of a community band-aid, as it scales only on the number of nodes, whatever their size/load (see the config sketch after this list), so it again boils down to a better VPA to get that problem under control
- [ ] Our monitoring/logging stacks have a fixed size (also to control costs), but while we do not want to "pay" for excessive logging, the sizing should be more reasonable and match the basic needs of the control plane and the kubelets
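
To illustrate the "requests below current usage" problem, a `minAllowed` floor in the VPA resource policy is one way to keep recommendations from dropping below a sane baseline. This is only a hedged sketch assuming the `autoscaling.k8s.io/v1` API; the target name and all values are illustrative:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: kube-apiserver-vpa   # illustrative name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kube-apiserver
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
    - containerName: kube-apiserver
      minAllowed:            # floor: recommendations can never fall below these values
        cpu: 800m
        memory: 800Mi
      maxAllowed:            # ceiling to keep runaway recommendations in check
        cpu: "4"
        memory: 16Gi
```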
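Request-based horizontal autoscaling could look roughly like the following, assuming a metrics adapter exposes a per-pod request-rate metric (the metric name `apiserver_requests_per_second` and all numbers are assumptions, not an existing setup):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: kube-apiserver-hpa   # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kube-apiserver
  minReplicas: 2
  maxReplicas: 6
  metrics:
  - type: Pods
    pods:
      metric:
        name: apiserver_requests_per_second   # assumed custom metric, served by a metrics adapter
      target:
        type: AverageValue
        averageValue: "400"                   # scale out above ~400 req/s per replica (made-up threshold)
```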
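For reference, the cluster-proportional-autoscaler is driven by a ConfigMap like the one below (values illustrative). Note that `coresPerReplica`/`nodesPerReplica` only count cores and nodes, never actual load, which is why this remains a band-aid:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: calico-typha-horizontal-autoscaler   # illustrative name
  namespace: kube-system
data:
  linear: |-
    {
      "coresPerReplica": 256,
      "nodesPerReplica": 16,
      "min": 1,
      "max": 20
    }
```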
Motivation (Why is this needed?):
A stable control plane, even when spikes or load tests stress it
Please plan for a public blog article that describes this unique feature of Gardener at a high level, so that it can be used to attract interest and establish thought leadership (internally as well as externally)