Open shreyas-s-rao opened 1 year ago
There is an issue when using OnDelete which should be kept in mind: https://github.com/kubernetes/kubernetes/issues/73492
The quorum-loss scenario described in the issue can also be avoided by setting spec.updateStrategy.rollingUpdate.maxUnavailable to 1, but this field is only available behind a kube-apiserver feature gate and is still an alpha feature. I was just wondering: can we use this feature until etcd-druid moves to the OnDelete strategy?
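For reference, a minimal sketch of what setting this field would look like with the Kubernetes Go API types, assuming the alpha MaxUnavailableStatefulSet feature gate is enabled on the kube-apiserver (the helper name is illustrative, not druid code):

```go
package example

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// withMaxUnavailable configures the statefulset so that a rolling update never
// takes down more than one pod at a time. The MaxUnavailable field is only
// honoured when the alpha MaxUnavailableStatefulSet feature gate is enabled.
func withMaxUnavailable(sts *appsv1.StatefulSet) {
	maxUnavailable := intstr.FromInt(1)
	sts.Spec.UpdateStrategy = appsv1.StatefulSetUpdateStrategy{
		Type: appsv1.RollingUpdateStatefulSetStrategyType,
		RollingUpdate: &appsv1.RollingUpdateStatefulSetStrategy{
			MaxUnavailable: &maxUnavailable,
		},
	}
}
```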
We decided that, as it's an alpha feature, we won't be using it and will move directly to the OnDelete strategy.
Rough Discussion notes: On-Delete strategy for stateful set.md
After an offline meeting with @ashwani2k, @ishan16696, and @renormalize, we discussed three scenarios related to the OnDelete strategy:
1. If we set cluster-autoscaler.kubernetes.io/safe-to-evict: "false", consider a scenario where the Vertical Pod Autoscaler (VPA) evicts a pod as a voluntary disruption. At the same time, there may be no unhealthy pods, compelling us to select a candidate and trigger a pod deletion to apply the latest updates. This could lead to transient quorum loss, because Pod Disruption Budgets (PDBs) are not respected by the direct deletion calls made by the OnDelete pod updater.
2. Suppose etcd reconciliation and node reconciliation (a rolling update of a node) occur concurrently, and a node drain attempts to evict a pod. At the same time, the OnDelete pod updater might also select this pod as its candidate and trigger a deletion. This can cause transient quorum loss, because PDBs are not respected by the deletion calls from the OnDelete pod updater.
3. Consider a scenario where, during etcd reconciliation, we select a candidate and trigger a pod deletion. At the same time, due to high resource utilization by an already running pod, the kubelet may initiate a node-pressure eviction, an involuntary disruption that does not respect PDBs.
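For context, the PDB these scenarios refer to budgets disruptions around the etcd quorum. A rough sketch with the Kubernetes Go API types and illustrative values (not the exact PDB druid generates) could look like this:

```go
package example

import (
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// quorumPDB is a PodDisruptionBudget for a 3-member etcd cluster: at least two
// members (the quorum) must remain available, so the eviction API rejects any
// voluntary disruption that would drop the cluster below quorum. Direct pod
// deletions and involuntary disruptions bypass this budget entirely, which is
// what the scenarios above are about.
func quorumPDB(namespace string) *policyv1.PodDisruptionBudget {
	minAvailable := intstr.FromInt(2)
	return &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "etcd-main", Namespace: namespace},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			// Illustrative selector; the real one depends on how the etcd pods are labelled.
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"instance": "etcd-main"},
			},
		},
	}
}
```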
From our brainstorming session, we concluded that we should use the eviction API whenever we delete a healthy pod, since that is a voluntary disruption. This way, the PDB can block our eviction if a simultaneous disruption would otherwise cost the cluster its quorum.
- Evict healthy pods: use the eviction API to manage healthy pods, respecting PDBs and preventing disruptions.
- Delete unhealthy pods: directly delete unhealthy pods when necessary, as they are not necessarily protected by PDBs.
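A minimal sketch of that rule, assuming client-go is used directly (the helper and its signature are illustrative, not the actual druid implementation):

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// removePodForUpdate removes a pod so that the statefulset controller (running
// with the OnDelete strategy) recreates it with the updated pod template.
func removePodForUpdate(ctx context.Context, cs kubernetes.Interface, pod *corev1.Pod, healthy bool) error {
	if healthy {
		// Voluntary disruption: go through the eviction API so the PDB can
		// reject the request if it would leave the cluster without quorum.
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		return cs.PolicyV1().Evictions(pod.Namespace).Evict(ctx, eviction)
	}
	// Unhealthy pod: delete directly, as it is not necessarily protected by the PDB.
	return cs.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
}
```

An eviction that would violate the PDB is rejected by the API server, so the updater can simply retry in a later reconciliation.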
By adopting this strategy, we safeguard against scenarios 1 and 2 and, to some extent, can also prevent scenario 3 if node-pressure eviction occurs before our deletion process begins. This method ensures greater stability and reliability in managing our Kubernetes resources.
For a further understanding of voluntary and involuntary disruptions, see the Kubernetes documentation on pod disruptions.
If PR #855 is merged, we can fully utilize eviction for managing pods.
Feature (What you would like to be added): Druid-controlled updates to the pods in the etcd cluster.
Motivation (Why is this needed?): Currently, druid deploys the etcd cluster as a statefulset with the number of replicas set to the desired number of members in the etcd cluster. The spec.updateStrategy of this statefulset is set to RollingUpdate, which allows the statefulset controller to roll the etcd pods one after the other in a rolling fashion. The order of updating each pod is deterministic - from the largest ordinal to the smallest, as per the documentation. This works fine for a perfectly healthy etcd cluster, but poses a risk for a multi-node etcd cluster with an unhealthy pod.

Consider a 3-member etcd cluster etcd-main, with pods etcd-main-0, etcd-main-1 and etcd-main-2 running, in the pink of health. At this point, if etcd-main-0 (or etcd-main-1) becomes unhealthy for any of several reasons (network connectivity issues, zone outages, node failure, or simply an etcd issue which might be resolvable by restarting the pod), then the etcd cluster still maintains quorum with the other two healthy members, but is now only one step away from losing quorum. What happens now if there's an update to the etcd statefulset spec, like a change in the etcd-backup-restore image version or a configuration change to the etcd or etcdbrctl processes? The statefulset controller starts rolling the pods, beginning with etcd-main-2. As soon as it deletes this pod to make room for the updated pod, the cluster loses quorum. This leads to a downtime of the etcd, subsequently causing a downtime of the kube-apiserver which the etcd is backing, until the updated etcd-main-2 pod comes back up. This is an artificially introduced quorum loss scenario, which can be entirely avoided if druid takes control of the order in which the etcd pods are updated.
Approach/Hint to implement the solution (optional): Setting the etcd statefulset's spec.updateStrategy to OnDelete essentially disables automatic rollouts of pods upon statefulset spec updates, and instead tells the statefulset controller to wait until a pod is deleted before recreating it with the updated pod spec. This gives druid the freedom to check which pods are healthy and which are not, and to make a careful decision on the order in which the pods are updated. In the above case where etcd-main-0 became unhealthy, druid can first update etcd-main-0 to ensure that quorum is still maintained by the other two members. The pod spec update can potentially fix any problem with etcd-main-0, such as an internal error, or reschedule it to a different node which might not be suffering from the same network connectivity issues. Druid can then proceed with updating the rest of the etcd pods. In essence, this method reduces the likelihood of an artificially induced quorum loss caused by a badly ordered update of the etcd pods in the cluster.

Changing the updateStrategy of the statefulset to OnDelete is also beneficial in the case of rolling the volumes backing the etcd pods, as explained by @unmarshall in https://github.com/gardener/gardener-extension-provider-aws/issues/646 and further discussed in https://github.com/gardener/etcd-druid/issues/481.
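To make the approach concrete, here is a rough sketch of the two pieces it implies: switching the statefulset to OnDelete, and picking an update order that handles unhealthy members first. Function names and the health check are illustrative, not the actual druid implementation:

```go
package example

import (
	"sort"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// withOnDelete stops the statefulset controller from rolling pods on its own;
// a pod only picks up the new spec after druid deletes it.
func withOnDelete(sts *appsv1.StatefulSet) {
	sts.Spec.UpdateStrategy = appsv1.StatefulSetUpdateStrategy{
		Type: appsv1.OnDeleteStatefulSetStrategyType,
	}
}

// updateOrder returns the members in the order druid would roll them:
// unhealthy pods first (recreating them cannot reduce the number of healthy
// members), healthy pods afterwards, one at a time.
func updateOrder(pods []corev1.Pod, isHealthy func(corev1.Pod) bool) []corev1.Pod {
	ordered := append([]corev1.Pod(nil), pods...)
	sort.SliceStable(ordered, func(i, j int) bool {
		return !isHealthy(ordered[i]) && isHealthy(ordered[j])
	})
	return ordered
}
```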