gardener / etcd-druid

An etcd operator to configure, provision, reconcile and monitor etcd clusters.

[Feature] Druid-controlled updates to the pods in the etcd cluster #588

Open shreyas-s-rao opened 1 year ago

shreyas-s-rao commented 1 year ago

Feature (What you would like to be added): Druid-controlled updates to the pods in the etcd cluster.

Motivation (Why is this needed?): Currently, druid deploys the etcd cluster as a statefulset, with the number of replicas set to the desired number of members in the etcd cluster. The spec.updateStrategy of this statefulset is set to RollingUpdate, which lets the statefulset controller roll the etcd pods one after the other. The order in which the pods are updated is deterministic: from the largest ordinal to the smallest, as per the documentation. This works fine for a perfectly healthy etcd cluster, but poses a risk for a multi-node etcd cluster with an unhealthy pod.
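For reference, a minimal Go sketch of the update strategy druid currently sets on the etcd statefulset (the function name and field values are illustrative, not druid's actual code):

```go
package main

import (
	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/utils/ptr"
)

// etcdStatefulSet sketches the relevant part of the StatefulSet spec: with
// RollingUpdate, the statefulset controller rolls pods from the highest
// ordinal down to the lowest whenever spec.template changes.
func etcdStatefulSet() *appsv1.StatefulSet {
	return &appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "etcd-main"},
		Spec: appsv1.StatefulSetSpec{
			Replicas: ptr.To[int32](3),
			UpdateStrategy: appsv1.StatefulSetUpdateStrategy{
				Type: appsv1.RollingUpdateStatefulSetStrategyType,
			},
			// Selector, Template and VolumeClaimTemplates are elided here.
		},
	}
}
```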

Consider a healthy 3-member etcd cluster etcd-main, with pods etcd-main-0, etcd-main-1 and etcd-main-2 running. At this point, if etcd-main-0 (or etcd-main-1) becomes unhealthy for any of several reasons (network connectivity issues, a zone outage, a node failure, or simply an etcd issue that might be resolvable by restarting the pod), then the etcd cluster still maintains quorum with the other two healthy members, but is now only one step away from losing quorum. What happens if there is now an update to the etcd statefulset spec, such as a change to the etcd-backup-restore image version or a configuration change to the etcd or etcdbrctl processes? The statefulset controller starts rolling the pods, beginning with etcd-main-2. As soon as it deletes this pod to make room for the updated pod, the cluster loses quorum. This leads to downtime for etcd and, consequently, for the kube-apiserver that it backs, until the updated etcd-main-2 pod comes back up.

This is an artificially introduced quorum loss scenario, which can be entirely avoided if druid takes control of the order in which the etcd pods are updated.

Approach/Hint to implement the solution (optional):

Setting the etcd statefulset's spec.updateStrategy to OnDelete essentially disables automatic rollouts of pods upon statefulset spec updates, and instead tells the statefulset controller to wait until a pod is deleted before recreating it with the updated pod spec. This gives druid the freedom to check which pods are healthy and which are not, and to make a careful decision on the order in which the pods are updated. In the above case where etcd-main-0 became unhealthy, druid can update etcd-main-0 first, ensuring that quorum is still maintained by the other two members. The pod update can potentially fix the problem with etcd-main-0, for example by clearing an internal error or by rescheduling it to a different node which might not be suffering from the same network connectivity issues. Druid can then proceed with updating the rest of the etcd pods. In essence, this method reduces the likelihood of an artificially induced quorum loss caused by a badly ordered update of the etcd pods in the cluster.
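A minimal sketch of what such a druid-controlled rollout could look like, assuming spec.updateStrategy is already set to OnDelete and a controller-runtime client is available (all function names here are hypothetical, not existing druid code): order the pods unhealthy-first, delete one at a time, and wait for the recreated pod to become Ready before moving on.

```go
package main

import (
	"context"
	"sort"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// updatePodsInOrder deletes pods that are not yet on the statefulset's latest
// revision, one at a time, unhealthy members first. With OnDelete, the
// statefulset controller recreates each deleted pod with the updated spec.
func updatePodsInOrder(ctx context.Context, c client.Client, sts *appsv1.StatefulSet, pods []corev1.Pod) error {
	// Unhealthy members first: updating them cannot reduce quorum further,
	// and a restart or reschedule may bring them back.
	sort.SliceStable(pods, func(i, j int) bool {
		return !isReady(pods[i]) && isReady(pods[j])
	})

	for i := range pods {
		pod := pods[i]
		if pod.Labels[appsv1.ControllerRevisionHashLabelKey] == sts.Status.UpdateRevision {
			continue // already running the updated spec
		}
		if err := c.Delete(ctx, &pod); err != nil {
			return err
		}
		// Only proceed once the recreated pod is Ready again, so that at most
		// one member is voluntarily taken down at any point in time.
		if err := waitForPodReady(ctx, c, pod.Namespace, pod.Name); err != nil {
			return err
		}
	}
	return nil
}

// isReady reports whether the pod's Ready condition is True.
func isReady(p corev1.Pod) bool {
	for _, cond := range p.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}

// waitForPodReady polls until the named pod reports Ready or the timeout hits.
func waitForPodReady(ctx context.Context, c client.Client, namespace, name string) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			var p corev1.Pod
			if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, &p); err != nil {
				// The pod may not have been recreated yet; keep polling.
				return false, client.IgnoreNotFound(err)
			}
			return isReady(p), nil
		})
}
```

Note that a plain Delete call like the one above bypasses PDBs, which is exactly the concern raised later in this thread; the eviction-vs-deletion refinement is sketched further below.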

Changing the updateStrategy of the statefulset to OnDelete is also beneficial in the case of rolling the volumes backing the etcd pods, as explained by @unmarshall in https://github.com/gardener/gardener-extension-provider-aws/issues/646 and further discussed in https://github.com/gardener/etcd-druid/issues/481.

Note: components such as VPA or HVPA, which currently update the statefulset spec directly with new container resource recommendations, will need special accommodation in the new approach. One possibility is to add a new predicate for the etcd controller so that it also reacts to changes in the statefulset's spec.template.spec.containers[*].resources field and triggers reconciliations accordingly, ensuring that the underlying pods are updated with the new resource recommendations from VPA/HVPA.
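One possible shape for such a predicate, sketched with controller-runtime (the function names are hypothetical and only illustrate the idea of reacting to resource changes on the watched statefulset):

```go
package main

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	apiequality "k8s.io/apimachinery/pkg/api/equality"
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// resourcesChangedPredicate triggers reconciliation when the container
// resource requirements on the StatefulSet change, e.g. because VPA/HVPA
// wrote new recommendations directly into the statefulset spec.
func resourcesChangedPredicate() predicate.Predicate {
	return predicate.Funcs{
		UpdateFunc: func(e event.UpdateEvent) bool {
			oldSts, okOld := e.ObjectOld.(*appsv1.StatefulSet)
			newSts, okNew := e.ObjectNew.(*appsv1.StatefulSet)
			if !okOld || !okNew {
				return false
			}
			return !apiequality.Semantic.DeepEqual(
				containerResources(oldSts), containerResources(newSts))
		},
	}
}

// containerResources maps container name to its resource requirements, i.e.
// spec.template.spec.containers[*].resources.
func containerResources(sts *appsv1.StatefulSet) map[string]corev1.ResourceRequirements {
	m := map[string]corev1.ResourceRequirements{}
	for _, c := range sts.Spec.Template.Spec.Containers {
		m[c.Name] = c.Resources
	}
	return m
}
```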

unmarshall commented 7 months ago

There is an issue to keep in mind when using OnDelete: https://github.com/kubernetes/kubernetes/issues/73492

ishan16696 commented 1 month ago

> Consider a healthy 3-member etcd cluster etcd-main, with pods etcd-main-0, etcd-main-1 and etcd-main-2 running. At this point, if etcd-main-0 (or etcd-main-1) becomes unhealthy for any of several reasons (network connectivity issues, a zone outage, a node failure, or simply an etcd issue that might be resolvable by restarting the pod), then the etcd cluster still maintains quorum with the other two healthy members, but is now only one step away from losing quorum. What happens if there is now an update to the etcd statefulset spec, such as a change to the etcd-backup-restore image version or a configuration change to the etcd or etcdbrctl processes? The statefulset controller starts rolling the pods, beginning with etcd-main-2. As soon as it deletes this pod to make room for the updated pod, the cluster loses quorum. This leads to downtime for etcd and, consequently, for the kube-apiserver that it backs, until the updated etcd-main-2 pod comes back up.

This scenario can also be avoided by setting spec.updateStrategy.rollingUpdate.maxUnavailable to 1, but this field is only available behind an api-server feature gate and is still an alpha feature. I was just wondering: can we use this feature until etcd-druid moves to the OnDelete strategy?
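For reference, a sketch of the field in question using the apps/v1 Go types (the maxUnavailable value of 1 is the one suggested above; the field is only honored when the alpha MaxUnavailableStatefulSet feature gate is enabled on the control plane):

```go
package main

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/utils/ptr"
)

// rollingUpdateWithMaxUnavailable sketches the strategy discussed above.
// spec.updateStrategy.rollingUpdate.maxUnavailable is an alpha field and is
// only honored when the MaxUnavailableStatefulSet feature gate is enabled.
func rollingUpdateWithMaxUnavailable() appsv1.StatefulSetUpdateStrategy {
	return appsv1.StatefulSetUpdateStrategy{
		Type: appsv1.RollingUpdateStatefulSetStrategyType,
		RollingUpdate: &appsv1.RollingUpdateStatefulSetStrategy{
			MaxUnavailable: ptr.To(intstr.FromInt32(1)),
		},
	}
}
```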

ishan16696 commented 1 month ago

> This scenario can also be avoided by setting spec.updateStrategy.rollingUpdate.maxUnavailable to 1, but this field is only available behind an api-server feature gate and is still an alpha feature. I was just wondering: can we use this feature until etcd-druid moves to the OnDelete strategy?

We decided that, since it is an alpha feature, we won't be using it and will move directly to the OnDelete strategy.

unmarshall commented 1 month ago

Rough Discussion notes: On-Delete strategy for stateful set.md

seshachalam-yv commented 3 weeks ago

After an offline meeting with @ashwani2k , @ishan16696 , and @renormalize , we discussed three scenarios related to the OnDelete strategy:

Safe-to-Evict Flag and Voluntary Disruptions:

If we set cluster-autoscaler.kubernetes.io/safe-to-evict: "false", consider a scenario where the Vertical Pod Autoscaler (VPA) is evicting a pod as a voluntary disruption. At the same time, there may be no unhealthy pods, compelling us to select a candidate and trigger a pod deletion to apply the latest updates. This could lead to transient quorum loss, because Pod Disruption Budgets (PDBs) are not respected by the direct deletion calls made by the OnDelete pod updater.

Simultaneous Reconciliation and Node Update:

Suppose etcd reconciliation and node reconciliation (rolling update of a node) occur concurrently, leading to a node drain attempting to evict a pod. Simultaneously, the OnDelete update component might also select this candidate and trigger deletion. This action can cause transient quorum loss because PDBs are not respected by the deletion calls from the OnDelete pod updater.

Etcd Reconciliation with Node-Pressure Eviction:

Consider a scenario where, during etcd reconciliation, we select a candidate and trigger a pod deletion. At the same time, due to high resource utilization by an already running pod, the kubelet may initiate a node-pressure eviction, an involuntary disruption that does not respect PDBs.

From our brainstorming session, we concluded that we should use the eviction API whenever we are deleting a healthy pod during voluntary disruptions. This approach ensures that any simultaneous involuntary disruptions can be mitigated by PDBs.

Key Takeaways:

Evict Healthy Pods: Use the eviction API to manage healthy pods, respecting PDBs and preventing disruptions.

Delete Unhealthy Pods: Directly delete unhealthy pods when necessary, as they are not necessarily protected by PDBs.
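A minimal client-go sketch of these takeaways (the function names are hypothetical, not druid's actual implementation; health is approximated here by the pod's Ready condition):

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictOrDelete applies the takeaways above: healthy pods go through the
// eviction API so that PDBs are honored, while unhealthy pods are deleted
// directly.
func evictOrDelete(ctx context.Context, cs kubernetes.Interface, pod *corev1.Pod) error {
	if isHealthy(pod) {
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{
				Name:      pod.Name,
				Namespace: pod.Namespace,
			},
		}
		// The API server rejects the eviction (HTTP 429) if it would violate
		// a PDB; the caller can requeue and retry instead of risking quorum.
		return cs.CoreV1().Pods(pod.Namespace).EvictV1(ctx, eviction)
	}
	// Unhealthy pod: delete directly, as discussed above.
	return cs.CoreV1().Pods(pod.Namespace).Delete(ctx, pod.Name, metav1.DeleteOptions{})
}

// isHealthy treats a pod with a True Ready condition as healthy.
func isHealthy(pod *corev1.Pod) bool {
	for _, cond := range pod.Status.Conditions {
		if cond.Type == corev1.PodReady {
			return cond.Status == corev1.ConditionTrue
		}
	}
	return false
}
```

Going through the eviction subresource for healthy pods is what lets PDBs guard against the concurrent-disruption scenarios 1 and 2 described above.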

By adopting this strategy, we safeguard against scenarios 1 and 2 and, to some extent, can also prevent scenario 3 if node-pressure eviction occurs before our deletion process begins. This method ensures greater stability and reliability in managing our Kubernetes resources.

For further understanding of Voluntary and Involuntary disruptions, you can read more here.

If PR #855 is merged, we can fully utilize eviction for managing pods.