gardener / etcd-druid

An etcd operator to configure, provision, reconcile and monitor etcd clusters.
https://gardener.github.io/etcd-druid/
Apache License 2.0

[Feature] Allow druid to delete and trigger recreation of PVCs on-demand #481

Open shreyas-s-rao opened 1 year ago

shreyas-s-rao commented 1 year ago

Feature (What you would like to be added): I would like to leverage the data restoration functionality of the etcd-backup-restore sidecar (in both single- and multi-node etcd clusters) to allow druid to delete PVC(s) on-demand and trigger recreation of the PVC(s) later.

Motivation (Why is this needed?): There are use-cases where a user wants to switch the storage class of the PVC to a better one, change the volume size of the etcd disk, or switch to encrypted disks. Today, any change to the volume configuration in the Etcd CR leads to reconciliation errors, because updates to the statefulset's volumeClaimTemplates field are forbidden, as can be seen from https://github.com/gardener/gardener-extension-provider-aws/issues/646#issuecomment-1330188838. I would like druid to catch such errors, and possibly even handle them gracefully in a certain way, if I specify it via annotation(s) on the Etcd CR.

Approach/Hint to implement the solution (optional): Introduce annotations in the Etcd CR, something like druid.gardener.cloud/operation: "recreate-pvc/etcd-main-0" and druid.gardener.cloud/operation: "recreate-sts" (if necessary), that druid sees and acts upon as follows (see the sketch after this list):

  1. Scale down the statefulset (if it isn't already scaled down)
  2. Delete the PVC associated with the pod specified in the annotation value - maybe there needs to be a better way to specify this annotation
  3. Delete the statefulset (to allow updating the volumeClaimTemplates of the statefulset, which is a "forbidden" field, i.e. the API server rejects updates to this field of the statefulset spec) - based on the recreate-sts annotation
  4. Continue with regular reconciliation of the Etcd resource, which will recreate the statefulset and subsequently recreate the PVC and restore the data (in case of single-node etcd) or sync data with the leader (in case of multi-node etcd)

Also enhance the immutableFieldUpdate function introduced in https://github.com/gardener/etcd-druid/pull/408 and see how this can be leveraged for the new druid.gardener.cloud/operation: recreate-sts annotation, if necessary.

cc @unmarshall @timuthy @vlerenc

timuthy commented 1 year ago

API-wise, this approach seems procedural, which we probably want to avoid.

Do we need to expose a function-style operation, or isn't it possible to stick to Kubernetes's well-known desired/actual state paradigm, i.e. API changes trigger changes that Druid performs in order to transform the actual state into the desired one?

shreyas-s-rao commented 1 year ago

isn't it possible to stick to Kubernetes's well-known desired/actual state paradigm, i.e. API changes trigger changes that Druid performs in order to transform the actual state into the desired one?

That was my first thought as well. The issue here is that any change to the volumeClaimTemplates of the statefulset is forbidden. If we are to allow changing the volume-related spec in the Etcd resource (storageCapacity, storageClass), we'll need to define a strict behavior for it, like "if a change is made to either of these fields, druid must always recreate the statefulset, accompanied by a deletion of the volume". But that's risky, because it means deletion of etcd data.
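Continuing the sketch above (same package and imports), such a "strict behavior" could start with detecting drift between the Etcd spec's storageCapacity/storageClass and what the statefulset currently uses; the helper name is hypothetical:

```go
// storageSpecChanged reports whether the volume-related Etcd spec no longer
// matches the statefulset's (immutable) volumeClaimTemplates, i.e. the case
// that would require statefulset recreation plus volume deletion and hence an
// explicit confirmation by the user.
func storageSpecChanged(etcd *druidv1alpha1.Etcd, sts *appsv1.StatefulSet) bool {
	tmpl := sts.Spec.VolumeClaimTemplates[0]
	if etcd.Spec.StorageClass != nil && tmpl.Spec.StorageClassName != nil &&
		*etcd.Spec.StorageClass != *tmpl.Spec.StorageClassName {
		return true
	}
	if etcd.Spec.StorageCapacity != nil &&
		!etcd.Spec.StorageCapacity.Equal(tmpl.Spec.Resources.Requests[corev1.ResourceStorage]) {
		return true
	}
	return false
}
```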

Of course, we can safeguard this operation with something similar to a druid.gardener.cloud/pvc-deletion-confirmation annotation, and also make druid compulsorily trigger a full snapshot of etcd before scaling down the statefulset. But the fact remains that the whole operation is more or less "procedural" in a strict sense.
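A minimal sketch of such a safeguard, assuming a hypothetical druid.gardener.cloud/pvc-deletion-confirmation annotation; etcd-backup-restore does expose an HTTP API for on-demand snapshots, but the exact endpoint and method used below are illustrative:

```go
package reconcile

import (
	"context"
	"fmt"
	"net/http"
)

// Hypothetical confirmation annotation, as suggested above.
const pvcDeletionConfirmation = "druid.gardener.cloud/pvc-deletion-confirmation"

// confirmAndSnapshot refuses to proceed without explicit confirmation and
// triggers an out-of-schedule full snapshot before any destructive step.
func confirmAndSnapshot(ctx context.Context, annotations map[string]string, backupRestoreURL string) error {
	if annotations[pvcDeletionConfirmation] != "true" {
		return fmt.Errorf("refusing to delete PVC: annotation %s is not set to true", pvcDeletionConfirmation)
	}
	// Illustrative endpoint for an on-demand full snapshot via the
	// backup-restore sidecar.
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, backupRestoreURL+"/snapshot/full", nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("full snapshot failed: %s", resp.Status)
	}
	return nil
}
```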

Also, the Etcd resource does not currently store any state for the volumes it uses (unless we plan to do that using the EtcdMembers field), so there's no strict desired/actual state maintained for the etcd volumes, at least at the moment.

shreyas-s-rao commented 1 year ago

Also, other use-cases for being able to programmatically delete a PVC on-demand are:

  1. Volume deletion upon shoot cluster hibernation: this allows us to save costs while the etcd statefulset is scaled down, as long as we have safeguarded the etcd data using backups. Druid can trigger a full snapshot of the etcd data before PVC deletion.
  2. Handling permanent quorum loss: deleting PVCs on-demand moves us at least one step closer to automating the handling of permanent quorum loss for multi-node etcd clusters. Of course, the main challenge there is accurately detecting a permanent quorum loss in the first place, but the operator need only add one annotation to the Etcd resource and the rest is taken care of by druid, as part of a handlePermanentQuorumLoss flow which reuses the deletePVC flow (see the sketch after this list).
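A sketch of use-case 2, in the same package as the earlier sketches; the handlePermanentQuorumLoss name comes from the comment above, while the label selector is illustrative:

```go
// handlePermanentQuorumLoss sketches the quorum-loss flow: once an operator
// has confirmed permanent quorum loss and annotated the Etcd resource, druid
// deletes every member PVC and lets regular reconciliation rebuild the
// cluster from the last full snapshot.
func handlePermanentQuorumLoss(ctx context.Context, c client.Client, etcd *druidv1alpha1.Etcd) error {
	pvcs := &corev1.PersistentVolumeClaimList{}
	if err := c.List(ctx, pvcs, client.InNamespace(etcd.Namespace),
		client.MatchingLabels{"instance": etcd.Name}); err != nil { // selector is illustrative
		return err
	}
	// Reuse the deletePVC flow for each member volume.
	for i := range pvcs.Items {
		if err := c.Delete(ctx, &pvcs.Items[i]); client.IgnoreNotFound(err) != nil {
			return err
		}
	}
	return nil
}
```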

/cc @abdasgupta

vlerenc commented 1 year ago

Pro:

Con:

So, whether or not we automate the volume replacement (scripts have risks themselves, so a well-tested and safeguarded druid implementation may still be the better option, even if we are all fearful of volume-deletion code), I would still like to raise the question whether we can offer a way to do that without downtime, as initially hoped/planned (HA became the prerequisite for volume replacement for critical clusters, and scaling down the statefulset would defeat this goal).

shreyas-s-rao commented 1 year ago

Yes, the stateful set would show "the wrong volume template", but it would adopt recreated volumes, so that these changes can be implemented without any downtime, right?

This would lead to inconsistency in the Kubernetes desired/actual state paradigm mentioned earlier by @timuthy. The zero-downtime update can still be done using the steps provided by @unmarshall in https://github.com/gardener/gardener-extension-provider-aws/issues/646#issuecomment-1330188838. So essentially, we can make an Etcd spec change for storageCapacity / storageClass, accompanied by something like a confirmation annotation, to make sure that volumes aren't deleted without explicit confirmation. @vlerenc WDYT?
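For reference, the zero-downtime variant from the linked comment hinges on deleting the statefulset with orphan propagation, so that pods and PVCs keep running while the object is recreated with the new volumeClaimTemplates. A minimal sketch in the same package as the earlier ones (plus the k8s.io/apimachinery/pkg/apis/meta/v1 import as metav1):

```go
// recreateStatefulSetInPlace deletes the statefulset without touching its
// pods or PVCs (orphan propagation), then recreates it with the updated
// (otherwise immutable) volumeClaimTemplates. The new statefulset adopts the
// running pods, so the template change itself causes no downtime; member
// volumes can then be replaced one at a time.
func recreateStatefulSetInPlace(ctx context.Context, c client.Client, old, desired *appsv1.StatefulSet) error {
	if err := c.Delete(ctx, old, client.PropagationPolicy(metav1.DeletePropagationOrphan)); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	// In a real reconciler we would requeue until the old object is fully
	// gone before creating the replacement.
	return c.Create(ctx, desired)
}
```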

vlerenc commented 1 year ago

Yes, exactly. I was commenting on the sentence "Scale down the statefulset (if it isn't already scaled down)". And if you have the capacity to automate https://github.com/gardener/gardener-extension-provider-aws/issues/646#issuecomment-1330188838, that would be most welcome.

shreyas-s-rao commented 1 year ago

/assign

shreyas-s-rao commented 1 year ago

/assign @seshachalam-yv

shreyas-s-rao commented 1 year ago

Blocked until #588 is implemented