gardener / etcd-druid

An etcd operator to configure, provision, reconcile and monitor etcd clusters.
Apache License 2.0

[Enhancement] Capture LKG status of applied configuration as part of etcd status to help with etcd scale up #416

Open unmarshall opened 2 years ago

unmarshall commented 2 years ago

Feature (What you would like to be added): Today etcd-druid only reacts to changes made to the etcd resource. It does not know which change(s) to the etcd spec were last applied successfully. With this enhancement, etcd-druid captures the successfully applied configuration as part of the status, so that the controllers can compare the last-known-good (LKG) state with the current state of the etcd resource and take appropriate action if needed.
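A minimal Go sketch of the idea, under stated assumptions: the types `AppliedConfig`, `EtcdStatus` and `Etcd`, and the helpers `RecordLKG` and `HasDrifted`, are hypothetical names for illustration and are not etcd-druid's actual API. The sketch only shows how recording a copy of the spec after a successful reconcile lets a controller later compare LKG against the current spec.

```go
package main

// AppliedConfig is a hypothetical subset of the etcd spec that is worth
// remembering once it has been applied successfully.
type AppliedConfig struct {
	Replicas       int32
	PeerTLSEnabled bool
}

// EtcdStatus is a hypothetical status carrying the last-known-good (LKG)
// configuration. A nil pointer means nothing has been applied yet.
type EtcdStatus struct {
	LastAppliedConfig *AppliedConfig
}

// Etcd is a simplified stand-in for the etcd resource.
type Etcd struct {
	Spec   AppliedConfig
	Status EtcdStatus
}

// RecordLKG stores a copy of the current spec in the status. It is meant to
// be called only after a reconciliation has completed successfully.
func RecordLKG(e *Etcd) {
	cfg := e.Spec // copy, so later spec edits do not alter the LKG record
	e.Status.LastAppliedConfig = &cfg
}

// HasDrifted reports whether the current spec differs from the LKG
// configuration. With no LKG recorded, the spec counts as not yet applied.
func HasDrifted(e *Etcd) bool {
	if e.Status.LastAppliedConfig == nil {
		return true
	}
	return *e.Status.LastAppliedConfig != e.Spec
}
```

In this sketch a controller would call `HasDrifted` at the start of a reconcile and `RecordLKG` at the end of a successful one. Which fields belong in the recorded configuration is left open here.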

Motivation (Why is this needed?): The motivating use case is upgrading an etcd cluster from a single-node cluster to a multi-node cluster. Currently, for a single-node cluster the peer URL is not TLS enabled, since there is no peer. When the etcd resource is changed so that the cluster is upgraded from single-node to multi-node, secure peer communication is required, which in turn requires TLS configuration for peer-to-peer communication. To enable scale-up of the etcd cluster, the existing member needs to update its peer URL and make it TLS enabled, so that when additional members start and try to join the cluster (one learner at a time) they are able to establish peer communication over HTTPS. A change in the peer URL of the existing member requires a mandatory restart of the etcd process (see here). In the current setup this results in a total of 2 restarts of the etcd pod before the peer URL of the existing member (single-node etcd cluster) reflects a TLS-enabled URL. To avoid 2 restarts, the idea is to delete the StatefulSet and create it again, which results in a single restart.

To delete the StatefulSet (STS) conditionally, etcd-druid needs to know what has changed in the spec. However, controller-runtime does not provide visibility into what has changed, whereas this was possible when using client-go. Therefore we need to capture the LKG configuration as part of the status of the etcd resource.
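The conditional delete described above could be decided roughly as follows. This is a sketch only: `PeerConfig` and `ShouldRecreateStatefulSet` are hypothetical names, and the exact condition (scale-up from one replica combined with peer TLS being switched on) is an assumption drawn from the motivation above, not the implemented behaviour.

```go
package main

// PeerConfig is a hypothetical view of the settings that matter for the
// scale-up decision, taken either from the LKG status or from the spec.
type PeerConfig struct {
	Replicas       int32
	PeerTLSEnabled bool
}

// ShouldRecreateStatefulSet reports whether the StatefulSet should be deleted
// and created again instead of being updated in place. Assumption: this is
// wanted only when a single-node cluster is scaled up and its peer URL has to
// switch to TLS at the same time.
func ShouldRecreateStatefulSet(lkg *PeerConfig, desired PeerConfig) bool {
	if lkg == nil {
		// Nothing has been applied successfully yet, so there is no
		// existing member whose peer URL would have to change.
		return false
	}
	scaledUpFromSingleNode := lkg.Replicas == 1 && desired.Replicas > 1
	peerTLSTurnedOn := !lkg.PeerTLSEnabled && desired.PeerTLSEnabled
	return scaledUpFromSingleNode && peerTLSTurnedOn
}
```

The LKG value is passed as a pointer so that "no configuration recorded yet" can be told apart from a recorded configuration with zero values.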

unmarshall commented 2 years ago

/assign