gardener / etcd-druid

An etcd operator to configure, provision, reconcile and monitor etcd clusters.
Apache License 2.0

[Enhancement] New condition to ensure all etcd's join a single cluster #595

Open aaronfern opened 1 year ago

aaronfern commented 1 year ago

Enhancement (What you would like to be added): As of today, all etcd-druid conditions rely only on all pods running and the etcd cluster being reachable. We report a successful etcd cluster as long as this holds. It is rare, but if old PVCs exist, the etcd members may not all join the same cluster and may instead form multiple clusters, all connected to the same service. In this case, etcd-druid sees that all pods are running and assumes a successful cluster.

We need a way for etcd-druid to ensure that all etcd members join the same cluster and to log the result of this check.

Motivation (Why is this needed?): All pods are reachable via the same service, so if there are multiple clusters, data will be split between them, leading to data inconsistencies.

Approach/Hint to the implement solution (optional): My proposal right now would be to add a new condition to the etcd status. We would check all renewed leases and ensure that there is only one leader. The condition is logged so that it can come to an operator's attention and be fixed. When we introduce member state, this functionality can be moved there.

aaronfern commented 1 year ago

/assign

ishan16696 commented 9 months ago

> Approach/Hint to the implement solution (optional): My proposal right now would be to add a new condition to the etcd status. We would check all renewed leases and ensure that there is only one leader. The condition is logged and it can come to an operators attention so that it can be fixed. When we introduce member state, this functionality can be moved there.

Things will change going forward, as we will be using the etcd-member custom resource: https://github.com/gardener/etcd-druid/pull/658. Since @aaronfern is not working on this issue, I'm unassigning @aaronfern from it.