Open shreyas-s-rao opened 1 year ago
Another candidate for improvement is the current behavior of BackupReady
condition, which shows a non-True
status for a fresh etcd cluster, even after the initial full snapshot succeeds. Only after the first delta snapshot succeeds, the condition is set to True
. This is incorrect, since the backup is technically healthy until proven otherwise by a delta snapshot upload failure.
Additionally, since Gardener relies on this condition, it does not mark the shoot cluster as healthy unless the BackupReady condition is healthy, thus increasing the time to mark the shoot is healthy.
With #867 , this should no longer be an issue.
Feature (What you would like to be added): Streamline etcd resource status to make it more meaningful and accurate for human operators.
Motivation (Why is this needed?): Currently, etcd status has the following fields:
ObservedGeneration
Etcd
Conditions
ServiceName
LastError
ClusterSize
CurrentReplicas
Replicas
ReadyReplicas
Ready
UpdatedReplicas
LabelSelector
Members
PeerUrlTLSEnabled
As part of https://github.com/gardener/etcd-druid/pull/594, many of these fields have already been marked deprecated, such as
Status.ClusterSize
,Status.ServiceName
andStatus.UpdatedReplicas
, and will be removed in the future. But there are still many fields which are not really required to deduce the state of the etcd resource, such asStatus.Etcd
(which provides a self-reference to the Etcd object and isn't required since anybody who has access to the etcd status already has a reference to the etcd object itself),Status.Ready
(readiness is a state, and can be better represented by acondition
denoting quorum in the cluster, which means that the cluster is "ready" to serve traffic),Status.Members
(which will probably be replaced by something likeStatus.MemberRefs
via https://github.com/gardener/etcd-druid/issues/206), and fieldsStatus.Replicas
,Status.ReadyReplicas
andStatus.CurrentReplicas
which are blindly copied over from the statefulset status and provide no meaningful information about the cluster, but rather druid should to the work of deducing such member-specific info and correctly populating conditions.The
Status.Conditions
types that set today are:Ready
Denotes readiness of the cluster, but for a human operator this condition would make more sense when renamed to
QuorumReached
, to denote that the cluster has reached quorum, which means it's ready to serve traffic.AllMembersReady
Currently, this does not provide any information that
QuorumReached
cannot already provide. We would still require this condition to deduce whether every member is alive, ie, both containers in the pod are up. In such a case, a better name for this condition would beAllMembersAlive
since readiness of each member already denotes readiness of the cluster, whereas liveness of a member has no bearing on whether the cluster is ready or not. In order to enable this, we will need to enhance the lease renewal logic in the backup sidecar to also take into consideration the liveness of its etcd member, via a serializable GET call to the etcd (which can be achieved via a liveness endpoint on the etcd-wrapper container).BackupReady
This denotes whether the backups are uploaded on schedule as expected and are not stale. It considers both full and delta backups, and is computed using a complex logic that considers various different combinations of the states of the full snapshot lease and delta snapshot lease, which becomes quite complex to read and maintain. It would make better sense to split this condition into two simpler conditions for full and delta snapshot staleness respectively, so that druid can individually compute the condition state for both these and populate them individually, providing a cleaner view to human operators.
Approach/Hint to the implement solution (optional):
Proposed status fields to be removed/deprecated, or already deprecated:
Etcd
ServiceName
ClusterSize
Replicas
ReadyReplicas
CurrentReplicas
UpdatedReplicas
LabelSelector
LastError
Proposed conditions:
/area usability