gardener / etcd-druid

An etcd operator to configure, provision, reconcile and monitor etcd clusters.
Apache License 2.0

[Enhancement] Streamline etcd resource status #645

Open shreyas-s-rao opened 1 year ago

shreyas-s-rao commented 1 year ago

Feature (What you would like to be added): Streamline etcd resource status to make it more meaningful and accurate for human operators.

Motivation (Why is this needed?): Currently, etcd status has the following fields:

As part of https://github.com/gardener/etcd-druid/pull/594, many of these fields have already been marked deprecated, such as Status.ClusterSize, Status.ServiceName and Status.UpdatedReplicas, and will be removed in the future. But there are still many fields which are not really required to deduce the state of the etcd resource:

- Status.Etcd provides a self-reference to the Etcd object. It isn't required, since anybody who has access to the etcd status already has a reference to the etcd object itself.
- Status.Ready: readiness is a state, and can be better represented by a condition denoting quorum in the cluster, which means that the cluster is "ready" to serve traffic.
- Status.Members will probably be replaced by something like Status.MemberRefs via https://github.com/gardener/etcd-druid/issues/206.
- Status.Replicas, Status.ReadyReplicas and Status.CurrentReplicas are blindly copied over from the statefulset status and provide no meaningful information about the cluster. Instead, druid should do the work of deducing such member-specific info and correctly populating conditions.

The Status.Conditions types that are set today are:

Approach/Hint to implement the solution (optional):

Proposed status fields to be removed/deprecated, or already deprecated:

Proposed conditions:

- type: QuorumReached
  status: True|False|Unknown
  reason: MajorityMembersReady | MajorityMembersUnready | QuorumNotChecked
  message: "x/y members ready: <member-id>,<member-id>; z/y members unready: <member-id>" | "Quorum not checked"
- type: AllMembersAlive
  status: True|False|Unknown
  reason: MemberLeasesRenewed | MemberLeasesStale | MemberLeasesNotChecked
  message: "x/y members alive: <member-id>,<member-id>; z/y members not alive: <member-id>" | "Member leases not checked"
- type: FullSnapshotStale/FullSnapshotOutdated
  status: True|False|Unknown
  reason: SnapshotUploadedOnSchedule | SnapshotMissedSchedule | CannotDetermineSnapshotUploadStatus
  message: "Full snapshot uploaded successfully in the last x (time)" | "Cannot determine snapshot upload status"
- type: DeltaSnapshotStale/DeltaSnapshotOutdated
  status: True|False|Unknown
  reason: SnapshotUploadedOnSchedule | SnapshotMissedSchedule | CannotDetermineSnapshotUploadStatus
  message: "Delta snapshot uploaded successfully in the last x (time)" | "Cannot determine snapshot upload status"
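To make the proposed QuorumReached condition concrete, here is a minimal Go sketch of how its status, reason and message could be derived from per-member readiness. The `Condition` struct, the `quorumCondition` function and the `map[string]bool` input are hypothetical and exist only for illustration; they are not druid's actual types, and an empty input is assumed to mean "not checked".

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Condition is a hypothetical, simplified stand-in for a status condition.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// quorumCondition derives the proposed QuorumReached condition from a map of
// member ID to readiness. Assumption: an empty or nil map means that
// readiness could not be checked.
func quorumCondition(members map[string]bool) Condition {
	cond := Condition{Type: "QuorumReached"}
	if len(members) == 0 {
		cond.Status = "Unknown"
		cond.Reason = "QuorumNotChecked"
		cond.Message = "Quorum not checked"
		return cond
	}

	var ready, unready []string
	for id, ok := range members {
		if ok {
			ready = append(ready, id)
		} else {
			unready = append(unready, id)
		}
	}
	// Map iteration order is random, so sort for a stable message.
	sort.Strings(ready)
	sort.Strings(unready)

	total := len(members)
	// Quorum requires a strict majority of members to be ready.
	if len(ready) > total/2 {
		cond.Status, cond.Reason = "True", "MajorityMembersReady"
	} else {
		cond.Status, cond.Reason = "False", "MajorityMembersUnready"
	}
	cond.Message = fmt.Sprintf("%d/%d members ready: %s; %d/%d members unready: %s",
		len(ready), total, strings.Join(ready, ","),
		len(unready), total, strings.Join(unready, ","))
	return cond
}

func main() {
	c := quorumCondition(map[string]bool{"etcd-0": true, "etcd-1": true, "etcd-2": false})
	fmt.Printf("%s=%s (%s): %s\n", c.Type, c.Status, c.Reason, c.Message)
}
```

The same pattern would apply to AllMembersAlive, with lease renewal replacing readiness as the per-member input.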

/area usability

shreyas-s-rao commented 1 week ago

Another candidate for improvement is the current behavior of the BackupReady condition, which shows a non-True status for a fresh etcd cluster even after the initial full snapshot succeeds. The condition is set to True only after the first delta snapshot succeeds. This is incorrect, since the backup is technically healthy until proven otherwise by a delta snapshot upload failure.

Additionally, since Gardener relies on this condition, it does not mark the shoot cluster as healthy unless the BackupReady condition is healthy, thus increasing the time taken to mark the shoot as healthy.

With #867, this should no longer be an issue.
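For illustration, the "healthy until proven otherwise" behavior described in this comment could look roughly like the Go sketch below. It is not the implementation from #867; the `snapshotState` type and the `backupReady` function are hypothetical, and treating a missing full snapshot as Unknown is an assumption.

```go
package main

import "fmt"

// snapshotState is a hypothetical tri-state for the latest snapshot upload.
type snapshotState int

const (
	snapshotNotObserved snapshotState = iota // no upload attempted or observed yet
	snapshotSucceeded
	snapshotFailed
)

// backupReady returns the status of a BackupReady-style condition.
// Once the full snapshot has succeeded, the backup is treated as healthy
// unless a snapshot upload failure proves otherwise.
func backupReady(full, delta snapshotState) string {
	switch {
	case full == snapshotFailed || delta == snapshotFailed:
		return "False"
	case full == snapshotSucceeded:
		// The delta snapshot has either succeeded or not been attempted yet.
		return "True"
	default:
		// Assumption: without a successful full snapshot, nothing can be said.
		return "Unknown"
	}
}

func main() {
	// Fresh cluster: initial full snapshot done, no delta snapshot yet.
	fmt.Println(backupReady(snapshotSucceeded, snapshotNotObserved))
}
```

With this logic, a fresh cluster reports True as soon as the initial full snapshot succeeds, so a consumer such as Gardener would not have to wait for the first delta snapshot.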