Open shreyas-s-rao opened 1 year ago
/assign @aaronfern @abdasgupta
For improving integration tests and e2e tests, and possibly adding chaos tests.
Participants: @shreyas-s-rao, @unmarshall
State as of today
etcd-reconciler
Today the etcd reconciler reacts to changes made to etcd
resource when a reconcile annotation is set on the etcd
resource. Among other things, it uses component.OpWaiter
to deploy and wait for stateful set to reach the desired state as defined in waitDeploy. If the desired state is not reached then it times-out and returns an error which causes a requeue of an event. Upon completion of a reconciliation run(successfully or unsuccessfully) it updates the following in the etcd status:
etcd.Status.lastError
etcd.Status.Ready
etcd.Status.Replicas
etcd.Status.ObservedGeneration
custodian-reconciler
Watches for:
etcd
resource for which the reconciliation has been finished by the etcd-reconciler. See here.StatefulSet
events, maps it to the owner etcd
resource and enqueues the request to the reconciler.Only updates the etcd status if etcd.Status.LastError
is empty. This ensures that etcd-reconciler has successfully completed its reconciliation run.
It updates the following:
etcd.Status.Members
etcd.Status.Conditions
etcd.Status.CurrentReplicas
etcd.Status.ReadyReplicas
etcd.Status.Ready
We propose changes in responsibilities of these 2 reconcilers.
etcd-reconciler
Following changes should be done:
etcd
resourceetcd.Status.ObservedGeneration
etcd.Status.lastError
etcd-custodian
Following changes should be done:
etcd.Status
except the following:
etcd.Status.lastError
- to be updated by etcd-reconciler as it is directly linked to reconciliation operation.etcd.Status.ObservedGeneration
- to be updated by etcd-reconciler as it is directly linked to reconciliation operation./assign
Feature (What you would like to be added): Re-look at design and implementation of etcd-druid controllers, especially in terms of segregation of responsibilities. Redesign or refactor as required.
Motivation (Why is this needed?): There are certain resources that are updated by different agents, such as Etcd status subresource which is updated by both etcd controller and custodian controller - this potentially leads to many race condition scenarios, which then need to be mitigated by some extra coordination steps. This feels complicated, and needs to be simplified.
We also need to re-look at which resources are watched by each controller, and which resources the controller receives in its request, and which resource it finally acts upon. The mapping of these dependent resources needs to be re-looked at.
Finally, we need good unit testing for modular units of code, as well as proper documentation of the APIs and design, as well as design decisions taken.
Approach/Hint to the implement solution (optional): Prerequisite to this story is https://github.com/gardener/etcd-druid/issues/512, which is implemented by https://github.com/gardener/etcd-druid/pull/506 and provides a basis for further refactoring efforts.
This story encompasses the following tasks:
/area quality /area testing /area documentation /area dev-productivity