gardener / etcd-druid

An etcd operator to configure, provision, reconcile and monitor etcd clusters.
Apache License 2.0
71 stars 49 forks source link

☂️ Redesign and refactor controllers, testing and documentation #527

Open shreyas-s-rao opened 1 year ago

shreyas-s-rao commented 1 year ago

Feature (What you would like to be added): Re-look at design and implementation of etcd-druid controllers, especially in terms of segregation of responsibilities. Redesign or refactor as required.

Motivation (Why is this needed?): There are certain resources that are updated by different agents, such as Etcd status subresource which is updated by both etcd controller and custodian controller - this potentially leads to many race condition scenarios, which then need to be mitigated by some extra coordination steps. This feels complicated, and needs to be simplified.

We also need to re-look at which resources are watched by each controller, and which resources the controller receives in its request, and which resource it finally acts upon. The mapping of these dependent resources needs to be re-looked at.

Finally, we need good unit testing for modular units of code, as well as proper documentation of the APIs and design, as well as design decisions taken.

Approach/Hint to the implement solution (optional): Prerequisite to this story is https://github.com/gardener/etcd-druid/issues/512, which is implemented by https://github.com/gardener/etcd-druid/pull/506 and provides a basis for further refactoring efforts.

This story encompasses the following tasks:

/area quality /area testing /area documentation /area dev-productivity

shreyas-s-rao commented 1 year ago

/assign @aaronfern @abdasgupta

For improving integration tests and e2e tests, and possibly adding chaos tests.

unmarshall commented 1 year ago

MOM

Participants: @shreyas-s-rao, @unmarshall

State as of today

etcd-reconciler

Today the etcd reconciler reacts to changes made to etcd resource when a reconcile annotation is set on the etcd resource. Among other things, it uses component.OpWaiter to deploy and wait for stateful set to reach the desired state as defined in waitDeploy. If the desired state is not reached then it times-out and returns an error which causes a requeue of an event. Upon completion of a reconciliation run(successfully or unsuccessfully) it updates the following in the etcd status:

custodian-reconciler

Watches for:

Only updates the etcd status if etcd.Status.LastError is empty. This ensures that etcd-reconciler has successfully completed its reconciliation run. It updates the following:

We propose changes in responsibilities of these 2 reconcilers.

etcd-reconciler

Following changes should be done:

etcd-custodian

Following changes should be done:

shreyas-s-rao commented 1 year ago

/assign