☂ Druid Refactor to Address Multiple Controller Conflicts

seshachalam-yv commented 11 months ago

Feature (What you would like to be added):

Avoid Race Conditions: To eliminate race conditions, the redesign proposes all etcd status updates be managed solely by the etcd controller. This includes not waiting for StatefulSet (sts) readiness for spec reconciliation and capturing Persistent Volume Claim (PVC) errors in a dedicated etcd condition and get rid of custodian controller (or repurpose it to a care controller).
Codebase Refactoring: To prevent downstream projects from vendoring all internal code from druid and encountering cyclic dependency issues, the solution involves moving packages into internal.
Boilerplate Code Removal: To streamline the codebase, the redesign proposes replacing the component model with the ResourceOperator model, which would eliminate the need for boilerplate ocomponents like values_helper.go files.
Webhook for Safeguarding Owned Resources: Validating webhook is proposed to safeguard against unintentional updates or deletions of owned resources. This webhook is crucial as it allows changes to these resources exclusively when etcd reconciliation or deletion activities are underway. The introduction of this webhook effectively renders the custodian (or care) controller redundant, streamlining the system’s management.
Improve Observability and Debuggability
- Introduce LastOperation and LastErrors as part of etcd status. Ensure that etcd reconciler captures relevant operations and errors thus improving visibility of reconciliation process.
- Add RunID for every reconciler run which improves debuggability. On can now filter logs based on runID thus providing all logs for all actions/errors done/encountered in a single reconciler run.
Streamline Labels on resources Ensure that we use k8s recommended labels for all resources that are provisioned by druid.
Streamline error handling - Introduce custom error type to capture contextual information and start to use it in operators and controllers. Druid errors captured will then be use to populate the etcd resource status LastErrors as well.
Fixes the following bugs:
1. In utils/lease.go - function isPeerURLTLSEnabled does not treat absence of annotation as false and currently ignores it. This results in an incorrect state of the overall status of peer URL TLS enablement.
Enhancements to peer TLS handling: The current implementation in druid to enable peer TLS for etcd clusters is fragile, and is prone to errors if druid crashes during the process of updating peer TLS for the etcd cluster. This will now be made deterministic and quicker, by eliminating the need for druid to perform a second restart of the member pods for updating member peer URLs.
Fixed BackupRestore TLS volume mounts Even though ETCD spec provides a placeholder for separately specifying TLS configuration for backup-restore it is not really used. The reason it has worked till now is because the CA, Server and Client certificates are same for both etcd-backup-restore and etcd and this is the case for gardener. However this will clearly stop to work if etcd.Spec.Backup.TLS configures it differently as then it will not be honoured.

Motivation (Why is this needed?):

seshachalam-yv commented 11 months ago

/assign @shreyas-s-rao

ishan16696 commented 5 months ago

Enhancements to peer TLS handling: The current implementation in druid to enable peer TLS for etcd clusters is fragile, and is prone to errors if druid crashes during the process of updating peer TLS for the etcd cluster. This will now be made deterministic and error-free by allowing etcd-backup-restore container to update the peer TLS and restart the etcd within the same flow, thus eliminating the need for druid to perform a second restart of the member pods. https://github.com/gardener/etcd-backup-restore/issues/712 https://github.com/gardener/etcd-wrapper/issues/24

In this point , you have described something but after that you have mentioned the 2 PRs. Are those PRs require to achieve point 9 ? If not then please remove them from issue description. If yes then please mention it that these PRs are also require to achieve point 9 as it ambiguous right now.

shreyas-s-rao commented 5 months ago

@ishan16696 thanks for reminding. Updated the issue now.

gardener / etcd-druid

☂ Druid Refactor to Address Multiple Controller Conflicts #728