gardener / etcd-druid

An etcd operator to configure, provision, reconcile and monitor etcd clusters.
Apache License 2.0

[BUG] Etcd resource status is not properly updated #484

Open vlerenc opened 1 year ago

vlerenc commented 1 year ago

Describe the bug: Just a small thing. When only 2 out of 3 pods are available (I forcefully scaled the statefulset down to 2 replicas, continuously, in an endless loop), the etcd resource status is not updated properly, and at least the quorum information is stale: it keeps whatever value it had before. In my case it remained stuck at false, because my replica count was 0 before I set it to 2.

Expected behavior: The status is reflected correctly. In this case the ETCD cluster had quorum and was operational, so quorum should have been reported as true.
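To make the expectation concrete, here is a minimal Go sketch of the majority rule behind it: a cluster of n members keeps quorum as long as at least n/2 + 1 of them are available. The names `quorumSize` and `hasQuorum` are made up for this illustration and are not taken from etcd-druid, and the sketch assumes that ready pods correspond to healthy etcd members.

```go
package main

import "fmt"

// quorumSize returns the smallest number of members that forms a strict
// majority in a cluster of the given size (integer division rounds down).
// Hypothetical helper for illustration, not an etcd-druid function.
func quorumSize(clusterSize int) int {
	return clusterSize/2 + 1
}

// hasQuorum reports whether readyMembers is enough for a majority.
// Assumption: a ready pod counts as a healthy etcd member.
func hasQuorum(readyMembers, clusterSize int) bool {
	if clusterSize <= 0 {
		return false
	}
	return readyMembers >= quorumSize(clusterSize)
}

func main() {
	// The scenario from this report: 2 of 3 pods available.
	fmt.Println(hasQuorum(2, 3))
	// Losing one more member would break quorum.
	fmt.Println(hasQuorum(1, 3))
}
```

With 3 members the majority is 2, so the 2-of-3 case in this report should count as quorate under this rule.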

How To Reproduce (as minimally and precisely as possible): See above.

Logs:

2022-12-09T12:57:08.257Z    INFO    custodian-controller    Updating etcd status with statefulset information   {"etcd": "shoot--core--ha-cp/etcd-events"}
2022-12-09T12:57:08.257Z    INFO    custodian-controller    ETCD status updated for statefulset current replicas: 3, ready replicas: 3, updated replicas: 3 {"etcd": "shoot--core--ha-cp/etcd-events"}
2022-12-09T12:57:09.440Z    INFO    custodian-controller    Requeue item because of last error: retry failed with context deadline exceeded, last error: not enough ready replicas (2/3)    {"etcd": "shoot--core--ha-cp/etcd-main"}
2022-12-09T12:57:09.493Z    INFO    custodian-controller    Requeue item because of last error: retry failed with context deadline exceeded, last error: not enough ready replicas (2/3)    {"etcd": "shoot--core--ha-cp/etcd-main"}
2022-12-09T12:57:10.300Z    INFO    custodian-controller    Updating etcd status with statefulset information   {"etcd": "shoot--core--ha-cp/etcd-events"}
2022-12-09T12:57:10.300Z    INFO    custodian-controller    ETCD status updated for statefulset current replicas: 3, ready replicas: 3, updated replicas: 3 {"etcd": "shoot--core--ha-cp/etcd-events"}
2022-12-09T12:57:11.697Z    INFO    custodian-controller    Updating etcd status with statefulset information   {"etcd": "shoot--core--ha-cp/etcd-events"}
2022-12-09T12:57:11.697Z    INFO    custodian-controller    ETCD status updated for statefulset current replicas: 3, ready replicas: 3, updated replicas: 3 {"etcd": "shoot--core--ha-cp/etcd-events"}
2022-12-09T12:57:12.900Z    INFO    custodian-controller    Updating etcd status with statefulset information   {"etcd": "shoot--core--ha-cp/etcd-events"}
2022-12-09T12:57:12.900Z    INFO    custodian-controller    ETCD status updated for statefulset current replicas: 3, ready replicas: 3, updated replicas: 3 {"etcd": "shoot--core--ha-cp/etcd-events"}
2022-12-09T12:57:15.736Z    INFO    custodian-controller    Requeue item because of last error: retry failed with context deadline exceeded, last error: not enough ready replicas (2/3)    {"etcd": "shoot--core--ha-cp/etcd-main"}
2022-12-09T12:57:17.629Z    INFO    custodian-controller    Requeue item because of last error: retry failed with context deadline exceeded, last error: not enough ready replicas (2/3)    {"etcd": "shoot--core--ha-cp/etcd-main"}
2022-12-09T12:57:18.306Z    INFO    custodian-controller    Requeue item because of last error: retry failed with context deadline exceeded, last error: not enough ready replicas (2/3)    {"etcd": "shoot--core--ha-cp/etcd-main"}

Anything else we need to know?: The issue was discussed with @aaronfern out-of-band in https://sap-ti.slack.com/archives/C0177NLL8V9/p1670587196752109.

abdasgupta commented 1 year ago

I looked into the issue. In my case, the custodian controller updates the ETCD resource fine.

We have a condition in the custodian controller that skips all status reconciliation if an error message is present in the ETCD CR: https://github.com/gardener/etcd-druid/blob/09e62b2de053e45fe2faddc352180b3defd6164a/controllers/etcd_custodian_controller.go#L91 Probably @vlerenc hit some error and then tried to scale down the ETCD cluster, so the custodian controller did not update the status to reflect the scaled-down state. As a mitigation for this bug, I can change the custodian controller so that it updates the status before checking for an error in the ETCD CR. Is that okay, @vlerenc?
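A rough Go sketch of the two orderings, based only on the description above and not on the actual controller code. The type `etcdStatus`, its fields, and all three functions are hypothetical stand-ins, and the quorum check is a plain majority test used here for illustration.

```go
package main

import "fmt"

// etcdStatus is a hypothetical, simplified stand-in for the status of the
// ETCD CR; the real resource has different and more numerous fields.
type etcdStatus struct {
	LastError     string
	ReadyReplicas int
	Quorate       bool
}

// updateStatus records the observed replica count and derives quorum
// from a simple majority rule (an assumption made for this sketch).
func updateStatus(s *etcdStatus, observedReady, clusterSize int) {
	s.ReadyReplicas = observedReady
	s.Quorate = clusterSize > 0 && observedReady >= clusterSize/2+1
}

// reconcileSkipOnError models the behaviour described above: if an error
// is recorded on the CR, status reconciliation is skipped entirely.
func reconcileSkipOnError(s *etcdStatus, observedReady, clusterSize int) {
	if s.LastError != "" {
		return // status stays at whatever it was before
	}
	updateStatus(s, observedReady, clusterSize)
}

// reconcileStatusFirst models the proposed mitigation: refresh the status
// first, and only then let the error check short-circuit any other work.
func reconcileStatusFirst(s *etcdStatus, observedReady, clusterSize int) {
	updateStatus(s, observedReady, clusterSize)
	if s.LastError != "" {
		return
	}
	// Further reconciliation steps would follow here.
}

func main() {
	stale := etcdStatus{LastError: "some earlier error"}
	reconcileSkipOnError(&stale, 2, 3)
	fmt.Printf("skip on error: %+v\n", stale)

	fresh := etcdStatus{LastError: "some earlier error"}
	reconcileStatusFirst(&fresh, 2, 3)
	fmt.Printf("status first:  %+v\n", fresh)
}
```

In the first ordering the quorum flag keeps its old value while an error is set, which would match the stale false in the report; in the second it follows the observed replicas regardless of the error.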

abdasgupta commented 1 year ago

I can't reproduce the issue on my system. I am closing this ticket. Please reopen if the issue is faced again. /close