leaderElection issue on `v0.2.7`

nishantapatil3 commented 1 year ago

Describe the bug

This bug appears on v0.2.7 and was introduced by my change in https://github.com/cisco-open/cluster-registry-controller/pull/33

A scenario that I want to discuss further. (I found this today)

Legend: OCR = old cluster registry (v0.2.2); NCR = new cluster registry (v0.2.7)

When an NCR is deployed, the OCR is currently the leader. The NCR will not become ready until a CA bundle is generated and injected into the ValidatingWebhook, and that only happens once the NCR holds the leader lease. As a result, the NCR is never ready (it never wins leader election).

By default, leaderElection is true in the cluster-registry chart - https://github.com/cisco-open/cluster-registry-controller/blob/master/deploy/charts/cluster-registry/values.yaml#L54

This can be solved by either A) forcing leader election onto the NCR (I don't know how to do this), or B) disabling leaderElection, since the webhook check is validated periodically here - https://github.com/cisco-open/cluster-registry-controller/blob/master/pkg/cert/renewer.go#L117

For example: while upgrading from OCR to NCR, there is a short window in which two cluster registries are deployed; one is the leader and the other is waiting to become the leader.

v0.2.2 cluster-registry: uses /metrics as the readiness probe, which comes up without webhook validation and marks the pod as ready, thereby terminating the old cluster-registry pod.

v0.2.7 cluster-registry: uses /readyz as the readiness probe, which is not marked ready until the webhook is ready. The webhook in turn waits for leader election before generating the CA bundle; only then would the new cluster-registry pod be marked ready and the old pod killed.
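To make the deadlock concrete, here is a minimal Go sketch of the rollout described above. It is a simplified model, not the controller's code: the `pod` type, `simulateRollout`, and the tick loop are all hypothetical, and it assumes that the certificate renewer only runs on the pod that holds the lease.

```go
package main

import "fmt"

// pod is a hypothetical, simplified model of one cluster-registry replica.
type pod struct {
	name        string
	gatesOnCert bool // true: /readyz waits for the webhook CA bundle (v0.2.7-style)
	leader      bool
	certReady   bool
	running     bool
}

// ready reports whether the pod's readiness probe would pass in this model.
func (p *pod) ready() bool {
	return p.running && (!p.gatesOnCert || p.certReady)
}

// simulateRollout models a rolling update from an old pod to a new pod for a
// fixed number of ticks and reports whether the new pod ever becomes ready.
func simulateRollout(leaderElection, newGatesOnCert bool, ticks int) bool {
	oldPod := &pod{name: "old", leader: true, running: true}
	newPod := &pod{name: "new", gatesOnCert: newGatesOnCert, running: true}

	for i := 0; i < ticks; i++ {
		// Assumption: the cert renewer runs only once the pod leads,
		// or immediately when leader election is disabled.
		if !leaderElection || newPod.leader {
			newPod.certReady = true
		}
		// The rollout terminates the old pod only after the new one is ready.
		if newPod.ready() && oldPod.running {
			oldPod.running = false
			oldPod.leader = false
		}
		// The lease moves only once the old leader is gone.
		if !oldPod.running {
			newPod.leader = true
		}
	}
	return newPod.ready()
}

func main() {
	fmt.Println("readyz probe, leaderElection=true: ", simulateRollout(true, true, 10))
	fmt.Println("readyz probe, leaderElection=false:", simulateRollout(false, true, 10))
	fmt.Println("metrics probe, leaderElection=true:", simulateRollout(true, false, 10))
}
```

In this model only the first case never becomes ready: readiness waits on the certificate, the certificate waits on the lease, and the lease waits on the old pod being terminated, which itself waits on readiness.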

Steps to reproduce the issue

Deploy the Helm chart with replicaset: 2 and leaderElection enabled, then check whether both pods become ready:

helm install --set replicaset=2 -n cluster-registry cluster-registry deploy/charts/cluster-registry

Expected behavior

The readiness probe should report ready (with the webhook CA bundle in place) before leader election.

Screenshots

The pods remain in this state until cluster-registry-controller-controller-b8b499b68-llg4r is killed, at which point leadership is transferred to cluster-registry-controller-controller-b8b499b68-nkvcp.

Additional context

This also fails when there are two v0.2.7 cluster registries: one pod waits for the other to hand over the lease. Sorry, I missed checking this before committing to cluster-registry.

Quick solution

With leaderElection: false, the issue above is not seen.

BEvgeniyS commented 1 year ago

Hey @nishantapatil3, were you able to solve the issue? I'm on the v0.2.10 and the issue seems to still persist

nishantapatil3 commented 1 year ago

> Hey @nishantapatil3, were you able to solve the issue? I'm on the v0.2.10 and the issue seems to still persist

I wasn't able to solve this issue. It might require restructuring how the certificate is injected into the webhook so that it happens before the manager starts.

cisco-open / cluster-registry-controller

leaderElection issue on `v0.2.7` #36
