etcd-druid version: v0.23.x
Thanks to @shafeeqes, we were able to find the reason for the following behavior in etcd-druid v0.23.x. We also saw this during g/g e2e test runs but always treated it as a flaky test, since the tests passed on subsequent attempts. The problem was masked because the tests also failed intermittently for non-etcd reasons, which made the e2e tests quite flaky.
How to categorize this issue?
/area control-plane /kind bug
What happened:
A new `Etcd` resource is created. Since `etcd-reconciler` is watching for `Etcd` events, it gets a `Create` event. This event is allowed in. During the reconciliation loop an attempt is made to get the resource: https://github.com/gardener/etcd-druid/blob/df3ff21d8cd9d3d785309223f543d72518dbeed2/internal/controller/etcd/reconciler.go#L135-L137
It is possible that the informer caches are not yet updated, in which case `client.Get` returns a `NotFound` error. This results in the following: https://github.com/gardener/etcd-druid/blob/df3ff21d8cd9d3d785309223f543d72518dbeed2/internal/controller/utils/reconciler.go#L42-L44
The reconciler is short-circuited and no further processing is done.
The default cache resync period is 10hrs, but in the case of gardener, it reconciles again and with every reconcile it adds the reconcile annotation to the `Etcd` resource.
See here.
This will generate another event much sooner than the default cache resync period of 10hrs, giving etcd-druid another chance to reconcile the resource. However, this event gets filtered out and is not processed. See: https://github.com/gardener/etcd-druid/blob/df3ff21d8cd9d3d785309223f543d72518dbeed2/internal/controller/etcd/register.go#L53-L75
`r.hasReconcileAnnotation()` is true since gardener adds the reconcile annotation.
`specUpdated()` is false as there is no change to the spec in this event.
`lastReconcileHasFinished()` is false since the first time around the event was not processed so no status is present yet.
`r.autoReconcileEnabled()` is false as it's not auto reconciled.
As a consequence, the `onReconcileAnnotationSetPredicate` predicate will evaluate to false and the `autoReconcileOnSpecChangePredicate` predicate will evaluate to false, thus rejecting the event.
The result is that for a long time after the `Etcd` resource is created, it does not get reconciled. This is time sensitive and it all depends upon how fast the informer cache is updated or how late the create event arrives and if the first create event gets processed.
What you expected to happen:
The predicate should be improved to allow subsequent update events even if the spec has not changed, especially when there is no status (indicating that the resource was never reconciled). For the gardener use case an update event will be received much sooner, but we also need to solve this for non-gardener use cases where we depend on cache.SyncPeriod, which is set to 10hrs by default.
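To make the expected change concrete, here is a minimal sketch in plain Go that models the predicate inputs as booleans. The way the four conditions are combined in `currentPredicate` is an assumption inferred from the outcome described above (the actual composition is in the linked `register.go`), and `updateEvent`, `hasStatus` and `proposedPredicate` are hypothetical names used only for illustration, not etcd-druid APIs.

```go
package main

import "fmt"

// updateEvent is a hypothetical, simplified view of what the predicates inspect.
type updateEvent struct {
	hasReconcileAnnotation bool
	specUpdated            bool
	lastReconcileFinished  bool
	autoReconcileEnabled   bool
	hasStatus              bool // hypothetical: false means the resource was never reconciled
}

// currentPredicate models the filtering described in this issue.
// ASSUMPTION: the And/Or composition below is inferred from the issue text.
func currentPredicate(e updateEvent) bool {
	onReconcileAnnotationSet := e.hasReconcileAnnotation &&
		(e.lastReconcileFinished || e.specUpdated)
	autoReconcileOnSpecChange := e.autoReconcileEnabled && e.specUpdated
	return onReconcileAnnotationSet || autoReconcileOnSpecChange
}

// proposedPredicate is one possible shape of the fix: additionally admit
// update events for resources that have no status yet.
func proposedPredicate(e updateEvent) bool {
	neverReconciled := !e.hasStatus
	return neverReconciled || currentPredicate(e)
}

func main() {
	// The update event gardener triggers after the missed Create event:
	// annotation present, no spec change, no status, auto-reconcile disabled.
	e := updateEvent{hasReconcileAnnotation: true}
	fmt.Println("current:", currentPredicate(e))   // prints "current: false"
	fmt.Println("proposed:", proposedPredicate(e)) // prints "proposed: true"
}
```

In this model the only behavioral difference is for resources without a status; events for already reconciled resources are filtered exactly as before.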
How to reproduce it (as minimally and precisely as possible):
It is not always possible to reproduce. Create multiple etcd clusters via local gardener; for one or more of them you will see that the `Etcd` resource is not reconciled at first and only gets reconciled after a long time.
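Since the race is timing-dependent, a deterministic model can help when reasoning about it. The sketch below simulates the sequence from "What happened" in plain Go: a Create event is handled before the cache contains the object, the lookup fails with a not-found error, and the reconciler stops without retrying. All names here (`staleCache`, `errNotFound`, `reconcile`) are hypothetical stand-ins, and the "no requeue on not-found" behavior is an assumption based on the description above, not a copy of the etcd-druid code.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a stand-in for the NotFound error returned by client.Get.
var errNotFound = errors.New("not found")

// staleCache is a hypothetical stand-in for the informer cache.
type staleCache struct {
	objects map[string]bool
}

// get mimics a cached read: it fails if the cache has not seen the object yet.
func (c *staleCache) get(name string) error {
	if !c.objects[name] {
		return errNotFound
	}
	return nil
}

// reconcile reports whether the resource was processed and whether the
// request would be retried.
// ASSUMPTION: a not-found error ends the reconcile without a requeue.
func reconcile(c *staleCache, name string) (processed, requeue bool) {
	if err := c.get(name); err != nil {
		if errors.Is(err, errNotFound) {
			return false, false // short-circuit: nothing processed, no retry
		}
		return false, true // any other error would be retried
	}
	return true, false
}

func main() {
	cache := &staleCache{objects: map[string]bool{}}

	// The Create event is handled before the cache is updated.
	processed, requeue := reconcile(cache, "etcd-main")
	fmt.Println(processed, requeue) // prints "false false"

	// Once the cache has caught up, the same request would succeed,
	// but only if another event is let through by the predicates.
	cache.objects["etcd-main"] = true
	processed, requeue = reconcile(cache, "etcd-main")
	fmt.Println(processed, requeue) // prints "true false"
}
```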