gardener / etcd-druid

An etcd operator to configure, provision, reconcile and monitor etcd clusters.
https://gardener.github.io/etcd-druid/
Apache License 2.0

Slow refresh of informer cache results in delayed processing of Etcd resources #898

Open unmarshall opened 1 month ago

unmarshall commented 1 month ago

How to categorize this issue?

/area control-plane /kind bug

What happened:

A new Etcd resource is created. Since the etcd-reconciler watches for Etcd events, it receives a Create event, and this event is allowed in. During the reconciliation loop an attempt is made to get the resource: https://github.com/gardener/etcd-druid/blob/df3ff21d8cd9d3d785309223f543d72518dbeed2/internal/controller/etcd/reconciler.go#L135-L137

It is possible that the informer caches are not yet updated, in which case client.Get returns a NotFound error. This results in the following: https://github.com/gardener/etcd-druid/blob/df3ff21d8cd9d3d785309223f543d72518dbeed2/internal/controller/utils/reconciler.go#L42-L44

The reconciler is short-circuited and no further processing is done.
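The race can be illustrated with a minimal, standard-library-only sketch. It is not the etcd-druid code: `staleCache`, `result` and `reconcile` are hypothetical stand-ins for the informer cache, the reconcile result and the reconciler, and the only behavior taken from the description above is that a NotFound from the cached read ends the run without a requeue.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the NotFound error returned by a cache-backed client.
var errNotFound = errors.New("not found")

// staleCache is a hypothetical stand-in for the informer cache:
// it only knows the object keys it has observed so far.
type staleCache map[string]bool

// get mimics a cached read: it fails with errNotFound until the cache has caught up.
func (c staleCache) get(key string) error {
	if !c[key] {
		return errNotFound
	}
	return nil
}

// result is a simplified reconcile outcome (hypothetical type).
type result struct {
	Reconciled bool
	Requeue    bool
}

// reconcile models the behavior described in this issue:
// a NotFound from the cache short-circuits the run and does not requeue.
func reconcile(c staleCache, key string) result {
	if err := c.get(key); errors.Is(err, errNotFound) {
		return result{Reconciled: false, Requeue: false}
	}
	return result{Reconciled: true}
}

func main() {
	cache := staleCache{}
	key := "default/etcd-main" // hypothetical namespace/name

	// The Create event is processed before the cache has seen the object.
	fmt.Println(reconcile(cache, key))

	// The cache catches up later, but this second call only happens
	// if some later event is admitted by the predicates.
	cache[key] = true
	fmt.Println(reconcile(cache, key))
}
```

In this model the first call does nothing and schedules no retry, so the resource stays unreconciled until another event passes the predicates.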

The default cache resync period is 10 hours. Gardener, however, reconciles the Etcd resource again sooner, and with every reconcile it sets the following annotations:

metav1.SetMetaDataAnnotation(&e.etcd.ObjectMeta, v1beta1constants.GardenerOperation, v1beta1constants.GardenerOperationReconcile)
metav1.SetMetaDataAnnotation(&e.etcd.ObjectMeta, v1beta1constants.GardenerTimestamp, TimeNow().UTC().Format(time.RFC3339Nano))

See here.

This generates another event much sooner than the default cache resync period of 10 hours, giving etcd-druid another chance to reconcile the resource. However, this event gets filtered out and is not processed. See: https://github.com/gardener/etcd-druid/blob/df3ff21d8cd9d3d785309223f543d72518dbeed2/internal/controller/etcd/register.go#L53-L75

As a consequence, both onReconcileAnnotationSetPredicate and autoReconcileOnSpecChangePredicate evaluate to false, so the event is rejected.

The result is that the Etcd resource does not get reconciled for a long time after it is created. The behavior is timing-dependent: it depends on how quickly the informer cache is updated, how late the Create event arrives, and whether that first Create event gets processed.

What you expected to happen:

The predicate should be improved to allow subsequent update events even if the spec has not changed, especially when there is no status (indicating that the resource has never been reconciled). For the gardener use case an update event will be received much sooner, but we also need to solve this for non-gardener use cases, where we depend on cache.SyncPeriod, which is set to 10 hours by default.
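One possible shape of such a predicate is sketched below using only the standard library. Everything in it is an assumption made for illustration: `etcdView` is a hypothetical trimmed-down view of the resource, a nil `ObservedGeneration` is used as the "no status" signal, and a spec change is approximated by a change in `Generation`. It does not reproduce the predicates in register.go.

```go
package main

import "fmt"

// etcdView is a hypothetical, trimmed-down view of an Etcd resource.
type etcdView struct {
	Generation         int64  // assumed to change only when the spec changes
	ObservedGeneration *int64 // assumed to stay nil until the first successful reconcile
}

// specChanged approximates "the spec was updated" by comparing generations.
func specChanged(oldObj, newObj etcdView) bool {
	return oldObj.Generation != newObj.Generation
}

// neverReconciled treats a missing observed generation as "no status yet".
func neverReconciled(e etcdView) bool {
	return e.ObservedGeneration == nil
}

// admitUpdate is the sketched improvement: let an update event through
// when the spec changed OR the resource has never been reconciled.
func admitUpdate(oldObj, newObj etcdView) bool {
	return specChanged(oldObj, newObj) || neverReconciled(newObj)
}

func main() {
	created := etcdView{Generation: 1}
	annotated := etcdView{Generation: 1} // only annotations changed, spec untouched

	fmt.Println("spec changed:", specChanged(created, annotated))
	fmt.Println("admitted:", admitUpdate(created, annotated))
}
```

With this rule an annotation-only update on a never-reconciled resource is admitted, while the same update on a resource that already has an observed generation is still filtered out unless another condition applies.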

How to reproduce it (as minimally and precisely as possible):

It is not always possible to reproduce. Create multiple etcd clusters via a local gardener setup; for one or more of them you will see that the Etcd resource does not get reconciled at first and is only reconciled after a long time.

Thanks to @shafeeqes, we were able to find the reason for this behavior in v0.23.x of etcd-druid. We also saw this during g/g e2e test runs but always dismissed it as a flaky test, since the tests passed on subsequent attempts. It was masked because the tests also failed intermittently for non-etcd reasons, making the e2e tests quite flaky.

unmarshall commented 1 month ago

This issue was first observed in g/g e2e tests. See issue: https://github.com/gardener/gardener/issues/10739