Closed timebertt closed 3 years ago
I think a possible solution, or at least a good first step, would be to use a context with a timeout for each reconciliation (e.g. 1m). This way, the WaitForCacheSync funcs will return false and the key will be marked done in the queue, so it can be reconciled again.
/assign
How to categorize this issue?
/area robustness
/kind bug
/priority normal
What happened:
We have observed some situations where grm gets stuck reconciling a specific managed resource and does not act upon it anymore. In all cases I observed, it happened either in conjunction with a longer period of downtime of the source or target API server (before #95) or with a large amount of secret data in the target cluster (as described in #92).
What you expected to happen:
grm should not get stuck and reconcile all managed resources with the given sync interval.
How to reproduce it (as minimally and precisely as possible):
Not sure yet. My guess is that the worker goroutines get stuck in some WaitForCacheSync call when the API server is unavailable for a longer period of time or the amount of watched data is too big.
Anything else we need to know?:
Environment:
kubectl version
):