Worker routines get stuck

timebertt commented 3 years ago

How to categorize this issue?

/area robustness /kind bug /priority normal

What happened:

We have observed some situations where grm gets stuck reconciling a specific managed resource and does not act upon it anymore. In all cases I observed, this happened either in conjunction with a longer period of downtime of the source or target API server (before #95) or with a large amount of secret data in the target cluster (as described in #92).

What you expected to happen:

grm should not get stuck and should reconcile all managed resources at the given sync interval.

How to reproduce it (as minimally and precisely as possible):

Not sure yet. My guess would be that the worker goroutines get stuck in some WaitForCacheSync when the API server is unavailable for a longer period of time or the amount of watched data is too big.

Anything else we need to know?:

Environment:

Gardener-Resource-Manager version: v0.20.0
Kubernetes version (use kubectl version):
Cloud provider or hardware configuration:
Others:

timebertt commented 3 years ago

I think a possible solution, or at least a good first step, would be to use a context with a timeout for each reconciliation (e.g. 1m). This way, the WaitForCacheSync funcs will return false and the key will be marked done in the queue, so it can be reconciled again.

rfranzke commented 3 years ago

/assign

gardener-attic / gardener-resource-manager

Worker routines get stuck #99