ServiceWeaver / weaver-kube

Run Service Weaver applications on vanilla Kubernetes.

fix retry watching logic #110

Closed by rgrandl 3 months ago

rgrandl commented 3 months ago

I think what we should do is as follows (see the sketch after this list):

1. Create a retry watcher. The retry watcher takes care of error handling for us. The alternative is to add another for loop and create a basic watch, but then we would have to handle particular scenarios ourselves, such as tracking the ResourceVersion whenever we create a new watcher. The retry watcher is designed to handle all kinds of errors, so this is the recommended approach.
2. We try to create the retry watcher with a few retries. If we don't succeed, we should just panic, because there is no point in trying to watch pods and the application will not work properly.
3. Watch errors (e.g., when the read from the result channel is not OK) are handled automatically by the retry watcher.
4. We should only return an error to the caller if the context is cancelled.
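For reference, here is a minimal sketch of this flow using client-go's `RetryWatcher`. The function name `watchPods`, the retry count, and the label-selector plumbing are assumptions for illustration, not the actual weaver-kube code; the sketch only shows the shape of steps 1-4 above.

```go
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	watchtools "k8s.io/client-go/tools/watch"
)

// watchPods watches pods matching selector using a RetryWatcher, which
// re-establishes the underlying watch and tracks the ResourceVersion for us.
func watchPods(ctx context.Context, clientset *kubernetes.Clientset, namespace, selector string) error {
	// (2) Try a few times to create the retry watcher; panic if we can't,
	// since the application cannot work without watching pods.
	var rw *watchtools.RetryWatcher
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		// List once to obtain a starting resource version.
		pods, listErr := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
		if listErr != nil {
			err = listErr
			time.Sleep(time.Second)
			continue
		}
		rw, err = watchtools.NewRetryWatcher(pods.ResourceVersion, &cache.ListWatch{
			WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
				options.LabelSelector = selector
				return clientset.CoreV1().Pods(namespace).Watch(ctx, options)
			},
		})
		if err == nil {
			break
		}
		time.Sleep(time.Second)
	}
	if rw == nil {
		panic(fmt.Sprintf("cannot create retry watcher: %v", err))
	}
	defer rw.Stop()

	for {
		select {
		case <-ctx.Done():
			// (4) Only a cancelled context is reported to the caller.
			return ctx.Err()
		case ev, ok := <-rw.ResultChan():
			if !ok {
				// (3) Transient watch errors are retried internally; a closed
				// channel means the retry watcher has stopped for good.
				return fmt.Errorf("watch channel closed")
			}
			if pod, isPod := ev.Object.(*corev1.Pod); isPod {
				fmt.Printf("event %s for pod %s\n", ev.Type, pod.Name)
			}
		}
	}
}
```

The key point is that `RetryWatcher` re-establishes the watch from the last seen resource version on its own, which is why the extra for loop around a basic watch (and the manual ResourceVersion bookkeeping that comes with it) is not needed.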

Tested and it works: Collatz with many replicas of main, even, and odd spread across 3 nodes.

1. Restarted each deployment (even, then odd, then main). The watcher always recovered. In the worst case a request hung for a few seconds (which makes sense when the entire main deployment is restarted).
2. Drained one node at a time, so all the pods on that node were evicted. The watcher recovered, and some requests hung for a few seconds at most.
3. Randomly deleted 50% of the pods for each of the 3 deployments (main, collatz, and even) at the same time. The watcher recovered, and some requests hung for a few seconds at most.