Watch pod changes with a RetryWatcher, so we can handle recoverable errors

ServiceWeaver / weaver-kube

Run Service Weaver applications on vanilla Kubernetes.

Apache License 2.0

Great finding, @rwrz. To recap what @rwrz discovered:

By default, the watcher stops watching after 30 minutes. The current implementation does not handle the error event that is triggered when the watcher stops. As a result, later pod changes are never reflected in the routing info, and the application's requests hang forever.

To fix this, we need to recreate the watcher on errors (or after the timeout). RetryWatcher already does that, so this is the solution we'll adopt.

I tested this as follows:

Deploy a version of the app with a watcher timeout of 3 minutes.
Restart the deployments for non-main services before the 3 minute mark; the requests still go through.
Restart the deployments after the 3 minute mark (after the watcher terminates); no requests are served.

I then added a RetryWatcher and repeated the experiment, and everything worked as expected.

Watch pod changes with a RetryWatcher, so we can handle recoverable errors #108