ServiceWeaver / weaver-kube

Run Service Weaver applications on vanilla Kubernetes.
Apache License 2.0

Watch pod changes with a RetryWatcher, so we can handle recoverable errors #108

Closed rwrz closed 3 months ago

rwrz commented 3 months ago

The current implementation, which uses the CoreV1 Watch API, does not handle any errors. K8s provides a RetryWatcher that can handle recoverable errors for us.

We are using it because, after the cluster has been running for a while, the watcher stops working without logging anything. So I'm adding more logs, so we can see what is happening in production, and I'm also starting to listen for watcher errors that are not recoverable. For now, these are just logged.

Most of the examples I've found of this RetryWatcher use a "timeout", so they keep re-creating the watcher all the time. So far, in our experience, we don't need it. But I'm keeping it in the code so we can discuss it.

rgrandl commented 3 months ago

Great finding, @rwrz. To recap what @rwrz discovered:

By default, the watcher stops watching after 30 minutes. The current implementation doesn't handle errors (the event triggered when the watcher stops watching), so any later changes in the pods are never reflected in the routing info, and the application's requests hang forever.

To fix this, we need to recreate the watcher on errors (or after the timeout). RetryWatcher already does that, so this is the solution we'll adopt.

I tested by:

Adding a RetryWatcher and repeating the experiment; everything works fine.