I think what we should do is as follows:
1) Create a retry watcher. The retry watcher takes care of error
handling for us. The alternative is to add another for loop and
create a basic watch, but then we would have to track the
ResourceVersion ourselves every time we create a new watcher. The
retry watcher is designed to handle all kinds of errors, so this is
the recommended way to do it.
2) We try to create the retry watcher with a few retries. If we
don't succeed, we should just panic: there is no point in continuing
if we cannot watch pods, and the application will not work properly.
3) Watch errors (i.e. when reading from the result channel is not OK)
are handled automatically by the retry watcher.
4) We should only return an error to the caller if the context is
cancelled.
Tested and it works:
Collatz with many replicas of main, even, and odd spread across 3 nodes.
1) Restarted each deployment (even, then odd, then main). The watcher
always recovered. In the worst case a request hung for a few
seconds (which makes sense when the entire main deployment is
restarted)
2) Drained one node at a time, so all the pods on that node were evicted.
The watcher recovered, and some requests hung for a few
seconds at most
3) Randomly deleted 50% of the pods of each of the 3 deployments (main,
collatz, and even) at the same time. The watcher recovered, and
some requests hung for a few seconds at most