janotav opened 1 week ago
Is this still happening with the latest version of the client?
We've seen that watchers might become stale or dead from the kube-api side. This means that events are no longer emitted but the connection remains open.
Both client-go and the Fabric8 Kubernetes Client have mechanisms in place to avoid this in informers by adding an artificial timeout to all watcher connections:
You can achieve the same result by using ListOptionsBuilder().withTimeoutSeconds(...) when opening your watcher. You should then treat the timeout as expected, recover from it, and restart the watcher. This ensures that your watchers are always listening despite any networking or cluster problems.
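A minimal sketch of this timeout-and-restart pattern, assuming the Fabric8 6.x API (the Pod resource type, the "default" namespace, and the 300-second value are illustrative; a production version should also add backoff between restarts):

```java
// Sketch only: a bounded watch that is re-established whenever the server
// (or a network failure) closes it. Class and method names follow the
// Fabric8 Kubernetes Client 6.x API.
import io.fabric8.kubernetes.api.model.ListOptionsBuilder;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

public class BoundedWatch {
    // client-go-style default; pick a value that fits your environment
    static final long TIMEOUT_SECONDS = 300L;

    static void startWatch(KubernetesClient client) {
        client.pods().inNamespace("default").watch(
            new ListOptionsBuilder().withTimeoutSeconds(TIMEOUT_SECONDS).build(),
            new Watcher<Pod>() {
                @Override
                public void eventReceived(Action action, Pod pod) {
                    // handle ADDED / MODIFIED / DELETED events here
                }

                @Override
                public void onClose(WatcherException cause) {
                    // Both the server-side timeout and network failures end up
                    // here; re-establish the watch instead of giving up.
                    // Real code should add a backoff delay before restarting.
                    startWatch(client);
                }
            });
    }
}
```

The point is that the close is no longer an error condition to be avoided but the normal end of each watch cycle, and the restart path is exercised continuously rather than only during incidents.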
The issue doesn't occur on local/testing environment, so I can't easily verify that it doesn't occur with latest version.
Can you please explain what this ListOptions.timeoutSeconds effectively does? When I tested this locally with a very low value (5 seconds), the watch appeared to process events well beyond this interval. What kind of timeout does it impose?
That shouldn't be the case; the kube-api server should actually close the connection upstream after the given timeout.
You can check the Kubernetes API reference for more information:
https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.29/#watch-pod-v1-core
Query Parameters

timeoutSeconds: Timeout for the list/watch call. This limits the duration of the call, regardless of any activity or inactivity.
When invoking the watch through curl, it behaves as described. I will need to figure out why I am seemingly seeing different behavior through the library.
Is there any performance penalty related to this switch from "infinite" watches to periodically restarted ones? Clearly, to make this mechanism responsive, the timeout needs to be sufficiently short, and I wonder whether too short a timeout could back-fire due to the server having to process frequently repeated watch calls.
AFAIR client-go establishes a default timeout of 5 minutes. In addition, you'd want to add some random jitter time (especially for cases as you describe).
If you're keeping a local cache of sorts, then you might also want to consider bookmarks and resource versions in addition to their handling to reduce the stress (both locally and to the API server).
https://kubernetes.io/docs/reference/using-api/api-concepts/#watch-bookmarks https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/956-watch-bookmark/README.md
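A minimal sketch of such a jittered timeout in plain Java (the 300-second base and 20% jitter fraction below are illustrative, not values mandated by client-go or Fabric8):

```java
import java.util.concurrent.ThreadLocalRandom;

public class WatchTimeouts {

    /**
     * Returns the base timeout plus a random jitter of up to
     * jitterFraction of the base, so that many watchers started together
     * do not all expire and reconnect at the same instant.
     */
    static long jitteredTimeoutSeconds(long baseSeconds, double jitterFraction) {
        long maxJitter = (long) (baseSeconds * jitterFraction);
        return baseSeconds + ThreadLocalRandom.current().nextLong(maxJitter + 1);
    }

    public static void main(String[] args) {
        // Each restart draws a fresh value, e.g. somewhere in [300, 360]
        System.out.println(jitteredTimeoutSeconds(300L, 0.2));
    }
}
```

The resulting value would be passed to withTimeoutSeconds(...) on each restart; if you also track the last seen resourceVersion (including from BOOKMARK events), you can resume the watch from that version so the API server does not have to replay history.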
Describe the bug
We use watches extensively, and recently we have seen an increase in incidents where the CRD client watch(Watcher) call never completes. While I suspect that the root cause may be some kind of networking problem that prevents the call from succeeding, the fact that the client now sets infinite timeouts (https://github.com/fabric8io/kubernetes-client/pull/5206) means that the client never recovers from this condition.
Below is the thread stack when the freeze occurs.
Fabric8 Kubernetes Client version
6.7.2
Steps to reproduce
There are no reliable steps to reproduce. It happens from time to time and appears to correlate with networking issues in the environment.
Expected behavior
The client should not freeze. While I understand that watches are expected to be long-lived, I would expect some kind of heartbeat mechanism to be in place to ensure that the watch call completes (possibly with an error).
Runtime
Kubernetes (vanilla)
Kubernetes API Server version
1.23
Environment
Linux
Fabric8 Kubernetes Client Logs
No response
Additional context
No response