fabric8io / kubernetes-client

Java client for Kubernetes & OpenShift
http://fabric8.io
Apache License 2.0

Kubernetes client watch(Watcher) call never finishes #6071

Open janotav opened 1 week ago

janotav commented 1 week ago

Describe the bug

We use watches extensively, and we have recently seen an increase in incidents where the CRD client watch(Watcher) call never completes.

While I suspect that the root cause may be some kind of networking problem that prevents the call from succeeding, the fact that the client now sets infinite timeouts (https://github.com/fabric8io/kubernetes-client/pull/5206) means that the client never recovers from this condition.

Below is the thread stack when the freeze occurs.

"Thread-163" #237803 daemon prio=5 os_prio=0 cpu=0.95ms elapsed=23420.42s tid=0x00007f5f04004f40 nid=0x295e waiting on condition  [0x00007f5f206ad000]
   java.lang.Thread.State: WAITING (parking)
        at jdk.internal.misc.Unsafe.park(java.base@17.0.10/Native Method)
        - parking to wait for  <0x00000000d9de8f88> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.park(java.base@17.0.10/LockSupport.java:211)
        at java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.10/CompletableFuture.java:1864)
        at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@17.0.10/ForkJoinPool.java:3465)
        at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.10/ForkJoinPool.java:3436)
        at java.util.concurrent.CompletableFuture.waitingGet(java.base@17.0.10/CompletableFuture.java:1898)
        at java.util.concurrent.CompletableFuture.get(java.base@17.0.10/CompletableFuture.java:2072)
        at io.fabric8.kubernetes.client.utils.Utils.waitUntilReady(Utils.java:162)
        at io.fabric8.kubernetes.client.utils.Utils.waitUntilReadyOrFail(Utils.java:185)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.watch(BaseOperation.java:611)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.watch(BaseOperation.java:598)
        at oracle.aiapps.routing.discovery.idcs.IDCSEventDiscovery.doStart(IDCSEventDiscovery.java:61)
        at oracle.aiapps.routing.discovery.AbstractEventDiscovery.restart(AbstractEventDiscovery.java:103)
        at oracle.aiapps.routing.discovery.AbstractEventDiscovery.lambda$throttledStart$0(AbstractEventDiscovery.java:68)
        at oracle.aiapps.routing.discovery.AbstractEventDiscovery$$Lambda$1614/0x00007f5fb92dda08.run(Unknown Source)
        at java.lang.Thread.run(java.base@17.0.10/Thread.java:842)

Fabric8 Kubernetes Client version

6.7.2

Steps to reproduce

There are no reliable steps to reproduce. It happens from time to time and appears to correlate with networking issues in the environment.

Expected behavior

The client should not freeze. While I understand that watches are expected to be long-lived, I would expect some kind of heartbeat mechanism to be in place to ensure that the watch call completes (possibly with an error).

Runtime

Kubernetes (vanilla)

Kubernetes API Server version

1.23

Environment

Linux

Fabric8 Kubernetes Client Logs

No response

Additional context

No response

manusa commented 4 days ago

Is this still happening with the latest version of the client?

We've seen that watchers might become stale or dead from the kube-api side. This means that events are no longer emitted but the connection remains open.

Both client-go and the Fabric8 Kubernetes Client have mechanisms in place to avoid this in informers by adding an artificial timeout to all watcher connections:

https://github.com/kubernetes/client-go/blob/b03e5b8438ce5abf36bac817490639abfbcd0441/tools/cache/reflector.go#L430-L432

https://github.com/fabric8io/kubernetes-client/blob/c470c55fe31fbab169dead5461d0e293a0419409/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/informers/impl/cache/Reflector.java#L222-L229

You can achieve the same result by passing new ListOptionsBuilder().withTimeoutSeconds(...).build() when opening your watcher. You should then self-recover from the timeout and restart the watcher. This ensures that your watchers keep listening despite any networking or cluster problems.
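For illustration, a minimal sketch of a watch that self-restarts when the server-side timeout closes the connection. The Pod resource, the default namespace, and the 300-second timeout are placeholders, not values from this thread:

```java
import io.fabric8.kubernetes.api.model.ListOptions;
import io.fabric8.kubernetes.api.model.ListOptionsBuilder;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientBuilder;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

public class TimedWatchExample {

  public static void main(String[] args) throws InterruptedException {
    try (KubernetesClient client = new KubernetesClientBuilder().build()) {
      startWatch(client);
      Thread.sleep(Long.MAX_VALUE); // keep the demo alive
    }
  }

  static void startWatch(KubernetesClient client) {
    // Ask the API server to close the watch after ~5 minutes, similar to the
    // artificial timeout the informer/reflector code adds to its watches.
    ListOptions options = new ListOptionsBuilder()
        .withTimeoutSeconds(300L)
        .build();

    client.pods().inNamespace("default").watch(options, new Watcher<Pod>() {
      @Override
      public void eventReceived(Action action, Pod pod) {
        System.out.println(action + " " + pod.getMetadata().getName());
      }

      @Override
      public void onClose() {
        // Graceful close (e.g. the server-side timeout expired): reconnect.
        startWatch(client);
      }

      @Override
      public void onClose(WatcherException cause) {
        // Abnormal close: reconnect as well, ideally with back-off/jitter.
        startWatch(client);
      }
    });
  }
}
```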

janotav commented 3 days ago

The issue doesn't occur in our local/testing environment, so I can't easily verify that it doesn't occur with the latest version.

Can you please explain what ListOptions.timeoutSeconds effectively does? When I tested this locally with a very low value (5 seconds), the watch appeared to process events well beyond this interval. What kind of timeout does it impose?

manusa commented 3 days ago

Can you please explain what ListOptions.timeoutSeconds effectively does? When I tested this locally with a very low value (5 seconds), the watch appeared to process events well beyond this interval. What kind of timeout does it impose?

That shouldn't be the case; the kube-api server should close the connection upstream after the given timeout.

You can check the Kubernetes API reference for more information:

https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.29/#watch-pod-v1-core

Query Parameters

timeoutSeconds: Timeout for the list/watch call. This limits the duration of the call, regardless of any activity or inactivity.

janotav commented 3 days ago

When invoking the watch through curl, it behaves as described. I will need to figure out why I seem to see different behavior through the library.

Is there any performance penalty related to this switch from "infinite" watches to periodically restarted ones? Clearly, to make this mechanism responsive, the timeout needs to be sufficiently short, and I wonder whether too short a timeout could back-fire because the server has to repeatedly process the restarted watch calls.

manusa commented 3 days ago

AFAIR client-go establishes a default timeout of 5 minutes. In addition, you'd want to add some random jitter (especially for cases like the one you describe).
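As an illustration of the jitter idea, a small sketch (not an existing client API) that spreads the timeout between 1x and 2x a 5-minute base, loosely following what client-go's reflector does:

```java
import io.fabric8.kubernetes.api.model.ListOptions;
import io.fabric8.kubernetes.api.model.ListOptionsBuilder;

import java.util.concurrent.ThreadLocalRandom;

public class JitteredWatchTimeout {

  // Base timeout; 5 minutes is the client-go reflector default mentioned above.
  static final long MIN_WATCH_TIMEOUT_SECONDS = 300L;

  // Picks a timeout uniformly between 1x and 2x the base, so that many
  // watchers opened at the same time do not all reconnect at the same instant.
  static ListOptions jitteredWatchOptions() {
    long timeoutSeconds = (long) (MIN_WATCH_TIMEOUT_SECONDS
        * (1.0 + ThreadLocalRandom.current().nextDouble()));
    return new ListOptionsBuilder()
        .withTimeoutSeconds(timeoutSeconds)
        .build();
  }
}
```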

If you're keeping a local cache of sorts, then you might also want to handle bookmarks and resource versions when restarting the watch, to reduce the load both locally and on the API server (see the sketch after the links below).

https://kubernetes.io/docs/reference/using-api/api-concepts/#watch-bookmarks
https://github.com/kubernetes/enhancements/blob/master/keps/sig-api-machinery/956-watch-bookmark/README.md
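A sketch of what bookmark handling might look like, again assuming a Pod watch in the default namespace: tracking the resourceVersion from events (including BOOKMARK) lets the restarted watch resume where it left off instead of starting from "now". Handling of a stale resourceVersion (410 Gone, which requires a relist) is omitted here:

```java
import io.fabric8.kubernetes.api.model.ListOptionsBuilder;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.kubernetes.client.WatcherException;

import java.util.concurrent.atomic.AtomicReference;

public class BookmarkAwareWatch {

  private final AtomicReference<String> lastResourceVersion = new AtomicReference<>();

  void start(KubernetesClient client) {
    client.pods().inNamespace("default").watch(
        new ListOptionsBuilder()
            .withAllowWatchBookmarks(true)                   // ask the server for BOOKMARK events
            .withResourceVersion(lastResourceVersion.get())  // null on first start = watch from "now"
            .withTimeoutSeconds(300L)
            .build(),
        new Watcher<Pod>() {
          @Override
          public void eventReceived(Action action, Pod pod) {
            // Every event, including BOOKMARK, carries a resourceVersion; remember it
            // so the next watch can resume without replaying or relisting everything.
            lastResourceVersion.set(pod.getMetadata().getResourceVersion());
            if (action != Action.BOOKMARK) {
              // handle ADDED / MODIFIED / DELETED here
            }
          }

          @Override
          public void onClose() {
            start(client); // timeout expired: resume from the recorded resourceVersion
          }

          @Override
          public void onClose(WatcherException cause) {
            start(client); // in practice add back-off and handle "410 Gone" by relisting
          }
        });
  }
}
```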