client.pods().inNamespace("x").list(); timeouts in version 6.6.0

MikaelUlvesjo commented 1 year ago

Describe the bug

I can not get list of pods in version 6.6.0 but it works in 6.5.0 and on 6.5.1 it happens on a few percent of the attempts. I have tried to set log level to trace to get more information but that do not add any additional information regarding what is happening.

This is the request that timeout: client.pods().inNamespace("x").list();

Kubernetes version not in list: Client Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.1", GitCommit:"e01e27ba1fcd49adae44ef0984abfc6ee999c99a", GitTreeState:"clean", BuildDate:"2023-03-13T18:01:50Z", GoVersion:"go1.19.5", Compiler:"gc", Platform:"linux/amd64"}

Fabric8 Kubernetes Client version

6.6.0

Steps to reproduce

create a client then run: client.pods().inNamespace("x").list();

Expected behavior

Get a PodList object.

Runtime

Kubernetes (vanilla)

Kubernetes API Server version

next (development version)

Environment

Linux

Fabric8 Kubernetes Client Logs

class io.fabric8.kubernetes.client.KubernetesClientException: An error has occurred.
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:129)
        at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:122)
        at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:530)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:424)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:392)
        at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:93)
        ...
    Caused By:
        class java.util.concurrent.TimeoutException: null
            at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960)
            at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095)
            at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:507)
            at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:424)
            at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:392)
            at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:93)
           ...

Additional context

The namespace is containing 104 pods. Pod Security Standards is enabled in the namespace at baseline level.

manusa commented 1 year ago

The TimeoutException is probably hiding something else (#5095)

Could you please try with the newly released v6.6.1?

MikaelUlvesjo commented 1 year ago

Same issue with version 6.6.1. Switched to 6.6.1 and it was not working, switched back to 6.5.0 it worked then back to 6.6.1 and still not working.

shawkins commented 1 year ago

What is your setting for the Config.requestTimeout? If you haven't explicitly set it then it should be 10 seconds. What about the retry backoff settings? This information and the a debug log from 6.6.1 would help us determine what is occurring in your case.

MikaelUlvesjo commented 1 year ago

on 6.6.1. backoff limit is set to 1 (to test this) backof intervall is 100, and actually it never timeout, I have been waiting now for 30 minutes but it is just stuck. I will enable trace logging on io.fabric8 level, is there any other package that have logs that can be helpfull?

MikaelUlvesjo commented 1 year ago

No logging at all with trace enabled for the client.pods().inNamespace("x").list(); command: This is the code that I run:

log.info(marker, "getPodList 1"); final var listNamespacedPod = client.pods().inNamespace("x").list(); log.info(marker, "getPodList 2");

The log only prints "getPodList 1" the it is quiet and I aborted it after 15 minutes. When I try the same with services client.services().inNamespace("x").list(); i get some log (anonymised logs):

2023-05-11 14:14:32,373 TRACE io.fabric8.kubernetes.client.http.HttpLoggingInterceptor -HTTP START- 2023-05-11 14:14:32,373 TRACE io.fabric8.kubernetes.client.http.HttpLoggingInterceptor > GET https://x.x.x.x:yyyy/api/v1/namespaces/x/services 2023-05-11 14:14:32,373 TRACE io.fabric8.kubernetes.client.http.HttpLoggingInterceptor > User-Agent: fabric8-kubernetes-client/6.6.1 2023-05-11 14:14:32,373 TRACE io.fabric8.kubernetes.client.http.HttpLoggingInterceptor < 200 OK 2023-05-11 14:14:32,373 TRACE io.fabric8.kubernetes.client.http.HttpLoggingInterceptor < audit-id: 94df3b11-7ced-4ef8-9b36-9f2a6465c50f 2023-05-11 14:14:32,373 TRACE io.fabric8.kubernetes.client.http.HttpLoggingInterceptor < cache-control: no-cache, private 2023-05-11 14:14:32,373 TRACE io.fabric8.kubernetes.client.http.HttpLoggingInterceptor < content-type: application/json 2023-05-11 14:14:32,373 TRACE io.fabric8.kubernetes.client.http.HttpLoggingInterceptor < date: Thu, 11 May 2023 12:14:32 GMT 2023-05-11 14:14:32,373 TRACE io.fabric8.kubernetes.client.http.HttpLoggingInterceptor < strict-transport-security: max-age=15724800; includeSubDomains 2023-05-11 14:14:32,374 TRACE io.fabric8.kubernetes.client.http.HttpLoggingInterceptor < x-kubernetes-pf-flowschema-uid: c8ff79e6-5b45-4706-b90d-b80342cd1cf2 2023-05-11 14:14:32,374 TRACE io.fabric8.kubernetes.client.http.HttpLoggingInterceptor < x-kubernetes-pf-prioritylevel-uid: ec0c7da2-b760-44a9-8cb2-05a91bee9222 2023-05-11 14:14:32,383 TRACE io.fabric8.kubernetes.client.http.HttpLoggingInterceptor {zzzz} 2023-05-11 14:14:32,385 TRACE io.fabric8.kubernetes.client.http.HttpLoggingInterceptor -HTTP END-

shawkins commented 1 year ago

The log only prints "getPodList 1" the it is quiet and I aborted it after 15 minutes.

For 6.6.1 what is your requestTimeout set to? At the very least I would have expected the request to timeout. If it did not presumably this means that the initial response has been received, but that the full response has not been processed. The logic is simply building up a byte[] of the response, so there's not really a reason why this won't complete.

What is your httpclient - there will probably be additional logger there that's helpful? If it's vertx and some other operation is blocking the io thread needed to process these results, then you'll see behavior like this. Can you provide or examine a thread dump to confirm if that is the case?

MikaelUlvesjo commented 1 year ago

OkHttp is the httpclient. requestTimeout is unchanged so it is default.

Here are the threads:

Thread list: org.eclipse.jdt.internal.junit.runner.RemoteTestRunner at localhost:44711
Thread [main] (Running) Thread [ReaderThread] (Running) Thread [Timer-0] (Running)
Daemon Thread [OkHttp TaskRunner] (Running)

Thread [main] (Suspended)
owns: Class (MyClass) (id=27)
Unsafe.park(boolean, long) line: not available [native method]
LockSupport.park(Object) line: 211
CompletableFuture$Signaller.block() line: 1864
ForkJoinPool.unmanagedBlock(ForkJoinPool$ManagedBlocker) line: 3463 ForkJoinPool.managedBlock(ForkJoinPool$ManagedBlocker) line: 3434
CompletableFuture.waitingGet(boolean) line: 1898 CompletableFuture.get() line: 2072
PodOperationsImpl(OperationSupport).waitForResult(CompletableFuture) line: 501
PodOperationsImpl(BaseOperation<T,L,R>).list(ListOptions) line: 424 PodOperationsImpl(BaseOperation<T,L,R>).list() line: 392
PodOperationsImpl(BaseOperation<T,L,R>).list() line: 93 ...

Thread [ReaderThread] (Suspended)
owns: InputStreamReader (id=249)
SocketDispatcher.read0(FileDescriptor, long, int) line: not available [native method]
SocketDispatcher.read(FileDescriptor, long, int) line: 47
NioSocketImpl.tryRead(FileDescriptor, byte[], int, int) line: 261
NioSocketImpl.implRead(byte[], int, int) line: 312
NioSocketImpl.read(byte[], int, int) line: 350
NioSocketImpl$1.read(byte[], int, int) line: 803
Socket$SocketInputStream.read(byte[], int, int) line: 966
StreamDecoder.readBytes() line: 270 StreamDecoder.implRead(char[], int, int) line: 313
StreamDecoder.read(char[], int, int) line: 188
InputStreamReader.read(char[], int, int) line: 177
BufferedReader.fill() line: 162 BufferedReader.readLine(boolean, boolean[]) line: 329
BufferedReader.readLine() line: 396 RemoteTestRunner$ReaderThread.run() line: 152

Thread [Timer-0] (Suspended)
waiting for: TaskQueue (id=264)
Object.wait(long) line: not available [native method]
TimerThread.mainLoop() line: 563
TimerThread.run() line: 516

Daemon Thread [OkHttp TaskRunner] (Suspended)
waiting for: TaskRunner (id=261)
Object.wait(long) line: not available [native method]
TaskRunner(Object).wait(long, int) line: 472
TaskRunner.kt line: 294 TaskRunner.kt line: 218 TaskRunner.kt line: 59
ThreadPoolExecutor.runWorker(ThreadPoolExecutor$Worker) line: 1136
ThreadPoolExecutor$Worker.run() line: 635
Thread.run() line: 833

Added trace logging for okhttp3 and com.squareup.okhttp3 but no more logs then before after pod list is called.

shawkins commented 1 year ago

Still no good indication of what you are seeing unfortunately. If there were a stuck thread you'd expect to see something in OkHttpClientImpl.doConsume - that is responsible for pull results from the okhttp source and delivering them downstream for building the byte[] that will get parsed as the response.

Other than trying a different httpclient to compare behavior this will take a reproducer or the use of debugger to investigate further.

MikaelUlvesjo commented 1 year ago

If I can send you an flight recording (jfr file) will that help? I have a file but since it contains sensitive information regarding our network I need to get an approval or find a way to anonymize the information before I can send it to you but I do not think I can upload it to github so if there is a way that we can talk/chat directly I might be able to share the information to you.

MikaelUlvesjo commented 1 year ago

I tested it now against a namespace containing a lot fewer pods and then it worked. When I manually get the json from https://x.x.x.x:yyyy/api/v1/namespaces/x/pods i get a file that is 6.8Mb in size and have 198440 lines, the one from the namespace that works is 1.5Mb and have 46472 lines. I checked them both and they are valid json files.

shawkins commented 1 year ago

I'm not aware of a size related limitation with building the result. I'll try a similar reproducer locally with okhttp and see if it shows the same behavior.

shawkins commented 1 year ago

Without additional factors, size isn't an issue locally. I can transfer/parse a list of 3,000,000 small configmaps totaling over 300MB without an problem. For even larger sizes I am seeing some thrashing when jackson is parsing the result - but in the thread dump I do see a thread doing that processing, which doesn't match your scenario.

If I can send you an flight recording (jfr file) will that help?

Probably. The next thing I would look at is what the ByteArrayBodyHandler is doing - is onBodyDone ever called for this request? If it has been, then your problem is happening downstream with parsing. If it has not, then the problem is with the transfer of the response.

MikaelUlvesjo commented 1 year ago

onBodyDone is never called.

This is the thread dump:

"main" #1 prio=5 os_prio=0 cpu=2758.14ms elapsed=248.09s tid=0x00007f2580026f70 nid=0x20cc waiting on condition [0x00007f258aa72000] java.lang.Thread.State: WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@17.0.6/Native Method)

parking to wait for <0x0000000625e095e0> (a java.util.concurrent.CompletableFuture$Signaller) at java.util.concurrent.locks.LockSupport.park(java.base@17.0.6/LockSupport.java:211) at java.util.concurrent.CompletableFuture$Signaller.block(java.base@17.0.6/CompletableFuture.java:1864) at java.util.concurrent.ForkJoinPool.unmanagedBlock(java.base@17.0.6/ForkJoinPool.java:3463) at java.util.concurrent.ForkJoinPool.managedBlock(java.base@17.0.6/ForkJoinPool.java:3434) at java.util.concurrent.CompletableFuture.waitingGet(java.base@17.0.6/CompletableFuture.java:1898) at java.util.concurrent.CompletableFuture.get(java.base@17.0.6/CompletableFuture.java:2072) at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:501) at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:424) at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:392) at io.fabric8.kubernetes.client.dsl.internal.BaseOperation.list(BaseOperation.java:93) at TestClass.testListPods(TestClass.java:19) ...

"Reference Handler" #2 daemon prio=10 os_prio=0 cpu=0.85ms elapsed=248.07s tid=0x00007f258013d290 nid=0x20d4 waiting on condition [0x00007f25492fc000] java.lang.Thread.State: RUNNABLE at java.lang.ref.Reference.waitForReferencePendingList(java.base@17.0.6/Native Method) at java.lang.ref.Reference.processPendingReferences(java.base@17.0.6/Reference.java:253) at java.lang.ref.Reference$ReferenceHandler.run(java.base@17.0.6/Reference.java:215)

"Finalizer" #3 daemon prio=8 os_prio=0 cpu=0.15ms elapsed=248.07s tid=0x00007f258013e670 nid=0x20d5 in Object.wait() [0x00007f25491fb000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(java.base@17.0.6/Native Method)

waiting on <0x000000060a203270> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(java.base@17.0.6/ReferenceQueue.java:155)
locked <0x000000060a203270> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(java.base@17.0.6/ReferenceQueue.java:176) at java.lang.ref.Finalizer$FinalizerThread.run(java.base@17.0.6/Finalizer.java:172)

"Signal Dispatcher" #4 daemon prio=9 os_prio=0 cpu=0.21ms elapsed=248.06s tid=0x00007f2580144be0 nid=0x20d6 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"Service Thread" #5 daemon prio=9 os_prio=0 cpu=0.83ms elapsed=248.06s tid=0x00007f2580145ea0 nid=0x20d7 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"Monitor Deflation Thread" #6 daemon prio=9 os_prio=0 cpu=4.18ms elapsed=248.06s tid=0x00007f25801472b0 nid=0x20d8 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"C2 CompilerThread0" #7 daemon prio=9 os_prio=0 cpu=4509.13ms elapsed=248.06s tid=0x00007f2580148bb0 nid=0x20d9 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE No compile task

"C1 CompilerThread0" #9 daemon prio=9 os_prio=0 cpu=1320.18ms elapsed=248.06s tid=0x00007f258014a0e0 nid=0x20da waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE No compile task

"Sweeper thread" #10 daemon prio=9 os_prio=0 cpu=19.82ms elapsed=248.06s tid=0x00007f258014b550 nid=0x20db runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"Common-Cleaner" #11 daemon prio=8 os_prio=0 cpu=1.53ms elapsed=247.98s tid=0x00007f2580157cc0 nid=0x20de in Object.wait() [0x00007f25487cb000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(java.base@17.0.6/Native Method)

waiting on <0x000000060a2176d8> (a java.lang.ref.ReferenceQueue$Lock) at java.lang.ref.ReferenceQueue.remove(java.base@17.0.6/ReferenceQueue.java:155)
locked <0x000000060a2176d8> (a java.lang.ref.ReferenceQueue$Lock) at jdk.internal.ref.CleanerImpl.run(java.base@17.0.6/CleanerImpl.java:140) at java.lang.Thread.run(java.base@17.0.6/Thread.java:833) at jdk.internal.misc.InnocuousThread.run(java.base@17.0.6/InnocuousThread.java:162)

"JDWP Transport Listener: dt_socket" #12 daemon prio=10 os_prio=0 cpu=23.83ms elapsed=247.95s tid=0x00007f258019c640 nid=0x20e0 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"JDWP Event Helper Thread" #13 daemon prio=10 os_prio=0 cpu=3.05ms elapsed=247.94s tid=0x00007f25801afef0 nid=0x20e3 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"JDWP Command Reader" #14 daemon prio=10 os_prio=0 cpu=7.22ms elapsed=247.94s tid=0x00007f2518000f60 nid=0x20e4 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"Notification Thread" #15 daemon prio=9 os_prio=0 cpu=2.93ms elapsed=247.25s tid=0x00007f2581a08ce0 nid=0x20ec runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"ReaderThread" #17 prio=5 os_prio=0 cpu=0.75ms elapsed=247.07s tid=0x00007f2581a964b0 nid=0x20ee runnable [0x00007f250b1b9000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.SocketDispatcher.read0(java.base@17.0.6/Native Method) at sun.nio.ch.SocketDispatcher.read(java.base@17.0.6/SocketDispatcher.java:47) at sun.nio.ch.NioSocketImpl.tryRead(java.base@17.0.6/NioSocketImpl.java:261) at sun.nio.ch.NioSocketImpl.implRead(java.base@17.0.6/NioSocketImpl.java:312) at sun.nio.ch.NioSocketImpl.read(java.base@17.0.6/NioSocketImpl.java:350) at sun.nio.ch.NioSocketImpl$1.read(java.base@17.0.6/NioSocketImpl.java:803) at java.net.Socket$SocketInputStream.read(java.base@17.0.6/Unknown Source) at sun.nio.cs.StreamDecoder.readBytes(java.base@17.0.6/StreamDecoder.java:270) at sun.nio.cs.StreamDecoder.implRead(java.base@17.0.6/StreamDecoder.java:313) at sun.nio.cs.StreamDecoder.read(java.base@17.0.6/StreamDecoder.java:188)

locked <0x000000060fc5cae8> (a java.io.InputStreamReader) at java.io.InputStreamReader.read(java.base@17.0.6/InputStreamReader.java:177) at java.io.BufferedReader.fill(java.base@17.0.6/BufferedReader.java:162) at java.io.BufferedReader.readLine(java.base@17.0.6/BufferedReader.java:329)
locked <0x000000060fc5cae8> (a java.io.InputStreamReader) at java.io.BufferedReader.readLine(java.base@17.0.6/BufferedReader.java:396) at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner$ReaderThread.run(RemoteTestRunner.java:152)

"Attach Listener" #18 daemon prio=9 os_prio=0 cpu=365.45ms elapsed=246.69s tid=0x00007f252c000e20 nid=0x20f3 runnable [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"RMI TCP Accept-0" #19 daemon prio=9 os_prio=0 cpu=1.80ms elapsed=246.38s tid=0x00007f24e80671a0 nid=0x20f8 runnable [0x00007f250abac000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.Net.accept(java.base@17.0.6/Native Method) at sun.nio.ch.NioSocketImpl.accept(java.base@17.0.6/NioSocketImpl.java:755) at java.net.ServerSocket.implAccept(java.base@17.0.6/ServerSocket.java:675) at java.net.ServerSocket.platformImplAccept(java.base@17.0.6/ServerSocket.java:641) at java.net.ServerSocket.implAccept(java.base@17.0.6/ServerSocket.java:617) at java.net.ServerSocket.implAccept(java.base@17.0.6/ServerSocket.java:574) at java.net.ServerSocket.accept(java.base@17.0.6/ServerSocket.java:532) at sun.management.jmxremote.LocalRMIServerSocketFactory$1.accept(jdk.management.agent@17.0.6/LocalRMIServerSocketFactory.java:52) at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.executeAcceptLoop(java.rmi@17.0.6/TCPTransport.java:413) at sun.rmi.transport.tcp.TCPTransport$AcceptLoop.run(java.rmi@17.0.6/TCPTransport.java:377) at java.lang.Thread.run(java.base@17.0.6/Thread.java:833)

"JFR Recorder Thread" #20 daemon prio=5 os_prio=0 cpu=23.97ms elapsed=180.24s tid=0x00007f24e8f40cd0 nid=0x2174 waiting on condition [0x0000000000000000] java.lang.Thread.State: RUNNABLE

"JFR Periodic Tasks" #21 daemon prio=9 os_prio=0 cpu=279.39ms elapsed=180.04s tid=0x00007f24e8fda340 nid=0x2175 waiting for monitor entry [0x00007f250a9aa000] java.lang.Thread.State: BLOCKED (on object monitor) at jdk.jfr.internal.PlatformRecorder.periodicTask(jdk.jfr@17.0.6/PlatformRecorder.java:499)

waiting to lock <0x000000060fc5d2b8> (a jdk.jfr.internal.PlatformRecorder) at jdk.jfr.internal.PlatformRecorder.lambda$startDiskMonitor$1(jdk.jfr@17.0.6/PlatformRecorder.java:448) at jdk.jfr.internal.PlatformRecorder$$Lambda$481/0x0000000800c5b330.run(jdk.jfr@17.0.6/Unknown Source) at java.lang.Thread.run(java.base@17.0.6/Thread.java:833)

"JFR Recording Scheduler" #24 daemon prio=9 os_prio=0 cpu=15.37ms elapsed=180.00s tid=0x00007f24d40031e0 nid=0x2178 waiting on condition [0x00007f2509dfb000] java.lang.Thread.State: RUNNABLE at jdk.jfr.internal.JVM.emitEvent(jdk.jfr@17.0.6/Native Method) at jdk.jfr.internal.RequestEngine$RequestHook.execute(jdk.jfr@17.0.6/RequestEngine.java:70) at jdk.jfr.internal.RequestEngine.doChunk(jdk.jfr@17.0.6/RequestEngine.java:186) at jdk.jfr.internal.RequestEngine.doChunkEnd(jdk.jfr@17.0.6/RequestEngine.java:175) at jdk.jfr.internal.PlatformRecorder.stop(jdk.jfr@17.0.6/PlatformRecorder.java:323)

locked <0x000000060fc5d2b8> (a jdk.jfr.internal.PlatformRecorder) at jdk.jfr.internal.PlatformRecording.stop(jdk.jfr@17.0.6/PlatformRecording.java:166)
locked <0x000000060fc5d2b8> (a jdk.jfr.internal.PlatformRecorder) at jdk.jfr.internal.PlatformRecording$2.run(jdk.jfr@17.0.6/PlatformRecording.java:667) at java.util.TimerThread.mainLoop(java.base@17.0.6/Timer.java:566) at java.util.TimerThread.run(java.base@17.0.6/Timer.java:516)

"RMI TCP Connection(1)-10.120.185.4" #25 daemon prio=9 os_prio=0 cpu=112.66ms elapsed=154.46s tid=0x00007f24dc002b10 nid=0x21aa in Object.wait() [0x00007f25481c0000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(java.base@17.0.6/Native Method)

waiting on at com.sun.jmx.remote.internal.ArrayNotificationBuffer.fetchNotifications(java.management@17.0.6/ArrayNotificationBuffer.java:449)
locked <0x000000060fc5d7f0> (a com.sun.jmx.remote.internal.ArrayNotificationBuffer) at com.sun.jmx.remote.internal.ArrayNotificationBuffer$ShareBuffer.fetchNotifications(java.management@17.0.6/ArrayNotificationBuffer.java:227) at com.sun.jmx.remote.internal.ServerNotifForwarder.fetchNotifs(java.management@17.0.6/ServerNotifForwarder.java:275) at javax.management.remote.rmi.RMIConnectionImpl$4.run(java.management.rmi@17.0.6/RMIConnectionImpl.java:1271) at javax.management.remote.rmi.RMIConnectionImpl$4.run(java.management.rmi@17.0.6/RMIConnectionImpl.java:1269) at javax.management.remote.rmi.RMIConnectionImpl.fetchNotifications(java.management.rmi@17.0.6/RMIConnectionImpl.java:1275) at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(java.base@17.0.6/Native Method) at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(java.base@17.0.6/NativeMethodAccessorImpl.java:77) at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(java.base@17.0.6/DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(java.base@17.0.6/Method.java:568) at sun.rmi.server.UnicastServerRef.dispatch(java.rmi@17.0.6/UnicastServerRef.java:360) at sun.rmi.transport.Transport$1.run(java.rmi@17.0.6/Transport.java:200) at sun.rmi.transport.Transport$1.run(java.rmi@17.0.6/Transport.java:197) at java.security.AccessController.executePrivileged(java.base@17.0.6/AccessController.java:807) at java.security.AccessController.doPrivileged(java.base@17.0.6/AccessController.java:712) at sun.rmi.transport.Transport.serviceCall(java.rmi@17.0.6/Transport.java:196) at sun.rmi.transport.tcp.TCPTransport.handleMessages(java.rmi@17.0.6/TCPTransport.java:587) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(java.rmi@17.0.6/TCPTransport.java:828) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(java.rmi@17.0.6/TCPTransport.java:705) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$495/0x0000000800d81818.run(java.rmi@17.0.6/Unknown Source) at java.security.AccessController.executePrivileged(java.base@17.0.6/AccessController.java:776) at java.security.AccessController.doPrivileged(java.base@17.0.6/AccessController.java:399) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(java.rmi@17.0.6/TCPTransport.java:704) at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.6/ThreadPoolExecutor.java:1136) at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.6/ThreadPoolExecutor.java:635) at java.lang.Thread.run(java.base@17.0.6/Thread.java:833)

"RMI Scheduler(0)" #26 daemon prio=9 os_prio=0 cpu=0.41ms elapsed=154.43s tid=0x00007f25041d9870 nid=0x21ab waiting on condition [0x00007f2509efc000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@17.0.6/Native Method)

parking to wait for <0x000000060fcaca70> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject) at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.6/LockSupport.java:252) at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(java.base@17.0.6/AbstractQueuedSynchronizer.java:1672) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(java.base@17.0.6/ScheduledThreadPoolExecutor.java:1182) at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(java.base@17.0.6/ScheduledThreadPoolExecutor.java:899) at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@17.0.6/ThreadPoolExecutor.java:1062) at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.6/ThreadPoolExecutor.java:1122) at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.6/ThreadPoolExecutor.java:635) at java.lang.Thread.run(java.base@17.0.6/Thread.java:833)

"JMX server connection timeout 27" #27 daemon prio=9 os_prio=0 cpu=4.69ms elapsed=154.43s tid=0x00007f2504154eb0 nid=0x21ac in Object.wait() [0x00007f2509cfa000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(java.base@17.0.6/Native Method)

waiting on at com.sun.jmx.remote.internal.ServerCommunicatorAdmin$Timeout.run(java.management@17.0.6/ServerCommunicatorAdmin.java:171)
locked <0x000000060fcacec0> (a [I) at java.lang.Thread.run(java.base@17.0.6/Thread.java:833)

"RMI TCP Connection(2)-10.120.185.4" #28 daemon prio=9 os_prio=0 cpu=490.97ms elapsed=154.30s tid=0x00007f24dc007690 nid=0x21b0 runnable [0x00007f2509af7000] java.lang.Thread.State: RUNNABLE at sun.nio.ch.Net.poll(java.base@17.0.6/Native Method) at sun.nio.ch.NioSocketImpl.park(java.base@17.0.6/NioSocketImpl.java:181) at sun.nio.ch.NioSocketImpl.timedRead(java.base@17.0.6/NioSocketImpl.java:285) at sun.nio.ch.NioSocketImpl.implRead(java.base@17.0.6/NioSocketImpl.java:309) at sun.nio.ch.NioSocketImpl.read(java.base@17.0.6/NioSocketImpl.java:350) at sun.nio.ch.NioSocketImpl$1.read(java.base@17.0.6/NioSocketImpl.java:803) at java.net.Socket$SocketInputStream.read(java.base@17.0.6/Socket.java:966) at java.io.BufferedInputStream.fill(java.base@17.0.6/BufferedInputStream.java:244) at java.io.BufferedInputStream.read(java.base@17.0.6/BufferedInputStream.java:263)

locked <0x000000060fcaf190> (a java.io.BufferedInputStream) at java.io.FilterInputStream.read(java.base@17.0.6/FilterInputStream.java:82) at sun.rmi.transport.tcp.TCPTransport.handleMessages(java.rmi@17.0.6/TCPTransport.java:569) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(java.rmi@17.0.6/TCPTransport.java:828) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(java.rmi@17.0.6/TCPTransport.java:705) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler$$Lambda$495/0x0000000800d81818.run(java.rmi@17.0.6/Unknown Source) at java.security.AccessController.executePrivileged(java.base@17.0.6/AccessController.java:776) at java.security.AccessController.doPrivileged(java.base@17.0.6/AccessController.java:399) at sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(java.rmi@17.0.6/TCPTransport.java:704) at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.6/ThreadPoolExecutor.java:1136) at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.6/ThreadPoolExecutor.java:635) at java.lang.Thread.run(java.base@17.0.6/Thread.java:833)

"OkHttp TaskRunner" #31 daemon prio=5 os_prio=0 cpu=79.20ms elapsed=146.01s tid=0x00007f24fc0a20d0 nid=0x21cf waiting on condition [0x00007f2508307000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@17.0.6/Native Method)

parking to wait for <0x000000060fdd12a8> (a java.util.concurrent.SynchronousQueue$TransferStack) at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.6/LockSupport.java:252) at java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.base@17.0.6/SynchronousQueue.java:401) at java.util.concurrent.SynchronousQueue.poll(java.base@17.0.6/SynchronousQueue.java:903) at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@17.0.6/ThreadPoolExecutor.java:1061) at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.6/ThreadPoolExecutor.java:1122) at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.6/ThreadPoolExecutor.java:635) at java.lang.Thread.run(java.base@17.0.6/Thread.java:833)

"OkHttp TaskRunner" #33 daemon prio=5 os_prio=0 cpu=3.05ms elapsed=146.00s tid=0x00007f250c007010 nid=0x21d1 waiting on condition [0x00007f25085fc000] java.lang.Thread.State: TIMED_WAITING (parking) at jdk.internal.misc.Unsafe.park(java.base@17.0.6/Native Method)

parking to wait for <0x000000060fdd12a8> (a java.util.concurrent.SynchronousQueue$TransferStack) at java.util.concurrent.locks.LockSupport.parkNanos(java.base@17.0.6/LockSupport.java:252) at java.util.concurrent.SynchronousQueue$TransferStack.transfer(java.base@17.0.6/SynchronousQueue.java:401) at java.util.concurrent.SynchronousQueue.poll(java.base@17.0.6/SynchronousQueue.java:903) at java.util.concurrent.ThreadPoolExecutor.getTask(java.base@17.0.6/ThreadPoolExecutor.java:1061) at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.6/ThreadPoolExecutor.java:1122) at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.6/ThreadPoolExecutor.java:635) at java.lang.Thread.run(java.base@17.0.6/Thread.java:833)

"OkHttp TaskRunner" #35 daemon prio=5 os_prio=0 cpu=1.13ms elapsed=145.98s tid=0x00007f2518010250 nid=0x21d3 in Object.wait() [0x00007f2508206000] java.lang.Thread.State: TIMED_WAITING (on object monitor) at java.lang.Object.wait(java.base@17.0.6/Native Method)

waiting on at java.lang.Object.wait(java.base@17.0.6/Object.java:472) at okhttp3.internal.concurrent.TaskRunner$RealBackend.coordinatorWait(TaskRunner.kt:294) at okhttp3.internal.concurrent.TaskRunner.awaitTaskToRun(TaskRunner.kt:218) at okhttp3.internal.concurrent.TaskRunner$runnable$1.run(TaskRunner.kt:59)
locked <0x000000060fdd11f0> (a okhttp3.internal.concurrent.TaskRunner) at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@17.0.6/ThreadPoolExecutor.java:1136) at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@17.0.6/ThreadPoolExecutor.java:635) at java.lang.Thread.run(java.base@17.0.6/Thread.java:833)

"VM Thread" os_prio=0 cpu=36.04ms elapsed=248.07s tid=0x00007f25801391e0 nid=0x20d3 runnable

"GC Thread#0" os_prio=0 cpu=38.15ms elapsed=248.09s tid=0x00007f2580053be0 nid=0x20ce runnable

"GC Thread#1" os_prio=0 cpu=35.95ms elapsed=247.49s tid=0x00007f2540005120 nid=0x20e8 runnable

"GC Thread#2" os_prio=0 cpu=45.39ms elapsed=247.49s tid=0x00007f2540005b50 nid=0x20e9 runnable

"GC Thread#3" os_prio=0 cpu=28.77ms elapsed=247.49s tid=0x00007f2540006580 nid=0x20ea runnable

"G1 Main Marker" os_prio=0 cpu=0.58ms elapsed=248.08s tid=0x00007f2580064d10 nid=0x20cf runnable

"G1 Conc#0" os_prio=0 cpu=32.97ms elapsed=248.08s tid=0x00007f2580065c70 nid=0x20d0 runnable

"G1 Refine#0" os_prio=0 cpu=1.14ms elapsed=248.08s tid=0x00007f258010a8a0 nid=0x20d1 runnable

"G1 Refine#1" os_prio=0 cpu=0.03ms elapsed=247.48s tid=0x00007f2544000ce0 nid=0x20eb runnable

"G1 Service" os_prio=0 cpu=32.46ms elapsed=248.08s tid=0x00007f258010b790 nid=0x20d2 runnable

"VM Periodic Task Thread" os_prio=0 cpu=102.13ms elapsed=247.25s tid=0x00007f2580028360 nid=0x20ed waiting on condition

JNI global refs: 90, weak refs: 135

In the same namespace It fails to get secrets that is about 9Mb but it manages to get configmaps that is 700kb and services that is 200kb

It actually freezes on client.raw("https://x.x.x.x:yyy/api/v1/namespaces/xxx/pods"); that have to reduce the code involved.

I'm running the tests using Java 17.0.6 and Junit5 on a linux environment if it matters.

shawkins commented 1 year ago

The next thought is that somewhere OkHttpClientImpl consume (a problem with the executor) or doConsume (a problem reading the okhttp source) is throwing an Error rather than an Exception - those methods are only catching Exception. That would give the appearance of things hanging from the client side - but you'd expect that to be logged by the default uncaught exception handling. Are you customizing the okhttp dispatcher executorservice?

MikaelUlvesjo commented 1 year ago

At the moment I'm only running a minimalistic test case that only imports the kube client so no modifications that I can think of.

MikaelUlvesjo commented 1 year ago

I found the problem, it is a dependency conflict, when I created a new empty project and run the test from there it worked and I added one dep at the time and run the test until it froze then added some excludes and continued. The conflicts I found was on: com.squareup.okhttp3 and com.squareup.okio.

Thank you for your help! You can close this now.

shawkins commented 1 year ago

Interesting that there's some extremely subtle behavioral change rather than a more noticeable error. If possible can you provide a pom or the dependencies that caused this, so that we can see if there's something more defensive that would be easy to include?

MikaelUlvesjo commented 1 year ago

Sorry I can not provide you with an pom file as it is using internal libs but here is the parts of dep tree and pom that in the end helped me find the problem, I have renamed the library:library and removed all non relevant deps from it here as it is an internal one. I had to exclude both okhttp3 and okio from library:library to solve the problem.

[INFO] test:test:jar:1.0.0-SNAPSHOT
[INFO] +- library:library:jar:0.0.1:compile
[INFO] |  +- com.squareup.okhttp3:okhttp:jar:4.9.3:compile
[INFO] |  +- com.squareup.okio:okio:jar:1.15.0:compile
[INFO] +- io.fabric8:kubernetes-client:jar:6.6.1:compile
[INFO] |  +- io.fabric8:kubernetes-client-api:jar:6.6.1:compile
[INFO] |  |  +- io.fabric8:kubernetes-model-gatewayapi:jar:6.6.1:compile
[INFO] |  |  +- io.fabric8:kubernetes-model-resource:jar:6.6.1:compile
[INFO] |  |  +- io.fabric8:kubernetes-model-flowcontrol:jar:6.6.1:compile
[INFO] |  |  +- io.fabric8:kubernetes-model-node:jar:6.6.1:compile
[INFO] |  |  +- org.snakeyaml:snakeyaml-engine:jar:2.6:compile
[INFO] |  |  +- com.fasterxml.jackson.dataformat:jackson-dataformat-yaml:jar:2.15.0:compile
[INFO] |  |  \- com.fasterxml.jackson.datatype:jackson-datatype-jsr310:jar:2.15.0:compile
[INFO] |  +- io.fabric8:kubernetes-httpclient-okhttp:jar:6.6.1:runtime
[INFO] |  |  \- com.squareup.okhttp3:logging-interceptor:jar:4.9.3:runtime
[INFO] |  |     \- org.jetbrains.kotlin:kotlin-stdlib-jdk8:jar:1.7.10:runtime
[INFO] |  |        \- org.jetbrains.kotlin:kotlin-stdlib-jdk7:jar:1.7.10:runtime
[INFO] |  \- io.fabric8:zjsonpatch:jar:0.3.0:compile
[INFO] +- io.fabric8:kubernetes-model:jar:6.6.1:compile
[INFO] |  +- io.fabric8:kubernetes-model-core:jar:6.6.1:compile
[INFO] |  |  \- io.fabric8:kubernetes-model-common:jar:6.6.1:compile
[INFO] |  +- io.fabric8:kubernetes-model-rbac:jar:6.6.1:compile
[INFO] |  +- io.fabric8:kubernetes-model-admissionregistration:jar:6.6.1:compile
[INFO] |  +- io.fabric8:kubernetes-model-apps:jar:6.6.1:compile
[INFO] |  +- io.fabric8:kubernetes-model-autoscaling:jar:6.6.1:compile
[INFO] |  +- io.fabric8:kubernetes-model-apiextensions:jar:6.6.1:compile
[INFO] |  +- io.fabric8:kubernetes-model-batch:jar:6.6.1:compile
[INFO] |  +- io.fabric8:kubernetes-model-certificates:jar:6.6.1:compile
[INFO] |  +- io.fabric8:kubernetes-model-coordination:jar:6.6.1:compile
[INFO] |  +- io.fabric8:kubernetes-model-discovery:jar:6.6.1:compile
[INFO] |  +- io.fabric8:kubernetes-model-events:jar:6.6.1:compile
[INFO] |  +- io.fabric8:kubernetes-model-extensions:jar:6.6.1:compile
[INFO] |  +- io.fabric8:kubernetes-model-networking:jar:6.6.1:compile
[INFO] |  +- io.fabric8:kubernetes-model-metrics:jar:6.6.1:compile
[INFO] |  +- io.fabric8:kubernetes-model-policy:jar:6.6.1:compile
[INFO] |  +- io.fabric8:kubernetes-model-scheduling:jar:6.6.1:compile
[INFO] |  +- io.fabric8:kubernetes-model-storageclass:jar:6.6.1:compile
[INFO] |  \- io.fabric8:openshift-model:jar:6.6.1:compile
[INFO] |     \- io.fabric8:openshift-model-config:jar:6.6.1:compile

So the solution was to set the pom file to something like this (this is only the relevant parts):

    <dependencies>
        <dependency>
            <groupId>library</groupId>
            <artifactId>library</artifactId>
            <version>0.0.1</version>
            <exclusions>
                <exclusion>
                    <groupId>com.squareup.okhttp3</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
                <exclusion>
                    <groupId>com.squareup.okio</groupId>
                    <artifactId>*</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>io.fabric8</groupId>
            <artifactId>kubernetes-client</artifactId>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>io.fabric8</groupId>
            <artifactId>kubernetes-model</artifactId>
            <scope>compile</scope>
        </dependency>
    </dependencies>
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>io.fabric8</groupId>
                <artifactId>kubernetes-client-bom</artifactId>
                <version>6.6.1</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>

I think that this is enough for you to recreate the problem, the strange part is that it worked as intended as long as the response from kubernetes was below a certain size, the limit was somewhere between 1.5Mb and 6.8Mb.

shawkins commented 1 year ago

Quite strange. Tried to reproduce a couple of times with 10000 configmaps making for over 7MB in a single list, but no luck. My best guess is still a Throwable that is not being logged by OkHttpClientImpl nor the default uncaught exception handler. If we see something like this again, we should definitely go ahead and change those to catch throwable just in case.

MikaelUlvesjo commented 1 year ago

I think that I found an other thing that can be related, we use log4j in our project and when I enabled that in my test project it stops working. So it works with org.slf4j.simple.SimpleServiceProvider But not with org.apache.logging.slf4j.SLF4JServiceProvider or pure Log4j without slf4j at all.

MikaelUlvesjo commented 1 year ago

Found the problem with the logging, we have log level to trace globally and save the logs to a database and when io.fabric8.kubernetes.client.http.HttpLoggingInterceptor tries to log the entire body something fails in the logger and when I set it works!.

shawkins commented 1 year ago

Thank you for the follow up, it's good to have a definitive diagnosis. We'll probably want to be more conservative with the size of message we're attempting to log.

MikaelUlvesjo commented 1 year ago

Still do not know why io.fabric8.kubernetes.client.http.HttpLoggingInterceptor fails to log a ~7Mb log row, I tried to find where the limit was and I was able to log up to 64mb log rows with a random string using formatter then I stopped as the log file and db where getting big as I increased the log row by one Mb at the time in a loop...

shawkins commented 1 year ago

Still do not know why io.fabric8.kubernetes.client.http.HttpLoggingInterceptor fails to log a ~7Mb log row

It's a problem with the interceptor logic. It's skipping the call to the actual consumer once the response goes over 2MiB - this will need to be fixed. For now you should disable the trace logging, at least for the http logger.

fabric8io / kubernetes-client