fabric8io / kubernetes-client

Java client for Kubernetes & OpenShift
http://fabric8.io
Apache License 2.0
3.39k stars 1.46k forks source link

Deadlock when using SharedIndexInformer #6354

Closed bzhangyan11 closed 8 minutes ago

bzhangyan11 commented 1 day ago

Describe the bug

My service couldn't receive any events from apiserver and I found some deadlocks in jstack.But i couldn't find the synchronizer 0x00000006c049ca40 which is held by "OkHttp Dispatcher"

https://github.com/openjdk/jdk8u/blob/jdk8u332-ga/jdk/src/share/classes/sun/security/ssl/SSLSocketImpl.java#L935

3968 Found one Java-level deadlock:
3969 =============================
3970 "OkHttp Dispatcher":
3971   waiting to lock monitor 0x00007f6a47f56718 (object 0x00000006c348e388, a io.fabric8.kubernetes.client.dsl.internal.WatchHTTPManager),
3972   which is held by "-1504937617-pool-11-thread-6"
3973 "-1504937617-pool-11-thread-6":
3974   waiting for ownable synchronizer 0x00000006c049ca40, (a java.util.concurrent.locks.ReentrantLock$NonfairSync),
3975   which is held by "OkHttp Dispatcher"
3976 
3977 Java stack information for the threads listed above:
3978 ===================================================
3979 "OkHttp Dispatcher":
3980     at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.scheduleReconnect(AbstractWatchManager.java:233)
3981     - waiting to lock <0x00000006c348e388> (a io.fabric8.kubernetes.client.dsl.internal.WatchHTTPManager)
3982     at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.watchEnded(AbstractWatchManager.java:433)
3983     at io.fabric8.kubernetes.client.dsl.internal.WatchHTTPManager.lambda$null$1(WatchHTTPManager.java:82)
3984     at io.fabric8.kubernetes.client.dsl.internal.WatchHTTPManager$$Lambda$745/1901952678.accept(Unknown Source)
3985     at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:774)
3986     at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:750)
3987     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
3988     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
3989     at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$OkHttpAsyncBody.doConsume(OkHttpClientImpl.java:141)
3990     at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$OkHttpAsyncBody$$Lambda$703/1759056657.run(Unknown Source)
3991     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
3992     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)                                                                                        
3993     at java.lang.Thread.run(Thread.java:879)
3994 "-1504937617-pool-11-thread-6":
3995     at sun.misc.Unsafe.park0(Native Method)
3996     - parking to wait for  <0x00000006c049ca40> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
3997     at sun.misc.Unsafe.park(Unsafe.java:1038)
3998     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:176)
3998     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:176)
3999     at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:842)
4000     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:876)
4001     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1207)
4002     at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
4003     at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
4004     at sun.security.ssl.SSLSocketImpl$AppInputStream.read(SSLSocketImpl.java:935)
4005     at okio.InputStreamSource.read(JvmOkio.kt:90)
4006     at okio.AsyncTimeout$source$1.read(AsyncTimeout.kt:129)
4007     at okio.RealBufferedSource.indexOf(RealBufferedSource.kt:427)
4008     at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.kt:320)
4009     at okio.RealBufferedSource.readUtf8LineStrict(RealBufferedSource.kt:105)
4010     at okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.readChunkSize(Http1ExchangeCodec.kt:426)
4011     at okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.read(Http1ExchangeCodec.kt:408)
4012     at okhttp3.internal.Util.skipAll(Util.kt:344)
4013     at okhttp3.internal.Util.discard(Util.kt:365)
4014     at okhttp3.internal.http1.Http1ExchangeCodec$ChunkedSource.close(Http1ExchangeCodec.kt:450)
4015     at okio.ForwardingSource.close(ForwardingSource.kt:34)
4016     at okhttp3.internal.connection.Exchange$ResponseBodySource.close(Exchange.kt:309)
4017     at okio.RealBufferedSource.close(RealBufferedSource.kt:477)
4018     at io.fabric8.kubernetes.client.utils.Utils.closeQuietly(Utils.java:206)
4019     at io.fabric8.kubernetes.client.utils.Utils.closeQuietly(Utils.java:220)
4020     at io.fabric8.kubernetes.client.okhttp.OkHttpClientImpl$OkHttpAsyncBody.cancel(OkHttpClientImpl.java:154)
4021     at io.fabric8.kubernetes.client.dsl.internal.WatchHTTPManager$$Lambda$2522/246463238.accept(Unknown Source)
4022     at java.util.Optional.ifPresent(Optional.java:159)
4023     at io.fabric8.kubernetes.client.dsl.internal.WatchHTTPManager.closeCurrentRequest(WatchHTTPManager.java:93)
4024     - locked <0x00000006c348e388> (a io.fabric8.kubernetes.client.dsl.internal.WatchHTTPManager)
4025     at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.closeRequest(AbstractWatchManager.java:161)
4026     - locked <0x00000006c348e388> (a io.fabric8.kubernetes.client.dsl.internal.WatchHTTPManager)
4027     at io.fabric8.kubernetes.client.informers.impl.cache.Reflector$$Lambda$2520/721648509.accept(Unknown Source)
4028     at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:670)
4029     at java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:683)
4030     at java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2010)
4031     at io.fabric8.kubernetes.client.informers.impl.cache.Reflector.lambda$startWatcher$12(Reflector.java:239)
4032     at io.fabric8.kubernetes.client.informers.impl.cache.Reflector$$Lambda$739/2045679137.run(Unknown Source)
4033     at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1640)
4034     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
4035     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)                                                                                        
4036     at java.lang.Thread.run(Thread.java:879)

Fabric8 Kubernetes Client version

6.13.3

Steps to reproduce

I don't know how to reproduce it.

Expected behavior

no deadlock

Runtime

Kubernetes (vanilla)

Kubernetes API Server version

other (please specify in additional context)

Environment

Linux

Fabric8 Kubernetes Client Logs

No response

Additional context

No response

shawkins commented 1 day ago

http watches unfortunately haven't gotten as much scrutiny as websocket watches, and this does indeed look like a deadlock in the current state of that code. Are you choosing to use http watches or is that happening for you?

shawkins commented 1 day ago

Also this appears to be okhttp specific. You could try one of the other http clients instead.

bzhangyan11 commented 1 day ago

http watches unfortunately haven't gotten as much scrutiny as websocket watches, and this does indeed look like a deadlock in the current state of that code. Are you choosing to use http watches or is that happening for you?

yes, i'm using http watches.

bzhangyan11 commented 1 day ago

Previously,my service got stock with websocket(can't receive any events from apiserver),so i upgraded fabric8 from 5.X to 6.13.3 and chose http watches instead.

bzhangyan11 commented 1 day ago

http watches unfortunately haven't gotten as much scrutiny as websocket watches, and this does indeed look like a deadlock in the current state of that code. Are you choosing to use http watches or is that happening for you?

and it seems that this issue only occurs in specific versions of the JDK. https://github.com/openjdk/jdk8u/blob/jdk8u332-ga/jdk/src/share/classes/sun/security/ssl/SSLSocketImpl.java#L935

works well in this version of SSLSocketImpl. https://github.com/openjdk/jdk/blob/jdk8-b120/jdk/src/share/classes/sun/security/ssl/SSLSocketImpl.java https://github.com/openjdk/jdk/blob/jdk8-b120/jdk/src/share/classes/sun/security/ssl/AppInputStream.java

shawkins commented 19 hours ago

@bzhangyan11 thanks for the additional context and links. The proposed pr should be safe regardless - our http client layer over okhttp attempts to make it appear as an async api, but it's actually blocking. We hadn't seen this blocking on close behavior before - which you are pointing out looks to even be JRE specific. So if you can't change your JRE, and you need to keep using http watches, the only other near term workaround is to use a different http client.