apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0
14.26k stars 3.59k forks source link

creat ledger timeout and thread BookKeeperClientWorker in stuck #14311

Open fu-turer opened 2 years ago

fu-turer commented 2 years ago

Describe the bug we found many timeout logs when creating producer or sending message.like this,we creat a new topic and then build a producer to send message:

java.util.concurrent.CompletionException: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Failed to load topic within timeout 
        at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
        at java.util.concurrent.CompletableFuture$UniApply.tryFire(Unknown Source) ~[?:?]
        at java.util.concurrent.CompletableFuture.postComplete(Unknown Source) ~[?:?]
        at java.util.concurrent.CompletableFuture.completeExceptionally(Unknown Source) ~[?:?]

we check the stack trace of create ledger and found that when creating ledger complete it can't get the thread BookKeeperClientWorker to do callback's work

image

we dump tread stack, BookKeeperClientWorker-OrderedExecutor-0-0 is always BLOCKED

"BookKeeperClientScheduler-OrderedScheduler-0-0" #26 prio=5 os_prio=0 cpu=61864.97ms elapsed=268089.10s tid=0x00007f88598ef800 nid=0x17b in Object.wait()  [0x00007f87fa5e1000]
   java.lang.Thread.State: WAITING (on object monitor)
        at java.lang.Object.wait(java.base@11.0.12/Native Method)
        - waiting on <no object reference available>
        at java.lang.Object.wait(java.base@11.0.12/Unknown Source)
        at io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:275)
        - waiting to re-lock in wait() <0x00000000cb13f9c0> (a io.netty.channel.DefaultChannelPromise)
        at io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:137)
        at io.netty.channel.DefaultChannelPromise.awaitUninterruptibly(DefaultChannelPromise.java:30)
        at org.apache.bookkeeper.proto.PerChannelBookieClient.closeInternal(PerChannelBookieClient.java:1081)
        at org.apache.bookkeeper.proto.PerChannelBookieClient.disconnect(PerChannelBookieClient.java:1034)
        at org.apache.bookkeeper.proto.PerChannelBookieClient.disconnect(PerChannelBookieClient.java:1029)
        at org.apache.bookkeeper.proto.PerChannelBookieClient.failTLS(PerChannelBookieClient.java:2555)
        - locked <0x00000000b8293218> (a org.apache.bookkeeper.proto.PerChannelBookieClient)
        at org.apache.bookkeeper.proto.PerChannelBookieClient.access$2900(PerChannelBookieClient.java:154)
        at org.apache.bookkeeper.proto.PerChannelBookieClient$StartTLSCompletion.errorOut(PerChannelBookieClient.java:1961)
        at org.apache.bookkeeper.proto.PerChannelBookieClient$CompletionValue.timeout(PerChannelBookieClient.java:1614)
        at org.apache.bookkeeper.proto.PerChannelBookieClient$CompletionValue.maybeTimeout(PerChannelBookieClient.java:1606)
        at org.apache.bookkeeper.proto.PerChannelBookieClient.lambda$static$3(PerChannelBookieClient.java:1011)
        at org.apache.bookkeeper.proto.PerChannelBookieClient$$Lambda$596/0x00000001006f8440.test(Unknown Source)
        at org.apache.bookkeeper.util.collections.ConcurrentOpenHashMap$Section.removeIf(ConcurrentOpenHashMap.java:411)
        at org.apache.bookkeeper.util.collections.ConcurrentOpenHashMap.removeIf(ConcurrentOpenHashMap.java:172)
        at org.apache.bookkeeper.proto.PerChannelBookieClient.checkTimeoutOnPendingOperations(PerChannelBookieClient.java:1015)
        at org.apache.bookkeeper.proto.DefaultPerChannelBookieClientPool.checkTimeoutOnPendingOperations(DefaultPerChannelBookieClientPool.java:132)
        at org.apache.bookkeeper.proto.BookieClientImpl.monitorPendingOperations(BookieClientImpl.java:572)
        at org.apache.bookkeeper.proto.BookieClientImpl.lambda$new$0(BookieClientImpl.java:131)
        at org.apache.bookkeeper.proto.BookieClientImpl$$Lambda$163/0x000000010038e840.run(Unknown Source)
        at org.apache.bookkeeper.util.SafeRunnable$1.safeRun(SafeRunnable.java:43)
        at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
        at com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator$NeverSuccessfulListenableFutureTask.run(MoreExecutors.java:705)
        at java.util.concurrent.Executors$RunnableAdapter.call(java.base@11.0.12/Unknown Source)
        at java.util.concurrent.FutureTask.runAndReset(java.base@11.0.12/Unknown Source)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(java.base@11.0.12/Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.12/Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.12/Unknown Source)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(java.base@11.0.12/Unknown Source)

"BookKeeperClientWorker-OrderedExecutor-0-0" #27 prio=5 os_prio=0 cpu=616668.69ms elapsed=268089.10s tid=0x00007f88598f3000 nid=0x17c waiting for monitor entry  [0x00007f87fa4e0000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.bookkeeper.proto.PerChannelBookieClient.connectIfNeededAndDoOp(PerChannelBookieClient.java:631)
        - waiting to lock <0x00000000b8293218> (a org.apache.bookkeeper.proto.PerChannelBookieClient)
        at org.apache.bookkeeper.proto.DefaultPerChannelBookieClientPool.obtain(DefaultPerChannelBookieClientPool.java:121)
        at org.apache.bookkeeper.proto.DefaultPerChannelBookieClientPool.obtain(DefaultPerChannelBookieClientPool.java:116)
        at org.apache.bookkeeper.proto.BookieClientImpl.addEntry(BookieClientImpl.java:329)
        at org.apache.bookkeeper.client.PendingAddOp.sendWriteRequest(PendingAddOp.java:152)
        at org.apache.bookkeeper.client.PendingAddOp.safeRun(PendingAddOp.java:278)
        at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@11.0.12/Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@11.0.12/Unknown Source)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(java.base@11.0.12/Unknown Source)

To Reproduce i don't known how to reproduce

Additional context version:2.9.1

fistan684 commented 2 years ago

Since upgrading from 2.7.1 to 2.9.1 we also have been hitting issues where our producers will begin failing. Typically we'll have one broker that isn't responding to requests and restarting it will fix the issue. From looking at the broker logs we see a similar error to what is describe in this issue:

2022-02-18T17:35:06,772+0000 [pulsar-io-4-7] WARN org.apache.pulsar.broker.service.ServerCnx - [/10.100.209.43:51632][persistent://prod/voltron-general/871_2b2d84be150dcf9c_MAID_DELETE_6333758_4bb66664126194f7-partition-0][voltron] Failed to create consumer: consumerId=0, Failed to load topic within timeout java.util.concurrent.CompletionException: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Failed to load topic within timeout ... at org.apache.pulsar.common.util.FutureUtil.lambda$addTimeoutHandling$1(FutureUtil.java:141) ~[org.apache.pulsar-pulsar-common-2.9.1.jar:2.9.1] at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final] ... Caused by: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Failed to load topic within timeout

This seems to be happening randomly every 4-7 hours since we upgraded. We typically write to a lot of topics in a given namespace. We'll try to capture thread state next time it happens.

michaeljmarshall commented 2 years ago

@fu-turer thanks for opening this issue and providing the thread dumps. It looks like the blocked thread is in the bookkeeper client. Based on your thread stack, I'm assuming you're running with TLS enabled, is that correct? Do you see a log line with TLS failure on in it? It should end with a number, and that number is a bk error code.

dlg99 commented 2 years ago

@fu-turer is there anything in the logs related to not being able to connect to bookies because of e.g. a TLS issue? Or attempt to use TLS 1.3? see https://github.com/apache/bookkeeper/issues/2711

@fistan684 are you rotating TLS certificates every few hours? any chance that certificates occasionally expire before new ones deployed on the bookie nodes?

fu-turer commented 2 years ago

@dlg99 i use TLSv1.2.

@michaeljmarshall there are 3 bookies, 2 brokers in my test environment. before problem happen, 3 bookies are all down, after all bookies recover, one broker's thead in stuck, and the anther one is normal

dlg99 commented 2 years ago

before problem happen, 3 bookies are all down

this would explain failTLS in the stack :) Why are they all going down? will the thread unblock after some time (some timeouts can be in the range of 30sec or so by default) did the bookie IP address change after the bookie came back up? JVM caches dns lookups etc.

bookie client (pulsar broker side) should log what it is doing, trying to connect/failing, certificate errors etc. I recommend checking the log (at least to make sure that bookie client logs there). With that please attach a full log from the client side and from the bookies and a full thread dump.

github-actions[bot] commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.

github-actions[bot] commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.