Open fu-turer opened 2 years ago
Since upgrading from 2.7.1 to 2.9.1 we have also been hitting issues where our producers begin failing. Typically we'll have one broker that isn't responding to requests, and restarting it fixes the issue. Looking at the broker logs, we see an error similar to the one described in this issue:
2022-02-18T17:35:06,772+0000 [pulsar-io-4-7] WARN org.apache.pulsar.broker.service.ServerCnx - [/10.100.209.43:51632][persistent://prod/voltron-general/871_2b2d84be150dcf9c_MAID_DELETE_6333758_4bb66664126194f7-partition-0][voltron] Failed to create consumer: consumerId=0, Failed to load topic within timeout java.util.concurrent.CompletionException: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Failed to load topic within timeout ... at org.apache.pulsar.common.util.FutureUtil.lambda$addTimeoutHandling$1(FutureUtil.java:141) ~[org.apache.pulsar-pulsar-common-2.9.1.jar:2.9.1] at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98) [io.netty-netty-common-4.1.72.Final.jar:4.1.72.Final] ... Caused by: org.apache.pulsar.common.util.FutureUtil$LowOverheadTimeoutException: Failed to load topic within timeout
This seems to be happening randomly every 4-7 hours since we upgraded. We typically write to a lot of topics in a given namespace. We'll try to capture thread state next time it happens.
@fu-turer thanks for opening this issue and providing the thread dumps. It looks like the blocked thread is in the bookkeeper client. Based on your thread stack, I'm assuming you're running with TLS enabled, is that correct? Do you see a log line with "TLS failure on" in it? It should end with a number, and that number is a bk error code.
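To make that log check concrete, here is a minimal sketch that scans broker log lines for "TLS failure on" and tallies the trailing number. The only things taken from this thread are the "TLS failure on" text and the fact that the line ends with a number; the sample lines, logger name, host names, and code values below are made-up placeholders, not real BookKeeper output.

```python
import re
from collections import Counter

# Matches a line that contains "TLS failure on" and ends with an integer.
# The exact log layout is an assumption; only the phrase and the trailing
# number come from the discussion above.
TLS_FAILURE = re.compile(r"TLS failure on.*?(-?\d+)\s*$")


def tls_failure_codes(lines):
    """Count the trailing numeric code of every 'TLS failure on' line."""
    codes = Counter()
    for line in lines:
        match = TLS_FAILURE.search(line)
        if match:
            codes[int(match.group(1))] += 1
    return codes


# Hypothetical sample lines; the codes -1 and -2 are arbitrary placeholders.
SAMPLE_LINES = [
    "WARN hypothetical.Logger - TLS failure on hypothetical-bookie-1:3181, rc: -1",
    "INFO hypothetical.Logger - connected to hypothetical-bookie-2:3181",
    "WARN hypothetical.Logger - TLS failure on hypothetical-bookie-3:3181, rc: -1",
    "WARN hypothetical.Logger - TLS failure on hypothetical-bookie-1:3181, rc: -2",
]

if __name__ == "__main__":
    for code, count in tls_failure_codes(SAMPLE_LINES).most_common():
        print(f"code {code}: {count} line(s)")
```

To run it against a real broker log, pass the open file object instead of `SAMPLE_LINES`; iterating a file yields lines with a trailing newline, which the pattern tolerates.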
@fu-turer is there anything in the logs related to not being able to connect to bookies because of e.g. a TLS issue? Or an attempt to use TLS 1.3? See https://github.com/apache/bookkeeper/issues/2711
@fistan684 are you rotating TLS certificates every few hours? Any chance that certificates occasionally expire before new ones are deployed on the bookie nodes?
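One way to check the expiry question is to compare a certificate's notAfter value with the current time. The sketch below assumes you already have the notAfter string, for example from `openssl x509 -enddate` or from `ssl.SSLSocket.getpeercert()`; the timestamp used here is only an illustrative value, not one from this deployment.

```python
import ssl
from datetime import datetime, timezone


def seconds_until_expiry(not_after: str, now: datetime) -> float:
    """Return the seconds left before a certificate expires (negative if expired).

    `not_after` uses the certificate time format, e.g. "Jan  5 09:34:43 2018 GMT".
    `now` must be a timezone-aware datetime.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)  # seconds since the epoch, UTC
    return expires_at - now.timestamp()


if __name__ == "__main__":
    # Illustrative value only; substitute the notAfter of the bookie certificate.
    example_not_after = "Jan  5 09:34:43 2018 GMT"
    remaining = seconds_until_expiry(example_not_after, datetime.now(timezone.utc))
    print("expired" if remaining < 0 else f"{remaining:.0f} seconds left")
```

Running this check shortly before and after each rotation would show whether there is a window in which the old certificate has expired but the new one is not yet in place.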
@dlg99 I use TLSv1.2.
@michaeljmarshall there are 3 bookies and 2 brokers in my test environment. Before the problem happened, all 3 bookies were down. After all the bookies recovered, one broker's thread was stuck, and the other one was normal.
Before the problem happened, all 3 bookies were down
This would explain failTLS in the stack :) Why are they all going down?
Will the thread unblock after some time? (Some timeouts can be in the range of 30 seconds or so by default.)
Did the bookie IP address change after the bookie came back up? The JVM caches DNS lookups, etc.
The bookie client (Pulsar broker side) should log what it is doing: trying to connect, failing, certificate errors, etc. I recommend checking the log (at least to make sure that the bookie client logs there). With that, please attach a full log from the client side and from the bookies, and a full thread dump.
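Once a full thread dump is captured (for example with `jstack`), a small script can pick out the threads in the BLOCKED state together with the lock lines they report. This is a rough sketch that assumes the usual HotSpot text layout (a quoted thread name, a `java.lang.Thread.State:` line, then `at ...` and `- ...` lines). The thread names in the sample come from this thread, but the frames and lock addresses are invented placeholders.

```python
import re

THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
THREAD_STATE = re.compile(r"java\.lang\.Thread\.State:\s+(?P<state>[A-Z_]+)")


def blocked_threads(dump: str) -> dict:
    """Map each BLOCKED thread name to its stack and lock lines, in dump order."""
    threads = []
    for line in dump.splitlines():
        header = THREAD_HEADER.match(line)
        if header:
            # A quoted name at the start of a line opens a new thread entry.
            threads.append({"name": header.group("name"), "state": None, "frames": []})
            continue
        if not threads:
            continue  # preamble before the first thread entry
        current = threads[-1]
        state = THREAD_STATE.search(line)
        if state:
            current["state"] = state.group("state")
        elif line.strip().startswith(("at ", "- ")):
            current["frames"].append(line.strip())
    return {t["name"]: t["frames"] for t in threads if t["state"] == "BLOCKED"}


# Hypothetical dump excerpt: frames and addresses are placeholders.
SAMPLE_DUMP = '''
"BookKeeperClientWorker-OrderedExecutor-0-0" #45 prio=5 tid=0x01 nid=0x02 waiting for monitor entry
   java.lang.Thread.State: BLOCKED (on object monitor)
	at hypothetical.Client.callback(Client.java)
	- waiting to lock <0x00000000deadbeef> (a java.lang.Object)

"pulsar-io-4-7" #46 prio=5 tid=0x03 nid=0x04 runnable
   java.lang.Thread.State: RUNNABLE
	at hypothetical.EventLoop.run(EventLoop.java)
'''

if __name__ == "__main__":
    for name, frames in blocked_threads(SAMPLE_DUMP).items():
        print(name)
        for frame in frames:
            print("   ", frame)
```

The `- waiting to lock <...>` line of a blocked thread can then be matched against the `- locked <...>` line of another thread in the same dump to see which thread is holding the monitor.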
The issue had no activity for 30 days, mark with Stale label.
Describe the bug
We found many timeout logs when creating a producer or sending a message. For example, we create a new topic and then build a producer to send a message:
We checked the stack trace of the ledger creation and found that when ledger creation completes, it can't get the BookKeeperClientWorker thread to do the callback's work.
We dumped the thread stacks, and BookKeeperClientWorker-OrderedExecutor-0-0 is always BLOCKED.
To Reproduce
I don't know how to reproduce it.
Additional context
Version: 2.9.1