Hello! I'd like to ask about 2 issues that I've faced:
Question 1:
We have 3 servers, each hosted Pulsar Broker, BK and ZK as a cluster. And for unknown reason, the Broker runs for a minute and stopped due to timeout error. And we notice that there are warnings from broker.log in Server1:
14:24:52.311 [pulsar-client-io-48-1] WARN org.apache.pulsar.client.impl.ConnectionPool - Failed to open connection to Server2:6650 : io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: Server2/X.X.X.2:6650
14:24:52.311 [pulsar-client-io-48-1] WARN org.apache.pulsar.client.impl.ConnectionHandler - [persistent://public/flink_u/__change_events] [reader-1a857264a8] Error connecting to broker: org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.CompletionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: Server2/X.X.X.2:6650
14:24:52.311 [pulsar-client-io-48-1] WARN org.apache.pulsar.client.impl.PulsarClientImpl - [persistent://public/flink_u/__change_events] Failed to get create topic reader
java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.CompletionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: Server2/X.X.X.2:6650
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_312]
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_312]
at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:714) ~[?:1.8.0_312]
at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:701) ~[?:1.8.0_312]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_312]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_312]
at org.apache.pulsar.client.impl.ConsumerImpl.connectionFailed(ConsumerImpl.java:994) ~[org.apache.pulsar-pulsar-client-original-2.7.0.jar:2.7.0]
at org.apache.pulsar.client.impl.ConnectionHandler.handleConnectionError(ConnectionHandler.java:82) ~[org.apache.pulsar-pulsar-client-original-2.7.0.jar:2.7.0]
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_312]
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) ~[?:1.8.0_312]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_312]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_312]
at org.apache.pulsar.client.impl.ConnectionPool.lambda$null$9(ConnectionPool.java:217) ~[org.apache.pulsar-pulsar-client-original-2.7.0.jar:2.7.0]
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) [io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312]
Caused by: org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.CompletionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: Server2/X.X.X.2:6650
... 8 more
Caused by: java.util.concurrent.CompletionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: Server2/X.X.X.2:6650
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_312]
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_312]
at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:661) ~[?:1.8.0_312]
at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646) ~[?:1.8.0_312]
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_312]
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_312]
at org.apache.pulsar.common.util.netty.ChannelFutures.lambda$toCompletableFuture$0(ChannelFutures.java:59) ~[org.apache.pulsar-pulsar-common-2.7.0.jar:2.7.0]
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:636) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:655) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: Server2/X.X.X.2:6650
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124) ~[io.netty-netty-transport-native-unix-common-4.1.51.Final.jar:4.1.51.Final]
at io.netty.channel.unix.Socket.finishConnect(Socket.java:243) ~[io.netty-netty-transport-native-unix-common-4.1.51.Final.jar:4.1.51.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:672) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:649) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
... 4 more
We know that this is due to the Broker in Server2 unexpectedly restarted, but this also happens on Server1 and Server3 as well. We just couldn't find which part is the one we need to investigate.Note that the ZK cluster are working fine, and BK shows a lot of these warnings and errors in bookie.log:
10:57:09.890 [BookieReadThreadPool-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.proto.ReadLacProcessorV3 - No ledger found while performing readLac from ledger: 76193704
org.apache.bookkeeper.bookie.Bookie$NoLedgerException: Ledger 76193704 not found
at org.apache.bookkeeper.bookie.LedgerDescriptor.createReadOnly(LedgerDescriptor.java:52) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.bookie.HandleFactoryImpl.getReadOnlyHandle(HandleFactoryImpl.java:61) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.bookie.Bookie.getExplicitLac(Bookie.java:1364) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.proto.ReadLacProcessorV3.getReadLacResponse(ReadLacProcessorV3.java:71) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.proto.ReadLacProcessorV3.safeRun(ReadLacProcessorV3.java:118) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_312]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_312]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312]
Question 2:
In addition, not sure whether is the right place to ask, but is there any way to rebuild or recover the state of bookkeeper cluster in ZK after we ran bin/bookkeeper shell metaformat?Currently our broker constantly shown this error below in broker.log:
13:39:24.350 [bookkeeper-ml-workers-OrderedExecutor-7-0] ERROR org.apache.bookkeeper.client.LedgerCreateOp - Not enough bookies to create ledger
In addition, when attempting to create a topic, this error happens in broker.log as well:
13:39:24.350 [bookkeeper-ml-workers-OrderedExecutor-7-0] WARN org.apache.pulsar.broker.service.BrokerService - Failed to create topic persistent://public/data
org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available
I've ran bin/bookkeeper shell listfilesondisc -journal and bin/bookkeeper shell listfilesondisc -entryLog, which I'm not sure whether I can recover back to previous state based on the available journal and entryLog files.Appreciate if anyone can help out. Many thanks! :smile:
PS: This is the first time I face this issue in ~8 months using Pulsar. Unfortunately, I cannot reproduce it myself in my staging environment.
Hello! I'd like to ask about 2 issues that I've faced:
Question 1:
We have 3 servers, each hosted Pulsar Broker, BK and ZK as a cluster. And for unknown reason, the Broker runs for a minute and stopped due to timeout error. And we notice that there are warnings from broker.log in Server1:
We know that this is due to the Broker in Server2 unexpectedly restarted, but this also happens on Server1 and Server3 as well. We just couldn't find which part is the one we need to investigate.Note that the ZK cluster are working fine, and BK shows a lot of these warnings and errors in bookie.log:
= OR =
Question 2:
In addition, not sure whether is the right place to ask, but is there any way to rebuild or recover the state of bookkeeper cluster in ZK after we ran
bin/bookkeeper shell metaformat
?Currently our broker constantly shown this error below in broker.log:In addition, when attempting to create a topic, this error happens in broker.log as well:
I've ran
bin/bookkeeper shell listfilesondisc -journal
andbin/bookkeeper shell listfilesondisc -entryLog
, which I'm not sure whether I can recover back to previous state based on the available journal and entryLog files.Appreciate if anyone can help out. Many thanks! :smile:PS: This is the first time I face this issue in ~8 months using Pulsar. Unfortunately, I cannot reproduce it myself in my staging environment.