apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0
14.29k stars 3.59k forks source link

Pulsar Bookie dies unexpectedly #15258

Open Raven888888 opened 2 years ago

Raven888888 commented 2 years ago

Hello! I'd like to ask about 2 issues that I've faced:

Question 1:

We have 3 servers, each hosted Pulsar Broker, BK and ZK as a cluster. And for unknown reason, the Broker runs for a minute and stopped due to timeout error. And we notice that there are warnings from broker.log in Server1:

14:24:52.311 [pulsar-client-io-48-1] WARN  org.apache.pulsar.client.impl.ConnectionPool - Failed to open connection to Server2:6650 : io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: Server2/X.X.X.2:6650
14:24:52.311 [pulsar-client-io-48-1] WARN  org.apache.pulsar.client.impl.ConnectionHandler - [persistent://public/flink_u/__change_events] [reader-1a857264a8] Error connecting to broker: org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.CompletionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: Server2/X.X.X.2:6650
14:24:52.311 [pulsar-client-io-48-1] WARN  org.apache.pulsar.client.impl.PulsarClientImpl - [persistent://public/flink_u/__change_events] Failed to get create topic reader
java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.CompletionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: Server2/X.X.X.2:6650
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_312]
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_312]
        at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:714) ~[?:1.8.0_312]
        at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:701) ~[?:1.8.0_312]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_312]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_312]
        at org.apache.pulsar.client.impl.ConsumerImpl.connectionFailed(ConsumerImpl.java:994) ~[org.apache.pulsar-pulsar-client-original-2.7.0.jar:2.7.0]
        at org.apache.pulsar.client.impl.ConnectionHandler.handleConnectionError(ConnectionHandler.java:82) ~[org.apache.pulsar-pulsar-client-original-2.7.0.jar:2.7.0]
        at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_312]
        at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) ~[?:1.8.0_312]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_312]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_312]
        at org.apache.pulsar.client.impl.ConnectionPool.lambda$null$9(ConnectionPool.java:217) ~[org.apache.pulsar-pulsar-client-original-2.7.0.jar:2.7.0]
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384) [io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312]
Caused by: org.apache.pulsar.client.api.PulsarClientException: java.util.concurrent.CompletionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: Server2/X.X.X.2:6650
        ... 8 more
Caused by: java.util.concurrent.CompletionException: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: Server2/X.X.X.2:6650
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_312]
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_312]
        at java.util.concurrent.CompletableFuture.uniAccept(CompletableFuture.java:661) ~[?:1.8.0_312]
        at java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:646) ~[?:1.8.0_312]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_312]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_312]
        at org.apache.pulsar.common.util.netty.ChannelFutures.lambda$toCompletableFuture$0(ChannelFutures.java:59) ~[org.apache.pulsar-pulsar-common-2.7.0.jar:2.7.0]
        at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577) ~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570) ~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549) ~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
        at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490) ~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
        at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615) ~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
        at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608) ~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
        at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.fulfillConnectPromise(AbstractEpollChannel.java:636) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:655) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
        ... 4 more
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: Server2/X.X.X.2:6650
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
        at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124) ~[io.netty-netty-transport-native-unix-common-4.1.51.Final.jar:4.1.51.Final]
        at io.netty.channel.unix.Socket.finishConnect(Socket.java:243) ~[io.netty-netty-transport-native-unix-common-4.1.51.Final.jar:4.1.51.Final]
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:672) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:649) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
        at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:529) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:465) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) ~[io.netty-netty-transport-native-epoll-4.1.51.Final-linux-x86_64.jar:4.1.51.Final]
        ... 4 more

We know that this is due to the Broker in Server2 unexpectedly restarted, but this also happens on Server1 and Server3 as well. We just couldn't find which part is the one we need to investigate.Note that the ZK cluster are working fine, and BK shows a lot of these warnings and errors in bookie.log:

10:57:40.415 [BookieHighPriorityThread-3181-OrderedExecutor-3-0] WARN  org.apache.bookkeeper.proto.ReadEntryProcessor - Ledger: 112564971  fenced by: /X.X.X.1:39188

= OR =

10:57:09.890 [BookieReadThreadPool-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.proto.ReadLacProcessorV3 - No ledger found while performing readLac from ledger: 76193704
org.apache.bookkeeper.bookie.Bookie$NoLedgerException: Ledger 76193704 not found
        at org.apache.bookkeeper.bookie.LedgerDescriptor.createReadOnly(LedgerDescriptor.java:52) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.bookie.HandleFactoryImpl.getReadOnlyHandle(HandleFactoryImpl.java:61) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.bookie.Bookie.getExplicitLac(Bookie.java:1364) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.proto.ReadLacProcessorV3.getReadLacResponse(ReadLacProcessorV3.java:71) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.proto.ReadLacProcessorV3.safeRun(ReadLacProcessorV3.java:118) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
        at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_312]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_312]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_312]

Question 2:

In addition, not sure whether is the right place to ask, but is there any way to rebuild or recover the state of bookkeeper cluster in ZK after we ran bin/bookkeeper shell metaformat?Currently our broker constantly shown this error below in broker.log:

13:39:24.350 [bookkeeper-ml-workers-OrderedExecutor-7-0] ERROR org.apache.bookkeeper.client.LedgerCreateOp - Not enough bookies to create ledger

In addition, when attempting to create a topic, this error happens in broker.log as well:

13:39:24.350 [bookkeeper-ml-workers-OrderedExecutor-7-0] WARN  org.apache.pulsar.broker.service.BrokerService - Failed to create topic persistent://public/data
org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available

I've ran bin/bookkeeper shell listfilesondisc -journal and bin/bookkeeper shell listfilesondisc -entryLog, which I'm not sure whether I can recover back to previous state based on the available journal and entryLog files.Appreciate if anyone can help out. Many thanks! :smile:

PS: This is the first time I face this issue in ~8 months using Pulsar. Unfortunately, I cannot reproduce it myself in my staging environment.

github-actions[bot] commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.

github-actions[bot] commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.