apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0
14.24k stars 3.59k forks source link

Not enough non-faulty bookies available #13167

Open yunfeng79 opened 2 years ago

yunfeng79 commented 2 years ago

apache-pulsar-2.8.0

three pulsar cluster

second pulsar messages:

16:33:15.772 [BookKeeperClientWorker-OrderedExecutor-0-0] WARN org.apache.bookkeeper.client.PendingAddOp - Failed to write entry (135, 0): Bookie handle is not available 16:33:15.801 [BookKeeperClientWorker-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.client.MetadataUpdateLoop - UpdateLoop(ledgerId=135,loopId=413ab672) Exception updating org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available at org.apache.bookkeeper.bookie.LocalBookieEnsemblePlacementPolicy.replaceBookie(LocalBookieEnsemblePlacementPolicy.java:89) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.client.BookieWatcherImpl.replaceBookie(BookieWatcherImpl.java:334) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.client.EnsembleUtils.replaceBookiesInEnsemble(EnsembleUtils.java:71) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.client.LedgerHandle.lambda$ensembleChangeLoop$2(LedgerHandle.java:1935) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.client.MetadataUpdateLoop.writeLoop(MetadataUpdateLoop.java:134) [org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.client.MetadataUpdateLoop.run(MetadataUpdateLoop.java:123) [org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.client.LedgerHandle.ensembleChangeLoop(LedgerHandle.java:1954) [org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.client.LedgerHandle.handleBookieFailure(LedgerHandle.java:1903) [org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.client.PendingAddOp.writeComplete(PendingAddOp.java:378) [org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.proto.BookieClientImpl$1.safeRun(BookieClientImpl.java:295) [org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.14.1.jar:4.14.1] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_231] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_231] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_231] 16:33:15.802 [BookKeeperClientWorker-OrderedExecutor-0-0] WARN org.apache.bookkeeper.client.LedgerHandle - [EnsembleChange(ledger:135, change-id:0000000001)][attempt:1] Exception changing ensemble org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available at org.apache.bookkeeper.bookie.LocalBookieEnsemblePlacementPolicy.replaceBookie(LocalBookieEnsemblePlacementPolicy.java:89) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.client.BookieWatcherImpl.replaceBookie(BookieWatcherImpl.java:334) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.client.EnsembleUtils.replaceBookiesInEnsemble(EnsembleUtils.java:71) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.client.LedgerHandle.lambda$ensembleChangeLoop$2(LedgerHandle.java:1935) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.client.MetadataUpdateLoop.writeLoop(MetadataUpdateLoop.java:134) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.client.MetadataUpdateLoop.run(MetadataUpdateLoop.java:123) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.client.LedgerHandle.ensembleChangeLoop(LedgerHandle.java:1954) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.client.LedgerHandle.handleBookieFailure(LedgerHandle.java:1903) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.client.PendingAddOp.writeComplete(PendingAddOp.java:378) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.proto.BookieClientImpl$1.safeRun(BookieClientImpl.java:295) ~[org.apache.bookkeeper-bookkeeper-server-4.14.1.jar:4.14.1] at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) ~[org.apache.bookkeeper-bookkeeper-common-4.14.1.jar:4.14.1] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_231] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_231] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_231]

ericsyh commented 2 years ago

@yunfeng79 Not enough non-faulty bk means your alive bk node number is less than broker‘s managedLedgerDefaultEnsembleSize config request https://github.com/apache/pulsar/blob/master/conf/broker.conf#L902. You can take a check on it.

sicklife commented 2 years ago

I got the same error, my broker config is managedLedgerDefaultEnsembleSize=2

what should I do to fix this error?

ericsyh commented 2 years ago

Pls run the command bin/bookkeeper shell listbookies -rw -h to get the list the ReadWrite Bookies. And check that all broker configs keep consistent managedLedgerDefaultEnsembleSize, managedLedgerDefaultWriteQuorum, managedLedgerDefaultAckQuorum.

Then check that ReadWrite Bookies number >= managedLedgerDefaultEnsembleSize >= managedLedgerDefaultWriteQuorum >= managedLedgerDefaultAckQuorum.

yunfeng79 commented 2 years ago

@ericsyh check that ReadWrite Bookies number is same,Still have the same error?

yunfeng79 commented 2 years ago

java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$BrokerPersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: Not enough non-faulty bookies available at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_301] at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_301] at java.util.concurrent.CompletableFuture.uniRun(CompletableFuture.java:714) ~[?:1.8.0_301] at java.util.concurrent.CompletableFuture$UniRun.tryFire(CompletableFuture.java:701) ~[?:1.8.0_301] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) [?:1.8.0_301] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) [?:1.8.0_301] at org.apache.pulsar.client.impl.ClientCnx.handleError(ClientCnx.java:691) [org.apache.pulsar-pulsar-client-original-2.8.0.jar:2.8.0] at org.apache.pulsar.common.protocol.PulsarDecoder.channelRead(PulsarDecoder.java:176) [org.apache.pulsar-pulsar-common-2.8.0.jar:2.8.0] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:324) [io.netty-netty-codec-4.1.63.Final.jar:4.1.63.Final] at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:296) [io.netty-netty-codec-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) [io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) [io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) [io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) [io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) [io.netty-netty-transport-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795) [io.netty-netty-transport-native-epoll-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480) [io.netty-netty-transport-native-epoll-4.1.63.Final.jar:4.1.63.Final] at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) [io.netty-netty-transport-native-epoll-4.1.63.Final.jar:4.1.63.Final] at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) [io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final] at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.63.Final.jar:4.1.63.Final]

github-actions[bot] commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.

Hongten commented 2 years ago

I encountered the same issue when I run bin/bookkeeper shell bookiesanity. I'm using the pulsar-2.10.0.

2022-05-02T12:44:16,501+0800 [main] INFO  org.apache.bookkeeper.tools.cli.commands.bookie.SanityTestCommand - Create ledger 5
2022-05-02T12:44:16,656+0800 [bookkeeper-io-3-1] ERROR org.apache.bookkeeper.proto.PerChannelBookieClient - Could not connect to bookie: [id: 0x5a822f2d, L:/host:59178]/host:3181, current state CONNECTING :
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /host:3181
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
    at io.netty.channel.unix.Errors.newConnectException0(Errors.java:155) ~[io.netty-netty-transport-native-unix-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.unix.Errors.handleConnectErrno(Errors.java:128) ~[io.netty-netty-transport-native-unix-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.unix.Socket.finishConnect(Socket.java:320) ~[io.netty-netty-transport-native-unix-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:710) [io.netty-netty-transport-classes-epoll-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:687) [io.netty-netty-transport-classes-epoll-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:567) [io.netty-netty-transport-classes-epoll-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470) [io.netty-netty-transport-classes-epoll-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378) [io.netty-netty-transport-classes-epoll-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) [io.netty-netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [io.netty-netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.74.Final.jar:4.1.74.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
2022-05-02T12:44:16,677+0800 [BookKeeperClientWorker-OrderedExecutor-5-0] WARN  org.apache.bookkeeper.client.PendingAddOp - Failed to write entry (5, 0): Bookie handle is not available
2022-05-02T12:44:16,691+0800 [BookKeeperClientWorker-OrderedExecutor-5-0] ERROR org.apache.bookkeeper.client.MetadataUpdateLoop - UpdateLoop(ledgerId=5,loopId=567e98f4) Exception updating
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
    at org.apache.bookkeeper.bookie.LocalBookieEnsemblePlacementPolicy.replaceBookie(LocalBookieEnsemblePlacementPolicy.java:89) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.BookieWatcherImpl.replaceBookie(BookieWatcherImpl.java:334) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.EnsembleUtils.replaceBookiesInEnsemble(EnsembleUtils.java:71) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.LedgerHandle.lambda$ensembleChangeLoop$2(LedgerHandle.java:1935) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.MetadataUpdateLoop.writeLoop(MetadataUpdateLoop.java:134) [org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.MetadataUpdateLoop.run(MetadataUpdateLoop.java:123) [org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.LedgerHandle.ensembleChangeLoop(LedgerHandle.java:1954) [org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.LedgerHandle.handleBookieFailure(LedgerHandle.java:1903) [org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.PendingAddOp.writeComplete(PendingAddOp.java:378) [org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.proto.BookieClientImpl$1.safeRun(BookieClientImpl.java:295) [org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.74.Final.jar:4.1.74.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
2022-05-02T12:44:16,693+0800 [BookKeeperClientWorker-OrderedExecutor-5-0] WARN  org.apache.bookkeeper.client.LedgerHandle - [EnsembleChange(ledger:5, change-id:0000000001)][attempt:1] Exception changing ensemble
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
    at org.apache.bookkeeper.bookie.LocalBookieEnsemblePlacementPolicy.replaceBookie(LocalBookieEnsemblePlacementPolicy.java:89) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.BookieWatcherImpl.replaceBookie(BookieWatcherImpl.java:334) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.EnsembleUtils.replaceBookiesInEnsemble(EnsembleUtils.java:71) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.LedgerHandle.lambda$ensembleChangeLoop$2(LedgerHandle.java:1935) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.MetadataUpdateLoop.writeLoop(MetadataUpdateLoop.java:134) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.MetadataUpdateLoop.run(MetadataUpdateLoop.java:123) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.LedgerHandle.ensembleChangeLoop(LedgerHandle.java:1954) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.LedgerHandle.handleBookieFailure(LedgerHandle.java:1903) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.PendingAddOp.writeComplete(PendingAddOp.java:378) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.proto.BookieClientImpl$1.safeRun(BookieClientImpl.java:295) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.74.Final.jar:4.1.74.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
2022-05-02T12:44:16,693+0800 [BookKeeperClientWorker-OrderedExecutor-5-0] ERROR org.apache.bookkeeper.client.LedgerHandle - Closing ledger 5 due to NotEnoughBookiesException: Not enough non-faulty bookies available
2022-05-02T12:44:16,696+0800 [BookKeeperClientWorker-OrderedExecutor-5-0] ERROR org.apache.bookkeeper.client.PendingAddOp - Write of ledger entry to quorum failed: L5 E0
2022-05-02T12:44:16,696+0800 [main] WARN  org.apache.bookkeeper.tools.cli.commands.bookie.SanityTestCommand - Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
    at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:92) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:431) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1799) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:574) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) ~[org.apache.bookkeeper-bookkeeper-common-4.14.4.jar:4.14.4]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_262]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_262]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[io.netty-netty-common-4.1.74.Final.jar:4.1.74.Final]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_262]
2022-05-02T12:44:16,704+0800 [main-EventThread] WARN  org.apache.bookkeeper.meta.AbstractZkLedgerManager - Ledger node does not exist in ZooKeeper: ledgerId=5
2022-05-02T12:44:16,704+0800 [main-EventThread] ERROR org.apache.bookkeeper.client.MetadataUpdateLoop - UpdateLoop(ledgerId=5,loopId=48132b31) Error writing metadata to store
org.apache.bookkeeper.client.BKException$BKNoSuchLedgerExistsOnMetadataServerException: No such ledger exists on Metadata Server
    at org.apache.bookkeeper.meta.AbstractZkLedgerManager$4.processResult(AbstractZkLedgerManager.java:505) [org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.zookeeper.ZooKeeperClient$22$1.processResult(ZooKeeperClient.java:1094) [org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:638) [org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
    at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:563) [org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
2022-05-02T12:44:16,704+0800 [main-EventThread] WARN  org.apache.bookkeeper.client.LedgerHandle - Close failed: NoSuchLedgerExistsOnMetadataServerException: No such ledger exists on Metadata Server
2022-05-02T12:44:16,709+0800 [main] INFO  org.apache.bookkeeper.tools.cli.commands.bookie.SanityTestCommand - Deleted ledger 5
2022-05-02T12:44:16,710+0800 [main] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Closing the per channel bookie client for host:3181
2022-05-02T12:44:16,711+0800 [main] INFO  org.apache.bookkeeper.proto.PerChannelBookieClient - Closing the per channel bookie client for host:3181
2022-05-02T12:44:16,836+0800 [main] INFO  org.apache.zookeeper.ZooKeeper - Session: 0x30c3eef1b130009 closed
2022-05-02T12:44:16,836+0800 [main-EventThread] INFO  org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x30c3eef1b130009
github-actions[bot] commented 2 years ago

The issue had no activity for 30 days, mark with Stale label.

sicklife commented 2 years ago

any update ? still stuck with this~

fmiguelez commented 2 years ago

Same problem here. We are unable to restart our cluster. Brokers keep on crashing all the time due to this issue.

Any workaround?

fmiguelez commented 2 years ago

Pls run the command bin/bookkeeper shell listbookies -rw -h to get the list the ReadWrite Bookies. And check that all broker configs keep consistent managedLedgerDefaultEnsembleSize, managedLedgerDefaultWriteQuorum, managedLedgerDefaultAckQuorum.

Then check that ReadWrite Bookies number >= managedLedgerDefaultEnsembleSize >= managedLedgerDefaultWriteQuorum >= managedLedgerDefaultAckQuorum.

When I issue that command from our toolset pod I get the following:

08:54:48.798 [main] INFO  org.apache.bookkeeper.tools.cli.commands.bookies.ListBookiesCommand - ReadWrite Bookies :
08:54:48.812 [main] INFO  org.apache.bookkeeper.tools.cli.commands.bookies.ListBookiesCommand - BookieID:rev-pulsar-bookie-2.rev-pulsar-bookie.mdp-dgt-review-check-data-ocdup2.svc.cluster.local:3181, IP:10.72.168.230, Port:3181, Hostname:rev-pulsar-bookie-2.rev-pulsar-bookie.mdp-dgt-review-check-data-ocdup2.svc.cluster.local
08:54:48.816 [main-SendThread(rev-pulsar-zookeeper:2181)] WARN  org.apache.zookeeper.ClientCnxn - An exception was thrown while closing send thread for session 0x10053b99a7100f5.
org.apache.zookeeper.ClientCnxn$EndOfStreamException: Unable to read additional data from server sessionid 0x10053b99a7100f5, likely server has closed socket
        at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:77) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
        at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350) ~[org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
        at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1290) [org.apache.zookeeper-zookeeper-3.6.3.jar:3.6.3]
08:54:48.933 [main] INFO  org.apache.zookeeper.ZooKeeper - Session: 0x10053b99a7100f5 closed
08:54:48.933 [main-EventThread] INFO  org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x10053b99a7100f5

@ericsyh what do you suggest?

ericsyh commented 2 years ago

@fmiguelez Your bookie replica should be more than 3 but the returned readwrite bookie is only rev-pulsar-bookie-2.

Is the bookie pod rev-pulsar-bookie-0 and rev-pulsar-bookie-1 still running? You can check their disk free status by https://bookkeeper.apache.org/docs/admin/http#endpoint-apiv1bookieinfo

In the toolset pod, you can try curl rev-pulsar-bookie-0.rev-pulsar-bookie.mdp-dgt-review-check-data-ocdup2.svc.cluster.local:8000/api/v1/bookie/info

fmiguelez commented 2 years ago

@fmiguelez Your bookie replica should be more than 3 but the returned readwrite bookie is only rev-pulsar-bookie-2.

Is the bookie pod rev-pulsar-bookie-0 and rev-pulsar-bookie-1 still running? You can check their disk free status by https://bookkeeper.apache.org/docs/admin/http#endpoint-apiv1bookieinfo

In the toolset pod, you can try curl rev-pulsar-bookie-0.rev-pulsar-bookie.mdp-dgt-review-check-data-ocdup2.svc.cluster.local:8000/api/v1/bookie/info

It was actually a matter of disk space. Bookie logs reported disk usage beyond threshold, so they were set in read-only mode.

    fer@N315D154:~/ws/mdp/mdp-demo$ kubectl -n mdp-dgt-review-check-data-ocdup2 logs --tail 2 rev-pulsar-bookie-0
    09:05:58.784 [LedgerDirsMonitorThread] ERROR org.apache.bookkeeper.util.DiskChecker - Space left on device /pulsar/data/bookkeeper/ledgers/current : 2447400960, Used space fraction: 0.953402 > threshold 0.95.
    09:05:58.826 [SyncThread-7-1] INFO  org.apache.bookkeeper.bookie.LedgerDirsManager - No writable ledger dirs below diskUsageThreshold. But Dirs that can accommodate 1073741824 are: [/pulsar/data/bookkeeper/ledgers/current]
    fer@N315D154:~/ws/mdp/mdp-demo$ kubectl -n mdp-dgt-review-check-data-ocdup2 logs --tail 2 rev-pulsar-bookie-1
    09:05:57.685 [LedgerDirsMonitorThread] ERROR org.apache.bookkeeper.util.DiskChecker - Space left on device /pulsar/data/bookkeeper/ledgers/current : 2414260224, Used space fraction: 0.95403296 > threshold 0.95.
    09:05:57.734 [SyncThread-7-1] INFO  org.apache.bookkeeper.bookie.LedgerDirsManager - No writable ledger dirs below diskUsageThreshold. But Dirs that can accommodate 1073741824 are: [/pulsar/data/bookkeeper/ledgers/current]
    fer@N315D154:~/ws/mdp/mdp-demo$ kubectl -n mdp-dgt-review-check-data-ocdup2 logs --tail 2 rev-pulsar-bookie-2
    08:44:00.631 [bookie-io-1-1] INFO  org.apache.bookkeeper.proto.BookieRequestHandler - Channels disconnected: [id: 0x2b678089, L:/10.72.168.230:3181 ! R:/10.72.168.138:39958]
    08:56:59.187 [GarbageCollectorThread-11-1] INFO  org.apache.bookkeeper.bookie.GarbageCollectorThread - Disk almost full, suspend major compaction to slow down filling disk.
    fer@N315D154:~/ws/mdp/mdp-demo$ kubectl -n mdp-dgt-review-check-data-ocdup2 logs --tail 2 rev-pulsar-bookie-3
    09:07:18.867 [LedgerDirsMonitorThread] WARN  org.apache.bookkeeper.bookie.LedgerDirsMonitor - LedgerDirsMonitor check process: All ledger directories are non writable
    09:07:18.868 [LedgerDirsMonitorThread] ERROR org.apache.bookkeeper.util.DiskChecker - Space left on device /pulsar/data/bookkeeper/ledgers/current : 2292150272, Used space fraction: 0.9563579 > threshold 0.95.  

I modified the bookie configmap to set a threshold of 0.98.

fer@N315D154:~$ kubectl -n mdp-dgt-review-check-data-ocdup2 get configmaps rev-pulsar-bookie -o yaml > rev-pulsar-bookie-cm.yaml Edit pulsar-bookie-cm.yml and add following property: diskUsageThreshold: "0.98" And then applied changes: fer@N315D154:~$ kubectl -n mdp-dgt-review-check-data-ocdup2 apply -f rev-pulsar-bookie-cm.yaml

For the bookie pods to get the change I deleted manually one by one the bookie replica pods. After that broker was able to start.

However this situation lasted not long as eventually two of the bookies started to occupy all the space.

I analyzed all the topics stats and actually some of them returned error:

    root@rev-pulsar-toolset-0:/pulsar/bin# for i in $(./pulsar-admin namespaces list dbus); do topics=$(./pulsar-admin topics list $i); for topic in $topics; do echo $topic; ./pulsar-admin topics stats $topic | grep storageSize; done; done
    "persistent://dbus/dgt-measure-point-15min-local/traffic-data-events"
      "storageSize" : 43968,
    "persistent://dbus/dgt-measure-point-15min-local/status"
    HTTP 500 Server Error

    Reason: HTTP 500 Server Error
    "persistent://dbus/dgt-time-series-point/traffic-data-events"
      "storageSize" : 2732639642,
    "persistent://dbus/dgt-time-series-point/status"
    HTTP 500 Server Error

    Reason: HTTP 500 Server Error
    "persistent://dbus/test-ltp-config/variables"
      "storageSize" : 0,

Every bookie replica pod has a volume of 50 GiB for ledgers and 10 GiB for journals. The volumen being filled up was the one of the ledgers. As you can see only one topic has a considerable size (2.5 GiB), which does not make any sense.

There were no producers producing any more data at the moment of the error.

My suspicion is that the compaction process (enabled for all topics and set to 10 MB threshold) is the one filling up disk (probably with temporary data copy). It can be a problem as biggest topic does not have any repeating key (compaction does not make any sense in this case). I have observed logs like the following

11:19:53.242 [GarbageCollectorThread-11-1] INFO  org.apache.bookkeeper.bookie.GarbageCollectorThread - Disk almost full, suspend major compaction to slow down filling disk.
11:19:53.242 [GarbageCollectorThread-11-1] INFO  org.apache.bookkeeper.bookie.GarbageCollectorThread - Enter minor compaction, suspendMinor false
11:19:53.242 [GarbageCollectorThread-11-1] INFO  org.apache.bookkeeper.bookie.GarbageCollectorThread - Do compaction to compact those files lower than 0.2

I have removed my namespace and redeployed two versions of the services:

Will check on Monday differences between both environments.

cellcb commented 2 years ago

I encountered the same issue when I run bin/bookkeeper shell bookiesanity. I'm using the pulsar-2.10.0.

I had the same problem

moweonlee commented 2 years ago

@fmiguelez

I'm also trying to find the solution.

Could you please update anything of your finding?

What I have done is configuring these two setting in bookie's configurations.

image image

moweonlee commented 2 years ago

@fmiguelez

In my case, What frankjkelly said works for me.

I think what he had done is to minimize Compaction triggering interval for more aggressive GC.

https://github.com/apache/pulsar/issues/6639

fmiguelez commented 2 years ago

@moweonlee I have reviewed my notes of that day. The result was the same, if compaction was enabled at namespace level, bookies unreadable due to disk full.

In our systems we usually have compaction enabled globally but allow disabling it at topic level. We were surprised to check that such feature does not work anymore (I would swear to have seen it working in previous versions). We reported the issue in #15891, as such error seems to be related when trying to configure compaction at topic level.

The only solution we have found is to disable compaction at namespace level when creating it. In that case bookies disk usage does not increase uncontrollably as with compaction.

I would say that compaction is not a feature that works flawlessly in Pulsar, though on paper it sounds great to be able to work with both compacted and non-compacted topics at the same time.

moweonlee commented 2 years ago

Yes, you're right @fmiguelez.

Same issue was reproduced after couple of days and I solved my issue just by reducing retention and TTL finally.

But I'd rather swear that the issue would be reproduced with longer interval than before.

Thank you for sharing your information about compaction.

arpillai89 commented 1 year ago

@moweonlee , could you please let me know what exact config/value did you update for retention and TTL to get this issue resolved.

moweonlee commented 1 year ago

@arpillai89

Here is the configuration that I used for my pulsar cluster.

I didn't think about how much storage volume is necessary to store all data incoming.

Now it is clear that my case was pretty simple.

So it will be good to see Grafana's storage volume chart first as long as Pulsar is deployed in K8s

./bin/pulsar-admin namespaces set-retention public/default --size 8G --time -1
./bin/pulsar-admin namespaces get-retention public/default 

./bin/pulsar-admin namespaces set-message-ttl public/default --messageTTL 60
./bin/pulsar-admin namespaces get-message-ttl public/defaul
hellboy0621 commented 1 year ago

Pls run the command bin/bookkeeper shell listbookies -rw -h to get the list the ReadWrite Bookies. And check that all broker configs keep consistent managedLedgerDefaultEnsembleSize, managedLedgerDefaultWriteQuorum, managedLedgerDefaultAckQuorum.

Then check that ReadWrite Bookies number >= managedLedgerDefaultEnsembleSize >= managedLedgerDefaultWriteQuorum >= managedLedgerDefaultAckQuorum.

I deploy my pulsar demo with 1 zk, 1 bookie and 1 broker. When I create the topic the same error occurs(Not enough non-faulty bookies available). Then I checked the 3 options that all the default value is 2. When I change all the values to 1, and then restart the broker, the problem solved. Thanks.

eaba commented 1 year ago

BROKER: webSocketServiceEnabled: "true" # Not enough non-faulty bookies available

ton3r commented 10 months ago

@eaba What does your comment mean?

Enable this? Disable this?

samet-karakose commented 10 months ago

I have some problem with pulsar:latest docker image