Open TakaHiR07 opened 2 years ago
We encountered the same problem. When the AutoRecovery service is started with Bookie, it will find that the current Ledger ID has been deleted, but does Auto Recovery have no ability to automatically skip these Ledgers, or has it been trying to recover the deleted Ledgers?
12:42:41.841 [main-EventThread] INFO org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Latch countdown due to ZK event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/ledgers/underreplication/locks
12:42:41.846 [BookieReadThreadPool-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.proto.ReadLacProcessorV3 - No ledger found while performing readLac from ledger: 8049631
org.apache.bookkeeper.bookie.Bookie$NoLedgerException: Ledger 8049631 not found
at org.apache.bookkeeper.bookie.LedgerDescriptor.createReadOnly(LedgerDescriptor.java:52) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.bookie.HandleFactoryImpl.getReadOnlyHandle(HandleFactoryImpl.java:61) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.bookie.Bookie.getExplicitLac(Bookie.java:1364) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.proto.ReadLacProcessorV3.getReadLacResponse(ReadLacProcessorV3.java:71) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.proto.ReadLacProcessorV3.safeRun(ReadLacProcessorV3.java:118) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_232]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_232]
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.67.Final.jar:4.1.67.Final]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_232]
12:42:41.846 [BookieReadThreadPool-OrderedExecutor-0-0] ERROR org.apache.bookkeeper.proto.ReadLacProcessorV3 - No ledger found while trying to read last entry: 8049631
org.apache.bookkeeper.bookie.Bookie$NoLedgerException: Ledger 8049631 not found
at org.apache.bookkeeper.bookie.LedgerDescriptor.createReadOnly(LedgerDescriptor.java:52) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.bookie.HandleFactoryImpl.getReadOnlyHandle(HandleFactoryImpl.java:61) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.bookie.Bookie.readEntry(Bookie.java:1441) ~[org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.proto.ReadLacProcessorV3.getReadLacResponse(ReadLacProcessorV3.java:86) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.proto.ReadLacProcessorV3.safeRun(ReadLacProcessorV3.java:118) [org.apache.bookkeeper-bookkeeper-server-4.12.0.jar:4.12.0]
at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36) [org.apache.bookkeeper-bookkeeper-common-4.12.0.jar:4.12.0]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_232]
We are using version 4.12 of Bookie
16:02:17.008 [main] INFO org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - ledgerID: 8049631
16:02:17.015 [main] INFO org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=4, writeQuorumSize=3, ackQuorumSize=2, state=OPEN, digestType=CRC32C, password=base64:, ensembles={0=[xxxx]}, customMetadata={component=base64:bWFuYWdlZC1sZWRnZXI=, pulsar/managed-ledger=base64:cHVsc2FxxxxGl0bG9nX2ZpbmRlci1wYXJ0aXRpb24tMjk=, application=base64:cHVsc2Fy}}
16:02:17.008 [main] INFO org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - ledgerID: 8049631 16:02:17.015 [main] INFO org.apache.bookkeeper.tools.cli.commands.client.LedgerMetaDataCommand - LedgerMetadata{formatVersion=3, ensembleSize=4, writeQuorumSize=3, ackQuorumSize=2, state=OPEN, digestType=CRC32C, password=base64:, ensembles={0=[xxxx]}, customMetadata={component=base64:bWFuYWdlZC1sZWRnZXI=, pulsar/managed-ledger=base64:cHVsc2FxxxxGl0bG9nX2ZpbmRlci1wYXJ0aXRpb24tMjk=, application=base64:cHVsc2Fy}}
My bookie version is 4.14.1 , and I finally solve it by
It seems bookie can skip unrecoverable ledger after pr-2870, proposed in bookie 4.14.4
BUG REPORT
Describe the bug
Our Production pulsar cluster is multiple nodes with E-Qw-Qa(3-3-2), enabling auto-recovery by "./bin/bookkeeper shell autorecovery -enable", bookkeeper version is 4.14.1 . Now one bookie server is down, and cluster do autoRecovery. However, there is a ledger can not read by the other 2 ensemble, the error is both : Ledger 1294 not found (It seems the ledger has been deleted)
But the ReplicationWorker still continue to try to rereplicate this ledger, and keep failed. According to the following log, it throw BKNotEnoughBookiesException, therefore ReplicationWorker#run would keep running, keep replicate a can-not-replicated ledger. The result is generating too much recovery read request to the other 2 ensemble bookie, affect the normal read request.
The zkmetadata has ledger 1294 under /ledgers/underreplication/ledgers
To Reproduce
Expected behavior
Should it skip those deleted ledger when doing recovery ?