apache / bookkeeper

Apache BookKeeper - a scalable, fault tolerant and low latency storage service optimized for append-only workloads
https://bookkeeper.apache.org/
Apache License 2.0
1.91k stars 903 forks source link

Bookie ledger replication error lead to increased disk usage #1908

Open codelipenghui opened 5 years ago

codelipenghui commented 5 years ago

BUG REPORT

Problem happens after reboot bookie(Bare metal), i can't remember a little bit about the detailed steps at the time.

Found that disk usage continued growth: image

Then check the bookie log, found some ledger replicate error log:

00:02:15.246 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L426399 E4139-E4139, Sent to [172.30.92.15:3181, 172.30.92.16:3181], Heard from [] : bitset = {}. First unread entry is -1
00:02:15.246 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.client.LedgerFragmentReplicator - BK error reading ledger entry: 4139
org.apache.bookkeeper.client.BKException$BKNoSuchEntryException: No such entry
    at org.apache.bookkeeper.client.BKException.create(BKException.java:84) ~[org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.client.LedgerFragmentReplicator$3.readComplete(LedgerFragmentReplicator.java:305) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.client.LedgerHandle$6.onFailure(LedgerHandle.java:797) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:38) [org.apache.bookkeeper-bookkeeper-common-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:26) [org.apache.bookkeeper-bookkeeper-common-4.7.2.jar:4.7.2]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) [?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) [?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-all-4.1.22.Final.jar:4.1.22.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
00:02:15.246 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.proto.BookkeeperInternalCallbacks - Error in multi callback : -13
00:02:15.259 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L426399 E4140-E4140, Sent to [172.30.92.15:3181, 172.30.92.16:3181], Heard from [] : bitset = {}. First unread entry is -1
00:02:15.259 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.client.LedgerFragmentReplicator - BK error reading ledger entry: 4140
org.apache.bookkeeper.client.BKException$BKNoSuchEntryException: No such entry
    at org.apache.bookkeeper.client.BKException.create(BKException.java:84) ~[org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.client.LedgerFragmentReplicator$3.readComplete(LedgerFragmentReplicator.java:305) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.client.LedgerHandle$6.onFailure(LedgerHandle.java:797) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:38) [org.apache.bookkeeper-bookkeeper-common-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:26) [org.apache.bookkeeper-bookkeeper-common-4.7.2.jar:4.7.2]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) [?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) [?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-all-4.1.22.Final.jar:4.1.22.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
00:02:15.259 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.proto.BookkeeperInternalCallbacks - Error in multi callback : -13
00:02:15.262 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L426399 E4141-E4141, Sent to [172.30.92.15:3181, 172.30.92.16:3181], Heard from [] : bitset = {}. First unread entry is -1
00:02:15.262 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.client.LedgerFragmentReplicator - BK error reading ledger entry: 4141
org.apache.bookkeeper.client.BKException$BKNoSuchEntryException: No such entry
    at org.apache.bookkeeper.client.BKException.create(BKException.java:84) ~[org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.client.LedgerFragmentReplicator$3.readComplete(LedgerFragmentReplicator.java:305) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.client.LedgerHandle$6.onFailure(LedgerHandle.java:797) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:38) [org.apache.bookkeeper-bookkeeper-common-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:26) [org.apache.bookkeeper-bookkeeper-common-4.7.2.jar:4.7.2]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) [?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) [?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-all-4.1.22.Final.jar:4.1.22.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
00:02:15.262 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.proto.BookkeeperInternalCallbacks - Error in multi callback : -13
00:02:15.269 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L426399 E4142-E4142, Sent to [172.30.92.15:3181, 172.30.92.16:3181], Heard from [] : bitset = {}. First unread entry is -1
00:02:15.269 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.client.LedgerFragmentReplicator - BK error reading ledger entry: 4142
org.apache.bookkeeper.client.BKException$BKNoSuchEntryException: No such entry
    at org.apache.bookkeeper.client.BKException.create(BKException.java:84) ~[org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.client.LedgerFragmentReplicator$3.readComplete(LedgerFragmentReplicator.java:305) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.client.LedgerHandle$6.onFailure(LedgerHandle.java:797) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:38) [org.apache.bookkeeper-bookkeeper-common-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:26) [org.apache.bookkeeper-bookkeeper-common-4.7.2.jar:4.7.2]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) [?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) [?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-all-4.1.22.Final.jar:4.1.22.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
00:02:15.269 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.proto.BookkeeperInternalCallbacks - Error in multi callback : -13
00:02:15.275 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L426399 E4143-E4143, Sent to [172.30.92.15:3181, 172.30.92.16:3181], Heard from [] : bitset = {}. First unread entry is -1
00:02:15.275 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.client.LedgerFragmentReplicator - BK error reading ledger entry: 4143
org.apache.bookkeeper.client.BKException$BKNoSuchEntryException: No such entry
    at org.apache.bookkeeper.client.BKException.create(BKException.java:84) ~[org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.client.LedgerFragmentReplicator$3.readComplete(LedgerFragmentReplicator.java:305) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.client.LedgerHandle$6.onFailure(LedgerHandle.java:797) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:38) [org.apache.bookkeeper-bookkeeper-common-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:26) [org.apache.bookkeeper-bookkeeper-common-4.7.2.jar:4.7.2]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) [?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) [?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-all-4.1.22.Final.jar:4.1.22.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
00:02:15.276 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.proto.BookkeeperInternalCallbacks - Error in multi callback : -13
00:02:15.280 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.client.PendingReadOp - Read of ledger entry failed: L426399 E4144-E4144, Sent to [172.30.92.15:3181, 172.30.92.16:3181], Heard from [] : bitset = {}. First unread entry is -1
00:02:15.280 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.client.LedgerFragmentReplicator - BK error reading ledger entry: 4144
org.apache.bookkeeper.client.BKException$BKNoSuchEntryException: No such entry
    at org.apache.bookkeeper.client.BKException.create(BKException.java:84) ~[org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.client.LedgerFragmentReplicator$3.readComplete(LedgerFragmentReplicator.java:305) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.client.LedgerHandle$6.onFailure(LedgerHandle.java:797) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:38) [org.apache.bookkeeper-bookkeeper-common-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:26) [org.apache.bookkeeper-bookkeeper-common-4.7.2.jar:4.7.2]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) [?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) [?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-all-4.1.22.Final.jar:4.1.22.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
00:02:15.280 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.proto.BookkeeperInternalCallbacks - Error in multi callback : -13
00:02:15.280 [BookKeeperClientWorker-OrderedExecutor-15-0] ERROR org.apache.bookkeeper.client.LedgerFragmentReplicator - BK error replicating ledger fragments for ledger: 426399
org.apache.bookkeeper.client.BKException$BKLedgerRecoveryException: Error while recovering ledger
    at org.apache.bookkeeper.client.BKException.create(BKException.java:78) ~[org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.client.LedgerFragmentReplicator$SingleFragmentCallback.processResult(LedgerFragmentReplicator.java:357) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.client.LedgerFragmentReplicator$1.processResult(LedgerFragmentReplicator.java:184) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.proto.BookkeeperInternalCallbacks$MultiCallback.doCallback(BookkeeperInternalCallbacks.java:244) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.proto.BookkeeperInternalCallbacks$MultiCallback.callback(BookkeeperInternalCallbacks.java:236) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.proto.BookkeeperInternalCallbacks$MultiCallback.tick(BookkeeperInternalCallbacks.java:218) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.proto.BookkeeperInternalCallbacks$MultiCallback.processResult(BookkeeperInternalCallbacks.java:254) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.client.LedgerFragmentReplicator$3.readComplete(LedgerFragmentReplicator.java:306) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.client.LedgerHandle$6.onFailure(LedgerHandle.java:797) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:38) [org.apache.bookkeeper-bookkeeper-common-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.common.concurrent.FutureEventListener.accept(FutureEventListener.java:26) [org.apache.bookkeeper-bookkeeper-common-4.7.2.jar:4.7.2]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) [?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) [?:1.8.0_181]
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_181]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_181]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-all-4.1.22.Final.jar:4.1.22.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]
00:02:15.280 [ReplicationWorker] WARN  org.apache.bookkeeper.replication.ReplicationWorker - BKLedgerRecoveryException while replicating the fragment
org.apache.bookkeeper.client.BKException$BKLedgerRecoveryException: Error while recovering ledger
    at org.apache.bookkeeper.client.BKException.create(BKException.java:78) ~[org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.client.BookKeeperAdmin.replicateLedgerFragment(BookKeeperAdmin.java:1058) ~[org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.client.BookKeeperAdmin.replicateLedgerFragment(BookKeeperAdmin.java:1036) ~[org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:254) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.replication.ReplicationWorker.rereplicate(ReplicationWorker.java:207) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at org.apache.bookkeeper.replication.ReplicationWorker.run(ReplicationWorker.java:168) [org.apache.bookkeeper-bookkeeper-server-4.7.2.jar:4.7.2]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-all-4.1.22.Final.jar:4.1.22.Final]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_181]

This won't always happen, I also restarted bookie before. I don't know why lost these four entries[E4139 - E4144] in bookie and ledger. And this is dangerous that ledger replicate so many times.

After we disable autorecovery and find the topic use this ledger, then delete the topic, entry log was successfully cleaned up

sijie commented 5 years ago

@codelipenghui thank you for reporting this.

I think there are two problems behind this issue.

1) why the disk usage keeps growing? 2) why the four entries are not available?

I think the disk usage kept growing because replication worker keeps trying to replicate fragments but fail at the last 4 entries. when replication worker keeps doing that, it is writing entries again and again. so the disk usage is going up. this can be resolved when we fix #1066

for the second problem, it is an interesting problem. my guess is it is related to ledger metadata update during closing. but I have to take a deeper look before coming to any conclusions.

jiazhai commented 5 years ago

would like to put it into 4.11

Raven888888 commented 2 years ago

Faced this issue in our 3-node setup, but it is always the same 1 node that faced this problem. Disk grows uncontrollably and the node died out of storage. Had to bookieformat every time. But after some time it will happen again.

Pulsar 2.10.1

How can I fix the issue?

Raven888888 commented 2 years ago

This is what found in logs

Bookie

2022-10-20T09:53:23,089+0000 [db-storage-cleanup-10-1] INFO  org.apache.bookkeeper.bookie.storage.ldb.EntryLocationIndex - Deleting indexes for ledgers: [501062, 501893, 501923, 501378, 501922, 501921, 500143, 500142, 501901, 500141, 501898, 501896, 501911, 501047, 501366, 501909, 501908, 501055, 501918, 501054, 501916, 501913, 501912]
2022-10-20T09:53:23,754+0000 [db-storage-cleanup-10-1] WARN  org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage - Failed to cleanup db indexes
java.io.IOException: Error in RocksDB compact
    at org.apache.bookkeeper.bookie.storage.ldb.KeyValueStorageRocksDB.compact(KeyValueStorageRocksDB.java:295) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.bookie.storage.ldb.EntryLocationIndex.removeOffsetFromDeletedLedgers(EntryLocationIndex.java:256) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage.lambda$checkpoint$7(SingleDirectoryDbLedgerStorage.java:638) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_342]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_342]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_342]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_342]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_342]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_342]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.74.Final.jar:4.1.74.Final]
    at java.lang.Thread.run(Thread.java:750) [?:1.8.0_342]
Caused by: org.rocksdb.RocksDBException: block checksum mismatch: expected 459167422, got 2336252423  in data/bookkeeper/ledgers/current/locations/000423.sst offset 612461 size 32254
    at org.rocksdb.RocksDB.compactRange(Native Method) ~[org.rocksdb-rocksdbjni-6.10.2.jar:?]
    at org.rocksdb.RocksDB.compactRange(RocksDB.java:3167) ~[org.rocksdb-rocksdbjni-6.10.2.jar:?]
    at org.rocksdb.RocksDB.compactRange(RocksDB.java:3135) ~[org.rocksdb-rocksdbjni-6.10.2.jar:?]
    at org.apache.bookkeeper.bookie.storage.ldb.KeyValueStorageRocksDB.compact(KeyValueStorageRocksDB.java:293) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    ... 10 more
2022-10-20T09:54:23,026+0000 [SyncThread-7-1] ERROR org.apache.bookkeeper.bookie.SyncThread - Exception flushing ledgers
java.io.IOException: Failed to flush RocksDB batch
    at org.apache.bookkeeper.bookie.storage.ldb.KeyValueStorageRocksDB$RocksDBBatch.flush(KeyValueStorageRocksDB.java:451) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage.checkpoint(SingleDirectoryDbLedgerStorage.java:621) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage.checkpoint(DbLedgerStorage.java:231) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.bookie.SyncThread.checkpoint(SyncThread.java:152) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.bookie.SyncThread.lambda$doCheckpoint$0(SyncThread.java:99) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_342]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_342]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_342]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_342]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_342]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_342]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.74.Final.jar:4.1.74.Final]
    at java.lang.Thread.run(Thread.java:750) [?:1.8.0_342]
Caused by: org.rocksdb.RocksDBException: block checksum mismatch: expected 459167422, got 2336252423  in data/bookkeeper/ledgers/current/locations/000423.sst offset 612461 size 32254
    at org.rocksdb.RocksDB.write0(Native Method) ~[org.rocksdb-rocksdbjni-6.10.2.jar:?]
    at org.rocksdb.RocksDB.write(RocksDB.java:1586) ~[org.rocksdb-rocksdbjni-6.10.2.jar:?]
    at org.apache.bookkeeper.bookie.storage.ldb.KeyValueStorageRocksDB$RocksDBBatch.flush(KeyValueStorageRocksDB.java:449) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    ... 12 more
2022-10-20T09:54:28,204+0000 [main-EventThread] INFO  org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Latch countdown due to ZK event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/ledgers/underreplication/locks
2022-10-20T09:54:28,206+0000 [main-EventThread] INFO  org.apache.bookkeeper.meta.ZkLedgerUnderreplicationManager - Latch countdown due to ZK event: WatchedEvent state:SyncConnected type:NodeChildrenChanged path:/ledgers/underreplication/locks
2022-10-20T09:55:23,019+0000 [SyncThread-7-1] ERROR org.apache.bookkeeper.bookie.SyncThread - Exception flushing ledgers
java.io.IOException: Failed to flush RocksDB batch
    at org.apache.bookkeeper.bookie.storage.ldb.KeyValueStorageRocksDB$RocksDBBatch.flush(KeyValueStorageRocksDB.java:451) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage.checkpoint(SingleDirectoryDbLedgerStorage.java:621) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.bookie.storage.ldb.DbLedgerStorage.checkpoint(DbLedgerStorage.java:231) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.bookie.SyncThread.checkpoint(SyncThread.java:152) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at org.apache.bookkeeper.bookie.SyncThread.lambda$doCheckpoint$0(SyncThread.java:99) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_342]
    at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_342]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_342]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_342]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_342]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_342]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.74.Final.jar:4.1.74.Final]
    at java.lang.Thread.run(Thread.java:750) [?:1.8.0_342]
Caused by: org.rocksdb.RocksDBException: block checksum mismatch: expected 459167422, got 2336252423  in data/bookkeeper/ledgers/current/locations/000423.sst offset 612461 size 32254
    at org.rocksdb.RocksDB.write0(Native Method) ~[org.rocksdb-rocksdbjni-6.10.2.jar:?]
    at org.rocksdb.RocksDB.write(RocksDB.java:1586) ~[org.rocksdb-rocksdbjni-6.10.2.jar:?]
    at org.apache.bookkeeper.bookie.storage.ldb.KeyValueStorageRocksDB$RocksDBBatch.flush(KeyValueStorageRocksDB.java:449) ~[org.apache.bookkeeper-bookkeeper-server-4.14.4.jar:4.14.4]
    ... 12 more

broker

2022-10-20T13:34:33,566+0000 [pulsar-ph-kafka-79-21] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger public/kafka/persistent/hourly-datetime-Ingest-partition-0
2022-10-20T13:34:33,569+0000 [bookkeeper-ml-scheduler-OrderedScheduler-11-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [public/kafka/persistent/hourly-datetime-Ingest-partition-0] Closing managed ledger
2022-10-20T13:34:33,569+0000 [bookkeeper-ml-scheduler-OrderedScheduler-11-0] INFO  org.apache.pulsar.broker.PulsarService - No ledger offloader configured, using NULL instance
2022-10-20T13:34:33,569+0000 [bookkeeper-ml-scheduler-OrderedScheduler-11-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger public/kafka/persistent/hourly-datetime-Ingest
2022-10-20T13:34:33,574+0000 [bookkeeper-ml-scheduler-OrderedScheduler-11-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [public/kafka/persistent/hourly-datetime-Ingest] Closing managed ledger
2022-10-20T13:34:33,574+0000 [bookkeeper-ml-scheduler-OrderedScheduler-11-0] ERROR io.streamnative.pulsar.handlers.kop.KafkaTopicManager - [[id: 0x864f8b07, L:/192.168.3.3:9092 - R:/192.168.3.2:33020]]Get empty non-partitioned topic for name persistent://public/kafka/hourly-datetime-Ingest
2022-10-20T13:34:33,574+0000 [bookkeeper-ml-scheduler-OrderedScheduler-11-0] ERROR io.streamnative.pulsar.handlers.kop.KafkaTopicManager - [[id: 0x54c1cef3, L:/192.168.3.3:9092 - R:/192.168.3.2:33024]]Get empty non-partitioned topic for name persistent://public/kafka/hourly-datetime-Ingest
2022-10-20T13:34:33,675+0000 [pulsar-ph-kafka-79-21] INFO  org.apache.pulsar.broker.PulsarService - No ledger offloader configured, using NULL instance
2022-10-20T13:34:33,675+0000 [pulsar-ph-kafka-79-21] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger public/kafka/persistent/hourly-datetime-Ingest-partition-0
2022-10-20T13:34:33,679+0000 [bookkeeper-ml-scheduler-OrderedScheduler-11-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [public/kafka/persistent/hourly-datetime-Ingest-partition-0] Closing managed ledger
2022-10-20T13:34:33,679+0000 [bookkeeper-ml-scheduler-OrderedScheduler-11-0] INFO  org.apache.pulsar.broker.PulsarService - No ledger offloader configured, using NULL instance
2022-10-20T13:34:33,679+0000 [bookkeeper-ml-scheduler-OrderedScheduler-11-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - Opening managed ledger public/kafka/persistent/hourly-datetime-Ingest
2022-10-20T13:34:33,684+0000 [bookkeeper-ml-scheduler-OrderedScheduler-11-0] INFO  org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl - [public/kafka/persistent/hourly-datetime-Ingest] Closing managed ledger
2022-10-20T13:34:33,684+0000 [bookkeeper-ml-scheduler-OrderedScheduler-11-0] ERROR io.streamnative.pulsar.handlers.kop.KafkaTopicManager - [[id: 0x864f8b07, L:/192.168.3.3:9092 - R:/192.168.3.2:33020]]Get empty non-partitioned topic for name persistent://public/kafka/hourly-datetime-Ingest
2022-10-20T13:34:33,684+0000 [bookkeeper-ml-scheduler-OrderedScheduler-11-0] ERROR io.streamnative.pulsar.handlers.kop.KafkaTopicManager - [[id: 0x54c1cef3, L:/192.168.3.3:9092 - R:/192.168.3.2:33024]]Get empty non-partitioned topic for name persistent://public/kafka/hourly-datetime-Ingest
2022-10-20T13:34:33,685+0000 [pulsar-web-35-9] INFO  org.eclipse.jetty.server.RequestLog - 192.168.3.3 - - [20/Oct/2022:13:34:33 +0000] "GET /admin/v2/persistent/public/kafka/hourly-datetime-Ingest/partitions HTTP/1.1" 200 16 "-" "Pulsar-Java-v2.10.0" 1
Hongten commented 1 year ago

Any update on this issue? I encountered the same error. Pulsar-2.10.1

Thanks.

mnit016 commented 1 year ago

Hi there, I'm facing a similar issue in Pulsar-2.9.5, bookie ledger utilization increased to 99.9%. But I can't see any ledger information in error logs there Could some one please tell me how to get over this issue? any workaround?

Details logs: Several days ago, a lot of "Entering Safepoint region..." "Leaving safepoint region..." in logs After a day, I found this "Exception ledger flush" / "Error in Rocksdb put" starting in logs. image And 4 hours later, it turned to a lots "Error during flush" image Above logs happen until "Ledger directory ... is out-of-space" image

jyotirmoy112 commented 8 months ago

This issue we also recently faced within our cluster. It happens if you restart any bookie without touching autorecovery. At that time if there is any ledger is trying to replicate it and if you don't have enough writeable bookie available. The replication goes into infinite loop even later all the bookie is available. Yes, this is a bug which needs a permanent fix. The temporary fix we applied figured out how many ledgers are making noise. Just make a note of all faulty ledgers and remove it safely. Afterthat the storage will start making free.

Capture