apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0
14.13k stars 3.57k forks source link

bookie server's log error: Failed to restore rocksdb 000000000000000000/000000000000000001/000000000000000000 #5668

Closed kaykillyou closed 4 years ago

kaykillyou commented 4 years ago

Describe the bug i have a pulsar instance consisting of two clusters, there is a bookie server background log in each cluster that continuously outputs the following error: Failed to restore rocksdb 000000000000000008/000000000000000001/000000000000000000 Now producer and consumer are stopped, and bookie's heap memory usage reaches 90%.

To Reproduce [root@iZbp1h590z6jc4f0ib1fjyZ logs]# pwd /opt/bookieCluster/logs [root@iZbp1h590z6jc4f0ib1fjyZ logs]# tail -f pulsar-bookie-iZbp1h590z6jc4f0ib1fjyZ.log

get error: 11:55:49.583 [io-write-scheduler-OrderedScheduler-1-0] WARN org.apache.bookkeeper.stream.storage.impl.sc.ZkStorageContainerManager - Failed to start storage container (8) java.util.concurrent.CompletionException: org.apache.bookkeeper.statelib.api.exceptions.StateStoreException: Failed to restore rocksdb 000000000000000008/000000000000000001/000000000000000000 at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_221] at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_221] at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943) ~[?:1.8.0_221] at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926) ~[?:1.8.0_221] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_221] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_221] at org.apache.bookkeeper.statelib.impl.journal.AbstractStateStoreWithJournal.lambda$executeIO$16(AbstractStateStoreWithJournal.java:474) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_221] at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111) [com.google.guava-guava-21.0.jar:?] at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58) [com.google.guava-guava-21.0.jar:?] at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75) [com.google.guava-guava-21.0.jar:?] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_221] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_221] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_221] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_221] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221] Caused by: org.apache.bookkeeper.statelib.api.exceptions.StateStoreException: Failed to restore rocksdb 000000000000000008/000000000000000001/000000000000000000 at org.apache.bookkeeper.statelib.impl.rocksdb.checkpoint.RocksCheckpointer.restore(RocksCheckpointer.java:84) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] at org.apache.bookkeeper.statelib.impl.kv.RocksdbKVStore.loadRocksdbFromCheckpointStore(RocksdbKVStore.java:161) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] at org.apache.bookkeeper.statelib.impl.kv.RocksdbKVStore.init(RocksdbKVStore.java:223) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] at org.apache.bookkeeper.statelib.impl.journal.AbstractStateStoreWithJournal.lambda$initializeLocalStore$5(AbstractStateStoreWithJournal.java:202) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] at org.apache.bookkeeper.statelib.impl.journal.AbstractStateStoreWithJournal.lambda$executeIO$16(AbstractStateStoreWithJournal.java:471) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] ... 12 more Caused by: java.io.FileNotFoundException: 000000000000000008/000000000000000001/000000000000000000/checkpoints/7723596f-fd11-4342-93a1-69cd30afa9dc/metadata at org.apache.bookkeeper.statelib.impl.rocksdb.checkpoint.dlog.DLCheckpointStore.openInputStream(DLCheckpointStore.java:92) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] at org.apache.bookkeeper.statelib.impl.rocksdb.checkpoint.RocksCheckpointer.getLatestCheckpoint(RocksCheckpointer.java:117) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] at org.apache.bookkeeper.statelib.impl.rocksdb.checkpoint.RocksCheckpointer.restore(RocksCheckpointer.java:52) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] at org.apache.bookkeeper.statelib.impl.kv.RocksdbKVStore.loadRocksdbFromCheckpointStore(RocksdbKVStore.java:161) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] at org.apache.bookkeeper.statelib.impl.kv.RocksdbKVStore.init(RocksdbKVStore.java:223) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] at org.apache.bookkeeper.statelib.impl.journal.AbstractStateStoreWithJournal.lambda$initializeLocalStore$5(AbstractStateStoreWithJournal.java:202) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] at org.apache.bookkeeper.statelib.impl.journal.AbstractStateStoreWithJournal.lambda$executeIO$16(AbstractStateStoreWithJournal.java:471) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] ... 12 more

I want to know what caused the issue.How should I solve it?

jiazhai commented 4 years ago

@kaykillyou Thanks for reporting this. If you are not using table service, you could pass in -nss(no stream store service) parameter when start broker or standalone.

we will try to reproduce this issue and track it.

kaykillyou commented 4 years ago

@jiazhai Thanks for your reply! today my bookie server found this issue again, when i restart bookie service ,it also hapen. And each bookie's heap memory usage reaches 97%. I try to start the broker with command "./pulsar-daemon start broker -nss", then failed! I don’t know if I use it like this!

Looking forward to your reply!

log error: 11:27:48.175 [BookieReadThreadPool-OrderedExecutor-1-0] INFO org.apache.bookkeeper.proto.ReadEntryProcessorV3 - No ledger found while reading entry: 0 from ledger: 118609 11:27:48.175 [BookieReadThreadPool-OrderedExecutor-4-0] INFO org.apache.bookkeeper.proto.ReadEntryProcessorV3 - No ledger found while reading entry: 0 from ledger: 116820 11:27:48.176 [BookieReadThreadPool-OrderedExecutor-1-0] INFO org.apache.bookkeeper.proto.ReadEntryProcessorV3 - No ledger found while reading entry: 0 from ledger: 118609 11:27:48.176 [ReplicationWorker] INFO org.apache.bookkeeper.replication.ReplicationWorker - Ledger replicated successfully. ledger id is: 118609 11:27:48.188 [io-write-scheduler-OrderedScheduler-1-0] INFO org.apache.distributedlog.BKDistributedLogManager - Create sync reader starting from DLSN{logSegmentSequenceNo=1, entryId=0, slotId=0} 11:27:48.197 [io-write-scheduler-OrderedScheduler-1-0] ERROR org.apache.bookkeeper.statelib.impl.rocksdb.checkpoint.RocksCheckpointer - Failed to restore rocksdb 000000000000000006/000000000000000001/000000000000000000 java.io.FileNotFoundException: 000000000000000006/000000000000000001/000000000000000000/checkpoints/781fb64f-dcc8-4ec6-bd33-2b2e0b7d8e0c/metadata at org.apache.bookkeeper.statelib.impl.rocksdb.checkpoint.dlog.DLCheckpointStore.openInputStream(DLCheckpointStore.java:92) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] at org.apache.bookkeeper.statelib.impl.rocksdb.checkpoint.RocksCheckpointer.getLatestCheckpoint(RocksCheckpointer.java:117) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] at org.apache.bookkeeper.statelib.impl.rocksdb.checkpoint.RocksCheckpointer.restore(RocksCheckpointer.java:52) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] at org.apache.bookkeeper.statelib.impl.kv.RocksdbKVStore.loadRocksdbFromCheckpointStore(RocksdbKVStore.java:161) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] at org.apache.bookkeeper.statelib.impl.kv.RocksdbKVStore.init(RocksdbKVStore.java:223) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] at org.apache.bookkeeper.statelib.impl.journal.AbstractStateStoreWithJournal.lambda$initializeLocalStore$5(AbstractStateStoreWithJournal.java:202) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] at org.apache.bookkeeper.statelib.impl.journal.AbstractStateStoreWithJournal.lambda$executeIO$16(AbstractStateStoreWithJournal.java:471) ~[org.apache.bookkeeper-statelib-4.9.2.jar:4.9.2] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_144] at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:111) [com.google.guava-guava-21.0.jar:?] at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:58) [com.google.guava-guava-21.0.jar:?] at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:75) [com.google.guava-guava-21.0.jar:?] at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [?:1.8.0_144] at java.util.concurrent.FutureTask.run(FutureTask.java:266) [?:1.8.0_144] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [?:1.8.0_144] at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [?:1.8.0_144] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_144] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_144] at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-all-4.1.32.Final.jar:4.1.32.Final] at java.lang.Thread.run(Thread.java:748) [?:1.8.0_144] 11:27:48.198 [io-write-scheduler-OrderedScheduler-1-0] INFO org.apache.bookkeeper.stream.storage.impl.store.MVCCStoreFactoryImpl - Clearing resources hold by stream(1)/range(0) at storage container (6) 11:27:48.198 [io-write-scheduler-OrderedScheduler-1-0] WARN org.apache.bookkeeper.stream.storage.impl.sc.StorageContainerRegistryImpl - De-registered StorageContainer ('6') when failed to start java.util.concurrent.CompletionException: org.apache.bookkeeper.statelib.api.exceptions.StateStoreException: Failed to restore rocksdb 000000000000000006/000000000000000001/000000000000000000 at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) ~[?:1.8.0_144] at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) ~[?:1.8.0_144] at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943) ~[?:1.8.0_144] at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926) ~[?:1.8.0_144] at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_144] at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977) ~[?:1.8.0_144] at org.apache.bookkeeper.statelib.impl.journal.AbstractStateStoreWithJournal.lambda$executeIO$16(AbstractStateStoreWithJourn

jiazhai commented 4 years ago

@kaykillyou Sorry for the late. If you were starting broker by ./pulsar-daemon start broker, then I guess you were running in a cluster mode. In cluster mode '-nss' will not take effect. besides ./pulsar-daemon start broker, was there any other conf file changes for broker.conf, and which command is used to start a bookie, was there any change for bookkeeper.conf? looked into the code, seems by default, cluster mode will not start function worker and storage store.

kaykillyou commented 4 years ago

@jiazhai i start bookie with command "./pulsar-daemon start bookie", i found to configurate function "extraServerComponents=org.apache.bookkeeper.stream.server.StreamStorageLifecycleComponent " for the bookkeeper.conf, now i close it and seems normal now, I will observe it for a while. thanks very much!

jiazhai commented 4 years ago

Thanks @kaykillyou for your report. Since this is not a blocking issue. will put it in a low priority. we will try to reproduce and fix the issue later.

kaykillyou commented 4 years ago

@jiazhai OK, look forward to the next version update.

archfish commented 4 years ago

Pulsar 2.5.0 still have this issue,running in standalone mode. Now i start pulsar with -nss parameter hope normal.

MattMacGillivray commented 4 years ago

seeing this issue in 2.5.1, standalone mode, as well.

larshelge commented 4 years ago

Same issue in 2.5.2.

MattMacGillivray commented 4 years ago

Starting with the -nss flag (no stream storage) will resolve the issue temporarily...