Open RobertIndie opened 2 years ago
I think I have found the root cause.
The root cause may be in the WriteCache: if the size of an entry exceeds half of dbStorage_writeCacheMaxSizeMb, the error occurs.
Here are the details. From the log we can see that the write cache is reported as full when writing the first chunk of the large message:
2022-10-18T14:55:04,033+0800 [pulsar-io-19-1] DEBUG org.apache.pulsar.broker.service.ServerCnx - [/127.0.0.1:60392] Received send message request. producer: 0:0 standalone-14-0:0 size: 5242922, partition key is: null, ordering key is null
2022-10-18T14:55:04,042+0800 [pulsar-io-19-1] DEBUG org.apache.pulsar.broker.service.ServerCnx - [/127.0.0.1:60392] Received send message request. producer: 0:0 standalone-14-0:0 size: 5242922, partition key is: null, ordering key is null
2022-10-18T14:55:04,056+0800 [BookieWriteThreadPool-OrderedExecutor-0-0] DEBUG org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage - Set master key. ledger: 15
2022-10-18T14:55:04,058+0800 [BookieWriteThreadPool-OrderedExecutor-0-0] DEBUG org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage - isFenced. ledger: 15
2022-10-18T14:55:04,059+0800 [BookieWriteThreadPool-OrderedExecutor-0-0] DEBUG org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage - hasLimboState. ledger: 15
2022-10-18T14:55:04,059+0800 [BookieWriteThreadPool-OrderedExecutor-0-0] DEBUG org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage - Add entry. 15@0, lac = -1
2022-10-18T14:55:04,059+0800 [BookieWriteThreadPool-OrderedExecutor-0-0] INFO org.apache.bookkeeper.bookie.storage.ldb.SingleDirectoryDbLedgerStorage - Write cache is full, triggering flush
2022-10-18T14:55:04,068+0800 [pulsar-io-19-1] DEBUG org.apache.pulsar.broker.service.ServerCnx - [/127.0.0.1:60392] Received send message request. producer: 0:0 standalone-14-0:0 size: 5242922, partition key is: null, ordering key is null
2022-10-18T14:55:04,070+0800 [pulsar-io-19-1] DEBUG org.apache.pulsar.broker.service.ServerCnx - [/127.0.0.1:60392] Received send message request. producer: 0:0 standalone-14-0:0 size: 166, partition key is: null, ordering key is null
But in fact, no entries were written successfully.
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
...
If the first entry already exceeds the maxCacheSize of the WriteCache, it will not be added to the WriteCache because of this logic:
And the maxCacheSize is half of dbStorage_writeCacheMaxSizeMb:
https://github.com/apache/bookkeeper/blob/30bdedc25a59aa7d4df3f5c0962095a574f0d653/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/storage/ldb/SingleDirectoryDbLedgerStorage.java#L169
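To make the failure mode concrete, here is a toy model of the behaviour described above. It is only a sketch: the class and method names are hypothetical and are not BookKeeper's API, the flush is simulated by clearing the cache, and the 8 and 16 MB settings are illustrative values rather than the configuration used in this report. The entry size is the 5242922 bytes seen in the log.

```java
// Toy model only: names are hypothetical, not BookKeeper's actual classes.
public class WriteCacheModel {
    static final long MIB = 1024L * 1024L;

    /** Fixed-capacity cache that rejects an entry that does not fit in the remaining space. */
    static final class ToyCache {
        final long maxCacheSize;
        long used = 0;

        ToyCache(long maxCacheSize) {
            this.maxCacheSize = maxCacheSize;
        }

        boolean put(long entrySize) {
            if (used + entrySize > maxCacheSize) {
                return false; // entry does not fit
            }
            used += entrySize;
            return true;
        }

        void clear() {
            used = 0;
        }
    }

    /** Returns true if the entry is accepted, either directly or after one simulated flush. */
    static boolean addEntry(long configuredMaxSizeMb, long entrySizeBytes) {
        // Assumption from the comment above: each cache gets half of the configured size.
        ToyCache active = new ToyCache(configuredMaxSizeMb * MIB / 2);
        if (active.put(entrySizeBytes)) {
            return true;
        }
        active.clear(); // simulated flush: the cache is now empty
        return active.put(entrySizeBytes); // still fails if the entry is larger than the cache itself
    }

    public static void main(String[] args) {
        long entrySize = 5_242_922L; // chunk size taken from the log
        System.out.println(addEntry(8, entrySize));  // half is 4 MiB, smaller than the entry
        System.out.println(addEntry(16, entrySize)); // half is 8 MiB, larger than the entry
    }
}
```

In this model an entry larger than half of the configured size is rejected even when the cache is empty, so flushing never helps, which would match the "Write cache is full, triggering flush" line followed by no successful write.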
So we need to make sure that dbStorage_writeCacheMaxSizeMb is greater than 2 * maxEntrySize.
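As a quick way to apply this rule, the sketch below computes the smallest whole-MB setting that is strictly greater than 2 * maxEntrySize. The helper names are hypothetical, and it assumes the setting is interpreted in units of 1024 * 1024 bytes and that the whole configured size is available to a single ledger directory.

```java
// Sizing helper sketch: names are hypothetical, not part of BookKeeper or Pulsar.
public class WriteCacheSizing {
    static final long MIB = 1024L * 1024L;

    /** True if the configured size (in MB) is strictly greater than 2 * maxEntrySize. */
    static boolean isLargeEnough(long configuredMaxSizeMb, long maxEntrySizeBytes) {
        return configuredMaxSizeMb * MIB > 2 * maxEntrySizeBytes;
    }

    /** Smallest whole-MB setting that is strictly greater than 2 * maxEntrySize. */
    static long minWriteCacheMaxSizeMb(long maxEntrySizeBytes) {
        long requiredBytes = 2 * maxEntrySizeBytes;
        return requiredBytes / MIB + 1; // floor + 1 keeps the comparison strict
    }

    public static void main(String[] args) {
        long entrySize = 5_242_922L; // chunk size taken from the log above
        System.out.println(minWriteCacheMaxSizeMb(entrySize)); // prints 11
        System.out.println(isLargeEnough(10, entrySize));      // prints false
        System.out.println(isLargeEnough(11, entrySize));      // prints true
    }
}
```

For the 5242922-byte chunk, 2 * maxEntrySize is 10485844 bytes, which is just over 10 MB, so 11 is the smallest setting that satisfies the rule under these assumptions.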
I think it's a bug in the WriteCache. I will submit a fix later.
The issue had no activity for 30 days, mark with Stale label.
Version
Master branch, at least 08f5766d95034ce27c44ee30e4734d2a8f078e11
Minimal reproduce step
What did you expect to see?
Large messages are published successfully.
What did you see instead?
These exceptions are thrown in an endless loop:
Anything else?
The problem may be related to this configuration:
It works fine when I set it to use SortedLedgerStorage.
For more context, see: https://github.com/apache/pulsar/pull/17985#issue-1402829309