Open hamadodene opened 4 months ago
@hamadodene It would be helpful to share the Pulsar version, the Bookkeeper version, and any customized ensemble size (E), write quorum (Qw), and ack quorum (Qa) sizes.
@lhotari Yes, we have Pulsar 3.0.3 and BK 4.16.4. For E, Qw, and Qa we use ensembleSize=2, writeQuorumSize=2, ackQuorumSize=2.
@hamadodene I noticed this in the output that you shared:
jute.maxbuffer value is 1048575 Bytes
In Pulsar, the default is -Djute.maxbuffer=10485760.
When you run Bookkeeper, do you use bin/pulsar bookie to start it?
This might not be relevant in this context, but I'm just wondering whether large ZNodes combined with a low jute.maxbuffer value could result in inconsistencies.
When running Bookkeeper with Pulsar's bin/pulsar bookie script, one of the main differences is that Bookkeeper will use org.apache.pulsar.metadata.bookkeeper.PulsarMetadataBookieDriver and org.apache.pulsar.metadata.bookkeeper.PulsarMetadataClientDriver from the Pulsar code base for metadata operations.
@hamadodene do you use offloading? I found issue https://github.com/apache/pulsar/issues/21737 which could be related in that case.
@lhotari
We don't use offload. We have our own service that wraps Bookkeeper (we create an org.apache.bookkeeper.server.EmbeddedServer). We don't use the two classes you mentioned earlier, but we configure the metadataServiceUri of Bookkeeper as zk+hierarchical, and the ZNode /ledgers/LAYOUT indicates hierarchical.
We recently forced the metadataServiceUri to be hierarchical; previously, we were using zk+null, which then used the Bookkeeper default. Therefore, the layout on the ZNode was Flat, probably due to defaults from older versions.
This caused problems because during the update, the Pulsar ledger ZNodes were written with the hierarchical layout, while other nodes were written with the flat layout. Perhaps this caused the inconsistencies.
However, Bookkeeper seemed to write without errors (at least it wrote the ZNodes); perhaps the missing ledgers in the logs are those written before we fixed the layout?
The update was made from Pulsar 2.9.5 to 3.0.3 and Bookkeeper 4.14.4 to 4.16.4.
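To make the flat versus hierarchical difference concrete, here is a minimal Python sketch of where the metadata ZNode for a given ledger ID would live under each layout. It assumes the commonly described Bookkeeper conventions (a 10-digit zero-padded ID, `L` as the node prefix, and a 2-4-4 split for the hierarchical layout) and the `/ledgers` root mentioned in this thread. The function names are hypothetical, and the sketch does not cover the long hierarchical layout used for very large ledger IDs.

```python
# Illustrative only: path conventions are assumptions based on how
# Bookkeeper's flat and (legacy) hierarchical layouts are usually described.

def zk_string_id(ledger_id: int) -> str:
    """Zero-pad the ledger ID to 10 digits."""
    if not 0 <= ledger_id < 2**31:
        raise ValueError("sketch only covers ledger IDs below 2**31")
    return f"{ledger_id:010d}"


def flat_path(ledger_id: int, root: str = "/ledgers") -> str:
    """Flat layout: every ledger ZNode sits directly under the root."""
    return f"{root}/L{zk_string_id(ledger_id)}"


def hierarchical_path(ledger_id: int, root: str = "/ledgers") -> str:
    """Hierarchical layout: the padded ID is split into 2-4-4 digit levels."""
    s = zk_string_id(ledger_id)
    return f"{root}/{s[0:2]}/{s[2:6]}/L{s[6:10]}"


if __name__ == "__main__":
    # Ledger 15543 is the one from the bug report.
    print(flat_path(15543))
    print(hierarchical_path(15543))
```

Under these assumptions, a client that expects one layout looks for the ledger metadata at a different ZNode path than a client that wrote it with the other layout, so checking both candidate paths in ZooKeeper may help narrow down which layout a given ledger was created with.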
zk+null is the safest default because it automatically adapts to the existing layout.
I suggest using that and letting the clients discover the layout automatically.
If it is a new cluster, formatting it with zk+null will set the layout to hierarchical.
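The behaviour described above can be sketched in a few lines of Python. This is only a model of the rule as stated in this thread (zk+null adopts the existing layout, and a new cluster ends up hierarchical), not Bookkeeper's actual implementation. The function names and the example host `zk1:2181` are hypothetical, and treating a bare `zk://` scheme as "no layout requested" is an assumption.

```python
from typing import Optional


def layout_from_uri(metadata_service_uri: str) -> Optional[str]:
    """Return the layout requested in a zk metadataServiceUri, or None for zk+null."""
    scheme, sep, _rest = metadata_service_uri.partition("://")
    if not sep:
        raise ValueError(f"not a URI: {metadata_service_uri!r}")
    backend, _, layout = scheme.partition("+")
    if backend != "zk":
        raise ValueError(f"sketch only handles zk URIs, got {backend!r}")
    # Assumption: a missing layout is treated the same as an explicit "null".
    if layout in ("", "null"):
        return None
    return layout.lower()


def discovered_layout(existing_layout: Optional[str]) -> str:
    """Layout a zk+null client would end up with, per the rule stated in the thread."""
    if existing_layout is not None:
        return existing_layout.lower()  # adapt to what /ledgers/LAYOUT says
    return "hierarchical"  # new cluster formatted with zk+null


def has_conflict(metadata_service_uri: str, existing_layout: Optional[str]) -> bool:
    """True when the URI forces a layout that differs from the existing one."""
    requested = layout_from_uri(metadata_service_uri)
    if requested is None or existing_layout is None:
        return False
    return requested != existing_layout.lower()


if __name__ == "__main__":
    print(discovered_layout("Flat"))
    print(has_conflict("zk+hierarchical://zk1:2181/ledgers", "Flat"))
```

A check like `has_conflict` flags the situation reported here, where one client forces hierarchical while the LAYOUT ZNode still says Flat. The sketch does not model how Bookkeeper itself reacts to such a mismatch.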
@eolivelli our system is pretty old. In the znode /ledgers/LAYOUT we had Flat.
We use the same BK cluster for pulsar and for some other parts of our application. Our code defaulted to zk+null.
After the pulsar upgrade we noticed that the ledgers for pulsar topics were created with the hierarchical layout (while the ledgers created directly by us were still created with the flat layout). This might be a problem with pulsar; maybe it forces the layout instead of relying on the cluster default.
But the strange thing @hamadodene is reporting is that pulsar was (apparently) able to publish messages on the topics, but could not read the messages because bk was throwing BKException$BKNoSuchLedgerExistsException: No such ledger exists on Bookies.
We then forced the hierarchical layout on the bk cluster, but bk still could not read the pulsar ledgers. Looking in the bk logs, we found no entries for the ledgers "created" before the layout switch.
Is it possible that bk was creating the znode for the ledger (with hierarchical layout), and then silently failed to actually write because of the conflicting layout?
BUG REPORT
Describe the bug
We have noticed a strange behavior in our Bookkeeper cluster in production. In summary, we are currently unable to access the data of some ledgers that should have been created by Bookkeeper and therefore should exist. When we try to find the ledger using the Bookkeeper CLI:
However, when we try to read the ledger using the CLI:
./bookkeeper readledger -ledgerid 15543
Note:
We also checked in the entry log files, and it really seems that ledger does not exist.
Furthermore, when that ledger was created by Apache Pulsar, Pulsar did not give any errors during writing. But when trying to read the ledger, Bookkeeper responded with "No such ledger exists on Bookies."
Do you have any information on what the problem might be or how we can debug this issue?
To Reproduce
We were unable to reproduce the issue.
Expected behavior
Given that the metadata exists, I expect the ledger to actually exist on Bookkeeper as well. We have not performed any ledger deletions on Bookkeeper.
Pulsar version: 3.0.3
Bookkeeper version: 4.16.4