Open mingmcb opened 1 year ago
A single bookie failure shouldn’t have that type of impact unless it drops the number of active/writable bookies below the write/ack quorum. Can you share the number of bookies in your cluster and the write and ack quorum settings?
There are 4 bookies configured for each environment. We also have the following configuration, which requires 3 bookies to be up and running. In other words, only 1 bookie can be down without service interruption.
managedLedgerDefaultEnsembleSize: "3"
# Number of copies to store for each message
managedLedgerDefaultWriteQuorum: "3"
# Number of guaranteed copies (acks to wait before write is complete)
managedLedgerDefaultAckQuorum: "3"
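With these settings, every new ledger needs at least `ensembleSize` writable bookies, so a 4-bookie cluster tolerates only one failure before writes stall. A minimal sketch of that arithmetic (the class and method names here are illustrative, not a Pulsar API):

```java
public class QuorumMath {
    // A new ledger needs ensembleSize healthy bookies, so the cluster
    // tolerates (totalBookies - ensembleSize) simultaneous failures
    // before ledger creation is blocked.
    public static int tolerableFailures(int totalBookies, int ensembleSize) {
        return totalBookies - ensembleSize;
    }

    public static void main(String[] args) {
        // 4 bookies, managedLedgerDefaultEnsembleSize = 3
        System.out.println(QuorumMath.tolerableFailures(4, 3)); // prints 1
    }
}
```

Note that with `ackQuorum` equal to `writeQuorum` (3 = 3), every write must also be acknowledged by all three bookies in the ensemble, so a single slow or failed bookie affects in-flight writes until the ensemble is changed.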
The issue had no activity for 30 days, mark with Stale label.
I think we are facing the same problem. This issue might also be related.
We face this issue from time to time when, due to engineering error, we encounter a schema incompatibility between the received messages and the one the consumer expects. We have disabled schema validation on the broker side by choice. However, rather than just sending those messages to the DLQ, the Pulsar client side fails validation and goes into a producer-creation loop.
Here is the example code to replicate the issue.
We had a look at the client code, and the problem seems to be in the AutoProduceBytesSchema.encode(byte[] message) method, since requireSchemaValidation is flipped to true in the constructor, even though broker-side validation is disabled.
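The reported behavior can be illustrated with a simplified, hypothetical model — the class and field names below mirror the report, but this is not the actual Pulsar source:

```java
// Hypothetical simplification of the reported behavior: wrapping a
// schema unconditionally turns validation on, regardless of the
// broker-side validation setting.
public class AutoProduceBytesSketch {
    private final boolean requireSchemaValidation;

    // Mirrors the constructor described in the report: supplying a
    // wrapped schema flips requireSchemaValidation to true.
    public AutoProduceBytesSketch(boolean wrapsSchema) {
        this.requireSchemaValidation = wrapsSchema;
    }

    public byte[] encode(byte[] message) {
        if (requireSchemaValidation) {
            // In the real client this is where an incompatible payload
            // fails, sending the consumer into a producer-creation loop.
            throw new IllegalStateException("schema validation failed");
        }
        return message;
    }

    public boolean requiresValidation() {
        return requireSchemaValidation;
    }
}
```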
We were able to "fix" this problem as follows. In ConsumerImpl.initDeadLetterProducerIfNeeded(), instead of:
((ProducerBuilderImpl<byte[]>) client.newProducer(Schema.AUTO_PRODUCE_BYTES(schema)))
we created the producer with:
((ProducerBuilderImpl<byte[]>) client.newProducer(Schema.BYTES))
And in ConsumerImpl.processPossibleToDLQ(MessageIdAdv messageId), instead of:
producerDLQ.newMessage(Schema.AUTO_PRODUCE_BYTES(message.getReaderSchema().get()))
we sent the message with:
producerDLQ.newMessage(Schema.AUTO_PRODUCE_BYTES(Schema.BYTES))
I am not sure if this is the proper way to fix it, as it leaves the DLQ with a binary schema in the registry. This would be fine for us, but I am not sure whether someone else relies on it having a more descriptive schema. However, I would like to think of the DLQ as a dumping ground that should be able to accept all types of messages.
If this is an OK fix, I can create a pull request for it.
Search before asking
Version
2.11
Minimal reproduce step
Use the Pulsar Java client library to create a consumer with a DLQ producer.
What did you expect to see?
An extra producer should not be created when there is an issue on the Pulsar side.
What did you see instead?
The client created over 10000 producers and eventually exceeded the limits.
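The expectation above amounts to bounding the retry loop. A hypothetical sketch of capping producer-creation attempts (this is not the Pulsar client's actual code; the class and method names are made up for illustration):

```java
import java.util.function.Supplier;

public class BoundedRetry {
    // Invokes the factory at most maxAttempts times instead of
    // retrying forever; returns null if every attempt fails.
    public static <T> T createWithCap(Supplier<T> factory, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return factory.get();
            } catch (RuntimeException e) {
                // In the reported bug this loop effectively had no cap,
                // so a persistent failure created producer after producer.
            }
        }
        return null;
    }
}
```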
Anything else?
see logs
Are you willing to submit a PR?