apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0

[Bug] Acknowledgements lost on bookies and brokers restart resulting in messages not being delivered #22709

Closed szkoludasebastian closed 1 month ago

szkoludasebastian commented 6 months ago

Version

Client version: 3.2.2, Server version: 3.2.2. We also noticed the same behaviour on previous versions, e.g. 3.1.0 and 3.1.2.

Minimal reproduce step

We noticed message loss when we restart all bookies and brokers while data is being processed: data is sent to a topic, and our application consumes the messages from that topic, saves each message payload somewhere, and then acknowledges the message. More precisely, the steps are:

  1. Simulate data streaming: send 1000000 messages to topic A over some period, e.g. 1000000 messages in 2 minutes.
  2. While the messages are being sent, restart all bookies and brokers at once (in our case we have 8 bookies and 6 brokers).
  3. At the same time, the application consumes messages from topic A, saves each message payload somewhere, and acknowledges the message after a successful save.

What did you expect to see?

No message loss

What did you see instead?

Some messages are lost. When we send 1000000 messages, the directory where we store them contains fewer than 1000000. We can't say how many fewer, because the behaviour is random: sometimes all the messages are there, and sometimes some are missing.
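A minimal sketch of how the gap could be quantified, rather than only comparing totals. It assumes (hypothetically, none of this is stated in the report) that the producer stamps each message with a sequence number from 0 to N-1 and that the application saves each payload as a file named `<seq>.msg`. The function name `find_missing` and the file layout are illustrative only.

```python
from pathlib import Path
import tempfile


def find_missing(saved_dir: Path, total_sent: int) -> list[int]:
    """Return the sequence numbers in [0, total_sent) that have no saved file.

    Assumption: one file per message, named "<seq>.msg" (hypothetical layout).
    """
    saved = set()
    for path in saved_dir.glob("*.msg"):
        if path.stem.isdigit():
            saved.add(int(path.stem))
    return [seq for seq in range(total_sent) if seq not in saved]


def demo() -> list[int]:
    # Simulate a run of 10 messages in which 3 and 7 were never saved.
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp)
        for seq in range(10):
            if seq not in (3, 7):
                (out / f"{seq}.msg").write_text(f"payload-{seq}")
        return find_missing(out, total_sent=10)


if __name__ == "__main__":
    print(demo())
```

Knowing exactly which sequence numbers are missing makes it possible to correlate them with restart times and with the application's own logs.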

Anything else?

No response

szkoludasebastian commented 2 months ago

Thank you for your commitment to analyze this problem. Do you perhaps know when these fixes and proposed change will be available for testing?

lhotari commented 2 months ago

There's no published timeline yet. Since there are multiple related issues, I'm planning to create a GitHub project which will make it easier to follow the progress of individual issues.

It's possible that the PIP-377 solution won't be required in the end, since there could be a way to make improvements so that Key_shared subscriptions don't end up losing acknowledgements during bookie and broker restarts.

Individual reproducer applications or instructions would be useful since they could help validate the solutions along the way. @szkoludasebastian A way to reproduce the issues consistently from the given instructions would be a very valuable contribution to this work. Have you had a chance to make progress on that front?

szkoludasebastian commented 2 months ago

Unfortunately, there has not been much progress in this area. I will post updates here.

dominikkulik commented 1 month ago

Hi @lhotari

Some time ago we set out to create an application that would allow us to reproduce this error. However, it turned out that we were not able to do so.

After thorough analysis, we determined that the problem was in our service. Our deduplication mechanism was not properly implemented, so messages that should not have been acked were acked after our service restarted. The problem was in our cache implementation.

Multiple attempts to restart bookies and brokers by themselves further confirmed that no message was lost.

Thank you for your commitment and help. The ticket can be closed.
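For readers hitting similar symptoms, here is a small self-contained simulation of one way a deduplication cache can produce this effect. The thread does not describe the actual bug, so this is purely an illustrative assumption: the cache entry is written before the save completes, the cache survives a service restart, and the redelivered message is then treated as a duplicate and acked without ever being saved. `FakeTopic`, `consume`, and `run` are hypothetical names, not Pulsar APIs.

```python
class FakeTopic:
    """Minimal stand-in for a subscription: unacked messages are redelivered."""

    def __init__(self, messages):
        self.unacked = dict(messages)  # msg_id -> payload

    def pending(self):
        return list(self.unacked.items())

    def ack(self, msg_id):
        self.unacked.pop(msg_id, None)


def consume(topic, store, seen, mark_before_save, crash_on=None):
    """Process pending messages once; return early if a crash is simulated."""
    for msg_id, payload in topic.pending():
        if msg_id in seen:
            topic.ack(msg_id)  # treated as a duplicate: acked without saving
            continue
        if mark_before_save:
            seen.add(msg_id)  # assumed buggy order: cache written before save
        if msg_id == crash_on:
            return  # service dies before the save completes
        store[msg_id] = payload
        seen.add(msg_id)  # safe order: cache written only after the save
        topic.ack(msg_id)


def run(mark_before_save):
    """Return the ids that were never saved after one crash and one restart."""
    topic = FakeTopic({i: f"payload-{i}" for i in range(5)})
    store, seen = {}, set()  # `seen` models a cache that survives restarts
    consume(topic, store, seen, mark_before_save, crash_on=2)  # crashes at id 2
    consume(topic, store, seen, mark_before_save)  # after the restart
    return sorted(set(range(5)) - set(store))


if __name__ == "__main__":
    print("cache written before save:", run(True))
    print("cache written after save:", run(False))
```

In this sketch the broker side behaves correctly in both runs, since the unacked message is redelivered after the restart; the loss comes only from the consumer acking a message it never saved.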

lhotari commented 1 month ago

Thanks for confirming, @dominikkulik. I'll close this issue.