apache / pulsar

Apache Pulsar - distributed pub-sub messaging system
https://pulsar.apache.org/
Apache License 2.0

[Doc] Message Id Uniqueness? #18835

Open larshp opened 1 year ago

larshp commented 1 year ago

What issue do you find in Pulsar docs?

In https://pulsar.apache.org/docs/next/concepts-messaging/

The Message ID is described as

The message ID of a message is assigned by bookies as soon as the message is persistently stored. Message ID indicates a message’s specific position in a ledger and is unique within a Pulsar cluster.

In the binary protocol it is: https://github.com/apache/pulsar/blob/master/pulsar-common/src/main/proto/PulsarApi.proto#L57-L67

I.e., the message ID is composed of several fields, and it is unclear which of them define its uniqueness.

What is your suggestion?

Looking at https://pulsar.apache.org/docs/2.10.x/pulsar-admin/, some of the commands take ledgerId and entryId. If these are exactly the unique part of the message ID, that could be added to the documentation for clarity.

Any reference?

No response

tisonkun commented 1 year ago

cc @merlimat @BewareMyPower

BewareMyPower commented 1 year ago

Since the ledger id and entry id are not exposed to users (in pulsar-client-api module), there is no need to document that.

Actually, the MessageId objects returned by send, receive and getLastMessageId are unique for the triple (ledger id, entry id, batch index). However, there is no public API to get these fields. You have to use the specific implementations like MessageIdImpl and BatchMessageIdImpl to access these fields. These implementations are very messy and might change. See more discussions here
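To make the triple concrete, here is a minimal Java sketch. It is not Pulsar's implementation: the class `MessageIdTriple` and its fields are hypothetical stand-ins for the (ledger id, entry id, batch index) triple described above, and using -1 for "not part of a batch" is an assumption of the sketch. The real MessageIdImpl and BatchMessageIdImpl classes may behave differently.

```java
import java.util.Objects;

public class MessageIdSketch {

    // Hypothetical stand-in for the (ledger id, entry id, batch index) triple.
    static final class MessageIdTriple implements Comparable<MessageIdTriple> {
        final long ledgerId;
        final long entryId;
        final int batchIndex; // assumption: -1 means "not part of a batch"

        MessageIdTriple(long ledgerId, long entryId, int batchIndex) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
            this.batchIndex = batchIndex;
        }

        // Order by ledger id, then entry id, then batch index.
        @Override
        public int compareTo(MessageIdTriple other) {
            int c = Long.compare(ledgerId, other.ledgerId);
            if (c != 0) {
                return c;
            }
            c = Long.compare(entryId, other.entryId);
            if (c != 0) {
                return c;
            }
            return Integer.compare(batchIndex, other.batchIndex);
        }

        // Two ids are equal only if all three fields match.
        @Override
        public boolean equals(Object obj) {
            if (!(obj instanceof MessageIdTriple)) {
                return false;
            }
            MessageIdTriple o = (MessageIdTriple) obj;
            return ledgerId == o.ledgerId && entryId == o.entryId && batchIndex == o.batchIndex;
        }

        @Override
        public int hashCode() {
            return Objects.hash(ledgerId, entryId, batchIndex);
        }
    }

    public static void main(String[] args) {
        MessageIdTriple a = new MessageIdTriple(10, 0, 0);
        MessageIdTriple b = new MessageIdTriple(10, 0, 1); // same entry, next message in the batch
        System.out.println(a.equals(b));        // false
        System.out.println(a.compareTo(b) < 0); // true
    }
}
```

In this sketch, two ids that share a ledger id and an entry id are still distinct and ordered as long as their batch indexes differ.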

larshp commented 1 year ago

I think it's important to document the design of the software so that users can validate and understand the data and workings of the platform.

A user of the Python client will see https://pulsar.apache.org/api/python/2.10.x/pulsar.html#MessageId.__init__ and be exposed to the concepts of ledger and entry id

A user of https://pulsar.apache.org/docs/2.10.x/pulsar-admin/, will have to enter the ledger and entry id for some commands

It is unclear that the Message ID is composed of fields referred to elsewhere in the documentation, clients, and CLI, and it is also unclear which of those fields make up its unique part.

BewareMyPower commented 1 year ago

A user of the Python client will see https://pulsar.apache.org/api/python/2.10.x/pulsar.html#MessageId.__init__ and be exposed to the concepts of ledger and entry id

That's the point I raised on the mailing list. Although the original authors of the Java client did not want to expose these "details", the authors of many other clients do expose them.

A user of https://pulsar.apache.org/docs/2.10.x/pulsar-admin/, will have to enter the ledger and entry id for some commands

That makes sense to me. I just looked at the get-message-by-id API and am confused by it. I doubt whether this API is reasonable.

A message could be stored in a single entry, which can be located uniquely by the ledger id and the entry id. However, a message could also be stored across multiple entries, or multiple messages could be stored in a single entry. I need to look deeper into this API's semantics.
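The batching case can be illustrated with a small self-contained Java sketch. Everything in it is hypothetical (the class `EntryLayoutSketch`, the `ledgerId:entryId:batchIndex` string encoding, and numbering a single-message entry with batch index 0); it does not use Pulsar's API, and it does not model a message split across multiple entries.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EntryLayoutSketch {

    // Returns one "ledgerId:entryId:batchIndex" string per message.
    // Each inner list is one entry; its elements are the messages batched into it.
    static List<String> positions(long ledgerId, List<List<String>> entries) {
        List<String> result = new ArrayList<>();
        for (int entryId = 0; entryId < entries.size(); entryId++) {
            List<String> batch = entries.get(entryId);
            for (int batchIndex = 0; batchIndex < batch.size(); batchIndex++) {
                result.add(ledgerId + ":" + entryId + ":" + batchIndex);
            }
        }
        return result;
    }

    // Counts distinct (ledgerId, entryId) pairs by dropping the batch index.
    static int distinctEntryPairs(List<String> positions) {
        Set<String> pairs = new HashSet<>();
        for (String p : positions) {
            pairs.add(p.substring(0, p.lastIndexOf(':')));
        }
        return pairs.size();
    }

    public static void main(String[] args) {
        List<List<String>> entries = List.of(
                List.of("m0"),              // entry 0: a single message
                List.of("m1", "m2", "m3")); // entry 1: three messages in one batch
        List<String> all = positions(7L, entries);
        System.out.println(all.size() + " messages, "
                + distinctEntryPairs(all) + " distinct (ledger, entry) pairs");
    }
}
```

Under these assumptions, four messages map to only two (ledger id, entry id) pairs, so the pair alone cannot pick out one message inside a batched entry.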

Maybe it's worth opening another discussion about this topic. Or do you think it's better to continue discussing in https://lists.apache.org/thread/rdkqnkohbmkjjs61hvoqplhhngr0b0sd?

BewareMyPower commented 1 year ago

After checking get-last-message-by-id again, I found that it is a very limited API: it can only get the last message id of a non-partitioned topic. I have changed my mind. We should expose these fields (the so-called details) to users. I will open another discussion soon.

In short, this issue makes sense to me.

github-actions[bot] commented 1 year ago

The issue had no activity for 30 days, mark with Stale label.

tisonkun commented 1 year ago

@BewareMyPower does the new MessageIdAdv, or a related change, solve this issue?

BewareMyPower commented 1 year ago

@tisonkun No. Users don't need to know which fields define the uniqueness of a MessageId. The MessageId instances retrieved from receive are guaranteed to be different, and there is also a public compareTo method for comparing two MessageId instances.

I think the main problem of the issue is that many pulsar-admin APIs require users to provide the ledger id and the entry id. Unfortunately, we still need the batch index field for these APIs to locate a unique message.

First, in the documentation, we can tell users how to retrieve these fields, which would be easy to do with the help of MessageIdAdv. (Maybe we can provide code examples, but let's wait until 3.0.0 is released.) Second, we should improve these APIs to accept the batch index.
