Open Samreay opened 7 months ago
The current implementation for seeking by timestamp is here: https://github.com/apache/pulsar/blob/30697bd382da0c5a4458f3a7c71d2c9c64ee6b63/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentSubscription.java#L727-L774 called from here: https://github.com/apache/pulsar/blob/c99a51d021a627d675697656869d418d416a5e1b/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/ServerCnx.java#L1890-L1936
I guess the missed optimization is using the ledger metadata as a first-level filter. There is a binary search, but it doesn't use the ledger metadata: https://github.com/apache/pulsar/blob/82237d3684fe506bcb6426b3b23f413422e6e4fb/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/OpFindNewest.java#L83-L137
LedgerInfo contains the timestamp when the ledger was sealed (it got closed or was rolled over): https://github.com/apache/pulsar/blob/23f46a0736e844a2a2fec943ee76d4e1e73ec477/managed-ledger/src/main/proto/MLDataFormats.proto#L55-L61
There could be an initial binary search that uses this information, which is available in ManagedLedgerImpl via https://github.com/apache/pulsar/blob/e6cd005f90524222df194a690718f77c4e646670/managed-ledger/src/main/java/org/apache/bookkeeper/mledger/impl/ManagedLedgerImpl.java#L3844-L3846
I guess there is a gotcha: the ledger's timestamp comes from the broker's clock, but the seek uses the message publish time, which comes from the client's (publisher's) clock. There might be corner cases because of this.
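To make the idea concrete, here is a rough sketch of that first-level filtering. It is not Pulsar code: `LedgerMeta`, `firstCandidateLedger` and `skewMillis` are hypothetical names, and it assumes the sealed ledgers are ordered oldest to newest with non-decreasing close timestamps. The `skewMillis` margin is one possible way to tolerate the broker/client clock difference mentioned above.

```java
import java.util.List;

final class LedgerTimestampFilter {

    /** Hypothetical stand-in for the two LedgerInfo fields this sketch needs. */
    static final class LedgerMeta {
        final long ledgerId;
        final long closedAtMillis; // broker clock, set when the ledger was sealed

        LedgerMeta(long ledgerId, long closedAtMillis) {
            this.ledgerId = ledgerId;
            this.closedAtMillis = closedAtMillis;
        }
    }

    /**
     * Binary search over sealed ledgers, ordered oldest to newest.
     * Returns the index of the first ledger that could still hold a message
     * published at or after targetMillis. A ledger sealed before
     * (targetMillis - skewMillis) is assumed to contain only older messages.
     * A result equal to ledgers.size() means every sealed ledger was skipped,
     * so the entry-level search would start in the currently open ledger.
     */
    static int firstCandidateLedger(List<LedgerMeta> ledgers, long targetMillis, long skewMillis) {
        long threshold = targetMillis - skewMillis;
        int lo = 0;
        int hi = ledgers.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (ledgers.get(mid).closedAtMillis < threshold) {
                lo = mid + 1; // sealed too early, skip this ledger and everything before it
            } else {
                hi = mid;     // may contain the target, keep it in range
            }
        }
        return lo;
    }
}
```

The existing entry-level binary search would then only need to run from the returned ledger onwards, instead of across the whole topic.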
There's also a related issue #10488.
Motivation
Right now, seeking a reader or a consumer to a specific timestamp seems to be an unoptimised process that can take many seconds, or even over a minute, for larger topics (single GB data size, tens of messages per second). A Slack comment from @lhotari indicates that seeking via a timestamp is not optimised, so I'm proposing optimising it as a valuable feature.
Solution
Seeking currently works by message ID or by timestamp. I assume (though I could be wrong) that seeking by message ID is optimised. Without going into the implementation details properly and just spitballing ideas, something like binary searching on the time, or keeping a tree map from timestamp to message ID (at any level of sparsity), might make seeking far faster.
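As a rough illustration of the tree map idea (just a sketch, not anything that exists in Pulsar), the index could sample every Nth message and map its publish time to its position. `SparseTimestampIndex`, `Position`, `onMessage` and `scanStartFor` are made-up names, and the sketch assumes publish times are roughly non-decreasing within the topic.

```java
import java.util.Map;
import java.util.TreeMap;

final class SparseTimestampIndex {

    /** Hypothetical stand-in for a message ID (ledger id + entry id). */
    static final class Position {
        final long ledgerId;
        final long entryId;

        Position(long ledgerId, long entryId) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
        }
    }

    private final TreeMap<Long, Position> index = new TreeMap<>();
    private final int sampleEvery;
    private long seen;

    SparseTimestampIndex(int sampleEvery) {
        this.sampleEvery = sampleEvery;
    }

    /** Records every sampleEvery-th message; the first position seen for a timestamp wins. */
    void onMessage(long publishMillis, long ledgerId, long entryId) {
        if (seen++ % sampleEvery == 0) {
            index.putIfAbsent(publishMillis, new Position(ledgerId, entryId));
        }
    }

    /**
     * Returns the indexed position with the greatest publish time strictly
     * before targetMillis, or null if there is none (scan from the start).
     * The caller still scans forward from here to the first message whose
     * publish time is at or after targetMillis.
     */
    Position scanStartFor(long targetMillis) {
        Map.Entry<Long, Position> entry = index.lowerEntry(targetMillis);
        return entry == null ? null : entry.getValue();
    }
}
```

With a sparsity of N, the forward scan after the lookup touches at most about N messages, so N trades index size against seek latency.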
Alternatives
No response
Anything else?
No response
Are you willing to submit a PR?