Closed aahmed-se closed 3 months ago
I am not too familiar with Pulsar, but if it has a partition / offset based scheme like Kafka or Kinesis, then it should be pretty straightforward to add using the same framework.
it has a similar concept with cursors.
We have use cases to integrate with our internal version of Pulsar. @aahmed-se Are you working on this? If not, I was going to take a stab at it.
@niketh it's on my radar, but don't have the bandwidth right now. You can go ahead.
It makes sense to implement this as a separate task and supervisor service. The characteristics of pulsar are different as compared to kafka. For example, running pulsar in a shared subscription, the consumers can read messages from any partition. The need for maintaining per consumer offsets etc are not needed.
@aahmed-se Sounds good, I will take a stab at it.
Hey all my use case also needed to integrate druid with pulsar . Wanted to know did you guys have any luck doing that ?
Anything I can do to help out with this? I'm also very interested. :)
@joshuadunham You can take a stab and implementing it bet to look at kinesis or kafka ingest integration in druid and replicate it , it should not be that difficult.
Hi, I am working on this and having some progress with the Reader interface of pulsar client.
https://github.com/rueian/incubator-druid/commits/pulsar-indexing-service
May I open a PR after finishing the test cases?
@rueian are you making changes to pulsar for this, if not then the code should be posted in druid itself.
@rueian Please do go ahead and open a PR, especially if you have something working & well tested. Thanks for your interest in contributing!
Any updates on this one? I'm new to both Druid (recently heard about it never used it) and Apache Pulsar (this one at least I used), and I would love to do a test run but I heavily depend on Pulsar. If this is forgotten about I might research myself and try to do something but it's gonna take a lot of time until I figure things.
Sorry for no follow up from me. However I am still blocked by something else. Please feel free to step in.
@niketh Were you able to make any progress on this? My team is very interested in this capability.
I'm interested in taking this up.
My research leads to believe the best way to implement this is building on the SeekableStream abstraction. Now on the surface this may appear like an impedance mismatch as Pulsar is primarily built around managing offsets/consumer state on the broker side but I still think the SeekableStream approach is best for Druid because it best suits it's notions of tasks and segments.
The way I think this should be implemented is to have the supervisor create a task per partition of the Pulsar topic, each task will then use an exclusive, non-durable subscription that consumes from that specific partition. In this way seeking to a specific message ID can be supported cleanly, which is required to support task resumption and idempotency.
This will result in one segment per task however so users of this indexing service will likely want to enable compaction.
@sijie does this approach sound correct to you?
/cc @gianm @jsun98 @dclim as you gentlemen worked on the SeekableStream abstraction and would appreciate your thoughts.
@josephglanville Yes. The approach looks right to me.
An alternative approach is to integrate Druid with Pulsar via Kafka-on-Pulsar by leverage Druid's existing Kafka integration.
each task will then use an exclusive, non-durable subscription that consumes from that specific partition
Does this mean the topic would be non-persistent? I'm wanting to better understand what is meant by "non-durable" subscription.
@devinbost The topic persistence is set at creation
Persistent Topic: persistent://property/cluster/namespace/topic
All messages will be stored on disk
Non persistent: non-persistent://property/cluster/namespace/topic
Messages will not be persisted to Disk
The durability refers to the cursor persistence. Durable
subscriptions have the cursor persisted
If a broker restarts from a failure, it can recover the cursor from the persistent storage (BookKeeper), so that messages can continue to be consumed from the last consumed position.
While non-durable
the cursor is lost on bookie restart.
Once a broker stops, the cursor is lost and can never be recovered, so that messages can not continue to be consumed from the last consumed position.
Since druid's supervisor creates short lived tasks and we want to support resumption from any point in the stream, we require a using the Reader
Interface, which uses a default non-durable
subscription mode. So to properly support resiliency and idempotent resumption, we need to use the non-durable
subscription.
This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
We'd still like this feature.
This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
Be able to index records by reading a set of pulsar topics.
https://pulsar.apache.org/