Add apache pulsar based indexing service

aahmed-se commented 5 years ago

Be able to index records by reading a set of pulsar topics.

https://pulsar.apache.org/

gianm commented 5 years ago

I am not too familiar with Pulsar, but if it has a partition / offset based scheme like Kafka or Kinesis, then it should be pretty straightforward to add using the same framework.

aahmed-se commented 5 years ago

It has a similar concept, called cursors.

https://streaml.io/blog/cursors-in-pulsar

niketh commented 5 years ago

We have use cases to integrate with our internal version of Pulsar. @aahmed-se Are you working on this? If not, I was going to take a stab at it.

aahmed-se commented 5 years ago

@niketh It's on my radar, but I don't have the bandwidth right now. You can go ahead.

niketh commented 5 years ago

It makes sense to implement this as a separate task and supervisor service. The characteristics of Pulsar are different from those of Kafka. For example, when running Pulsar with a shared subscription, consumers can read messages from any partition, so there is no need to maintain per-consumer offsets.
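
To illustrate the point about shared subscriptions, here is a toy model in plain Python. It does not use the Pulsar client; the class name `SharedSubscription` and the `p0:m0` style message labels are made up for this sketch. It only shows that the position lives with the subscription, so any consumer can be handed a message from any partition without tracking its own offsets.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


class SharedSubscription:
    """Toy model of a shared subscription: one backlog, many consumers."""

    def __init__(self, messages: List[str]) -> None:
        # The subscription (broker side) owns the position in the backlog.
        self._backlog: Deque[str] = deque(messages)
        # Record of which consumer received which messages.
        self.delivered: Dict[str, List[str]] = {}

    def receive(self, consumer: str) -> Optional[str]:
        """Hand the next undelivered message to whichever consumer asks."""
        if not self._backlog:
            return None
        message = self._backlog.popleft()
        self.delivered.setdefault(consumer, []).append(message)
        return message
```

In this model a single consumer can end up with messages from both `p0` and `p1`, which is why a per-partition offset scheme like Kafka's does not map onto this subscription mode directly.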

niketh commented 5 years ago

@aahmed-se Sounds good, I will take a stab at it.

chariot1498 commented 5 years ago

Hey all, my use case also needs to integrate Druid with Pulsar. Did you have any luck doing that?

joshuadunham commented 4 years ago

Anything I can do to help out with this? I'm also very interested. :)

aahmed-se commented 4 years ago

@joshuadunham You can take a stab at implementing it. It's best to look at the Kinesis or Kafka ingestion integration in Druid and replicate it; it should not be that difficult.

rueian commented 4 years ago

Hi, I am working on this and making some progress with the Reader interface of the Pulsar client.

https://github.com/rueian/incubator-druid/commits/pulsar-indexing-service

May I open a PR after finishing the test cases?

aahmed-se commented 4 years ago

@rueian Are you making changes to Pulsar for this? If not, the code should be posted in Druid itself.

gianm commented 4 years ago

@rueian Please do go ahead and open a PR, especially if you have something working & well tested. Thanks for your interest in contributing!

haris-zynka commented 4 years ago

Any updates on this one? I'm new to both Druid (I recently heard about it but have never used it) and Apache Pulsar (this one, at least, I have used), and I would love to do a test run, but I depend heavily on Pulsar. If this has been forgotten, I might research it myself and try to do something, but it's going to take a lot of time until I figure things out.

rueian commented 4 years ago

Sorry for the lack of follow-up from me. I am still blocked by something else. Please feel free to step in.

devinbost commented 3 years ago

@niketh Were you able to make any progress on this? My team is very interested in this capability.

josephglanville commented 3 years ago

I'm interested in taking this up.

My research leads me to believe the best way to implement this is to build on the SeekableStream abstraction. On the surface this may look like an impedance mismatch, as Pulsar is primarily built around managing offsets/consumer state on the broker side, but I still think the SeekableStream approach is best for Druid because it best suits its notions of tasks and segments.

The way I think this should be implemented is to have the supervisor create a task per partition of the Pulsar topic, each task will then use an exclusive, non-durable subscription that consumes from that specific partition. In this way seeking to a specific message ID can be supported cleanly, which is required to support task resumption and idempotency.

This will result in one segment per task, however, so users of this indexing service will likely want to enable compaction.
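
As a rough sketch of the per-partition plan described above (not Druid or Pulsar client code): the model below assumes the usual `<topic>-partition-<n>` naming for the partitions of a partitioned topic, and stands in for a Pulsar message ID with a simplified `(ledger_id, entry_id)` tuple. `TaskSpec`, `plan_tasks` and `read_from` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Simplified stand-in for a Pulsar message ID (assumption for this sketch).
MessageId = Tuple[int, int]  # (ledger_id, entry_id)


@dataclass(frozen=True)
class TaskSpec:
    partition_topic: str
    start_after: Optional[MessageId]  # None means "start from the earliest message"


def plan_tasks(topic: str, num_partitions: int,
               checkpoints: Dict[str, MessageId]) -> List[TaskSpec]:
    """Supervisor side: one task per partition, resuming from its checkpoint."""
    names = [f"{topic}-partition-{i}" for i in range(num_partitions)]
    return [TaskSpec(name, checkpoints.get(name)) for name in names]


def read_from(log: List[Tuple[MessageId, bytes]],
              start_after: Optional[MessageId],
              max_messages: int) -> Tuple[List[bytes], Optional[MessageId]]:
    """Task side: read a batch strictly after `start_after`, return the new checkpoint."""
    pending = [(mid, data) for mid, data in log
               if start_after is None or mid > start_after]
    batch = pending[:max_messages]
    last = batch[-1][0] if batch else start_after
    return [data for _, data in batch], last
```

Because the start position is an explicit input, re-running a task with the same checkpoint reads the same messages again, which is the property needed for task resumption and idempotency.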

@sijie does this approach sound correct to you?

/cc @gianm @jsun98 @dclim You worked on the SeekableStream abstraction, so I would appreciate your thoughts.

sijie commented 3 years ago

@josephglanville Yes. The approach looks right to me.

An alternative approach is to integrate Druid with Pulsar via Kafka-on-Pulsar, leveraging Druid's existing Kafka integration.

devinbost commented 2 years ago

"each task will then use an exclusive, non-durable subscription that consumes from that specific partition"

Does this mean the topic would be non-persistent? I'd like to better understand what is meant by a "non-durable" subscription.

dat-vikash commented 2 years ago

@devinbost Topic persistence is set when the topic is created.

Persistent topic: persistent://property/cluster/namespace/topic
All messages will be stored on disk.

Non-persistent topic: non-persistent://property/cluster/namespace/topic
Messages will not be persisted to disk.

Durability refers to cursor persistence. Durable subscriptions have their cursor persisted:

If a broker restarts from a failure, it can recover the cursor from the persistent storage (BookKeeper), so that messages can continue to be consumed from the last consumed position.

With a non-durable subscription, the cursor is lost on broker restart:

Once a broker stops, the cursor is lost and can never be recovered, so that messages can not continue to be consumed from the last consumed position.

Since Druid's supervisor creates short-lived tasks and we want to support resumption from any point in the stream, we need to use the Reader interface, which uses a non-durable subscription by default. The read position is then tracked on the Druid side rather than by the broker. So to properly support resiliency and idempotent resumption, we need to use a non-durable subscription.
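
A toy model of the difference, in plain Python rather than the Pulsar client: `Broker` and its methods are invented for this sketch, and integer positions stand in for message IDs. It assumes the reading task keeps its own checkpoint, as the Reader-based approach implies.

```python
from typing import Dict, List, Optional


class Broker:
    """Toy broker: durable cursors survive a restart, non-durable ones do not."""

    def __init__(self, log: List[str]) -> None:
        self.log = list(log)
        self._durable: Dict[str, int] = {}    # stands in for cursors stored in BookKeeper
        self._in_memory: Dict[str, int] = {}  # stands in for cursors held only in broker memory

    def ack(self, subscription: str, position: int, durable: bool) -> None:
        """Record the last consumed position for a subscription."""
        store = self._durable if durable else self._in_memory
        store[subscription] = position

    def cursor(self, subscription: str) -> Optional[int]:
        """Return the broker's view of the cursor, or None if it is unknown."""
        if subscription in self._durable:
            return self._durable[subscription]
        return self._in_memory.get(subscription)

    def restart(self) -> None:
        """Simulate a broker restart: in-memory cursor state is gone."""
        self._in_memory.clear()

    def read_after(self, position: Optional[int]) -> List[str]:
        """Reader-style access: the caller says where to start."""
        start = 0 if position is None else position + 1
        return self.log[start:]
```

After `restart()`, the non-durable cursor is gone from the broker's point of view, but a task that stored its own checkpoint can still call `read_after(checkpoint)` and continue from the same place.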

github-actions[bot] commented 1 year ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

devinbost commented 1 year ago

We'd still like this feature.

github-actions[bot] commented 4 months ago

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

github-actions[bot] commented 3 months ago

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

apache / druid

Add apache pulsar based indexing service #7030