apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0
5.4k stars 1.27k forks source link

feature: support GCP's PubSub as realtime ingestion #6556

Open vmarchaud opened 3 years ago

vmarchaud commented 3 years ago

GCP's pubsub implemention doesn't have any kind of partition/offset system so current LLC implementation obviously can't work with it. If people are interested here is an implementation that essentially "fake" offsets and always pull latest messages. However if for some reason the server is stopped while a segment is consuming, the plugin will not re-ingest events that were lost

We also tried to implement it as a HLC (implementation is here) but its pretty much unusable since segments created in HL mode aren't saved to deep storage so you must rely on replicas created through pinot (not handled in the implementation we've made but could be by creating multiple subs).

Since both approach are obviously not doable in production, we have decided to move on to a different pubsub system (most likely apache pulsar) so i don't actually request it to be implemented, i'm pretty much opening it so people looking for this will not waste weeks of research :)

More discussion on slack:

mcvsubbu commented 3 years ago

Thanks, @vmarchaud . Does pubsub partition the stream, or is it all in one partition? If there is only one partition, then that is another reason why it may not work in some production use cases with very high ingestion rate. We will be limited by the consumption capacity of a single server to consume the complete stream (this is another reason we moved away from HLC). On the other hand, a partitioned stream allows us to spread the consumption across multiple servers, thus allowing us to scale horizontally as ingestion rate increases.

vmarchaud commented 3 years ago

Does pubsub partition the stream, or is it all in one partition

Yeah it's pretty much only one partition, at least from all the doc i read about it. However you can totally throw multiple workers on the same "partition" because the system will only allow to pull messages on one worker that hasn't picked pulled by another (of course it might re-deliver if a worker miss what they call a "ack deadline" which is around ~15s).

I guess the model is fondamentaly different, kafka and generally other pubsub will expose more logic to the consumer side where as GCP's pubsub will handle pretty much everything itself.

mcvsubbu commented 3 years ago

This is related to Issue #6302 . This one too, started off by asking for pause/commit/restart functionality but we concluded that it is not going to solve production problems. The right way to do this is to support a stream with this characteristics