apache / pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Segment reader on the deep store for compute frameworks like Presto and Spark #7036

Open yupeng9 opened 3 years ago

yupeng9 commented 3 years ago

There are scenarios where users need to run complex ad-hoc queries (e.g. a multi-way join with other datasets) or ETL jobs that scan all Pinot segments. Pinot servers are not designed for this pattern of workloads. Even if we allowed this type of scan over the segments, such queries could significantly degrade server performance and affect other important online queries.

As an alternative, compute frameworks like Presto and Spark can directly scan the segments stored on Pinot's deep store like HDFS or S3, if there is a segment reader available.

atris commented 3 years ago

Folks, this is relevant to a use case we are evaluating Pinot for. Can I take this issue up?

mayankshriv commented 3 years ago

@yupeng9 Is the proposal to have Presto/Spark engines read the Pinot segment format? The segment format in Pinot is optimized to assist Pinot's own query execution engine. If the proposal is to bypass the query execution entirely, what value does the format itself provide to other engines? In other words, why would running Presto/Spark on the Pinot format be better than running on ORC/Parquet? One benefit I see is that you have a single source of data for Pinot vs Spark/Presto, but I wanted to check if there are other benefits too.

mcvsubbu commented 3 years ago

How about a feature to "back up" data (I found there is no opposite of "ingest"; the closest nice word was maybe disgorge :)) from Pinot to the deep store in some (configurable) format? Minion jobs could do that.

yupeng9 commented 3 years ago

@atris You are welcome to take this. There are existing implementations of segment reading in the Pinot server, so an important part of this work is to abstract the interface out. So it will be good to share the interface design first with the community and reach an agreement.

@mayankshriv Yes, apart from the isolation benefits that I mentioned in the issue description, a single source of data is another one, so that users do not need to ingest the Kafka stream into other data sources like Hive and can directly query Pinot segments. We have seen asks like this, particularly for upsert tables, since the ingested Hive table does not have upsert semantics.

xiangfu0 commented 3 years ago

I actually found a RecordReader interface and its implementation PinotSegmentRecordReader. Shall we just try to extend this?
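
For reference, a minimal sketch of what reading rows through PinotSegmentRecordReader could look like; package names and constructors differ across Pinot versions, so this is illustrative rather than exact:

```java
// Illustrative only: iterate the rows of a local, untarred Pinot segment
// directory via PinotSegmentRecordReader. Exact package names and
// constructors vary between Pinot releases.
import java.io.File;

import org.apache.pinot.segment.local.segment.readers.PinotSegmentRecordReader;
import org.apache.pinot.spi.data.readers.GenericRow;

public class SegmentScanSketch {
  public static void main(String[] args) throws Exception {
    // Path to a segment that has already been downloaded from the deep store
    // and untarred locally.
    File indexDir = new File(args[0]);
    try (PinotSegmentRecordReader reader = new PinotSegmentRecordReader(indexDir)) {
      GenericRow row = new GenericRow();
      while (reader.hasNext()) {
        row.clear();
        reader.next(row);
        System.out.println(row);
      }
    }
  }
}
```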

yupeng9 commented 3 years ago

Yes, I think it's possible. We could extend it to read from other sources like HDFS, and we could also enhance it with filter pushdown (FilterScan).
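
To make the interface discussion concrete, one strawman shape for an engine-facing reader with a pluggable source (local dir, HDFS, S3) and a filter pushdown hook could look like the following; none of these names exist in Pinot today:

```java
// Strawman only: a possible shape for an engine-facing segment reader that
// can open segments from any deep-store URI and push a filter down to the
// scan. None of these interfaces exist in Pinot today.
import java.io.Closeable;
import java.io.IOException;
import java.net.URI;
import java.util.Set;

import org.apache.pinot.spi.data.readers.GenericRow;

public interface DeepStoreSegmentReader extends Closeable {
  // Open a segment given its deep-store location (hdfs://..., s3://..., file://...).
  void init(URI segmentUri, Set<String> fieldsToRead) throws IOException;

  // Optional filter pushdown: only rows matching the predicate are returned.
  void setFilter(RowPredicate predicate);

  boolean hasNext();

  GenericRow next(GenericRow reuse) throws IOException;

  // Simple predicate abstraction for the pushed-down FilterScan.
  @FunctionalInterface
  interface RowPredicate {
    boolean matches(GenericRow row);
  }
}
```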

atris commented 3 years ago

I am happy to take this task

mcvsubbu commented 3 years ago

If we push data back to HDFS in some well-known format, then the advantage is that we don't have to write a Pinot connector for each of the other pieces that want to read this data.

yupeng9 commented 3 years ago

@mcvsubbu That's another option on the table. But do you suggest some async job to manage this, like an ETL job, or making it part of pushing data back to the deep store? Does that mean a table owner can choose the backup format, like Parquet?

mcvsubbu commented 3 years ago

@yupeng9 A configured minion job works. Alternatively, a periodic task in the grid also works (since we already have an interface to pull segments from Pinot). Of course, Pinot's realtime segments will not be a part of this, but I am not sure that is a requirement.

mayankshriv commented 3 years ago

Why not just retain the original data (format) from which Pinot segments were generated in the first place?

mcvsubbu commented 3 years ago

If those segments are already there, then this issue is moot. I am assuming that they are not there, probably because we have a realtime only table. (Or, auto-moving the completed realtime to the offline table).

yupeng9 commented 3 years ago

Not sure I fully followed this. For a real-time only table, we still have the sealed segments stored in the deep store, right? In Uber's case, the deep store is HDFS, so there are already segments available. With the reader, compute frameworks can directly query those.

Also, we do not use minion at Uber right now, so a reader integration with Presto/Spark would save us this ETL job. And I assume we would need this reader for the minion job to work anyway, right?

mcvsubbu commented 3 years ago

I should not have said "segments". @mayankshriv was asking about the data in original format before it is pushed to pinot (avro/orc/parquet/whatever), and that is what I was writing about.

icefury71 commented 3 years ago

@mcvsubbu @mayankshriv from what I understand, the data source is realtime-only in Uber's case (for the corresponding use case). Yes, you could argue that it can be persisted directly from Kafka -> Parquet via some ETL job, but that would be redundant and operationally expensive. Since we already have all the data sitting in the deep store (in Pinot format), exposing a read API on such archived segments makes sense to me.

mcvsubbu commented 3 years ago

@icefury71 I was not suggesting that we ETL from Kafka -> Parquet. I was suggesting that we provide a job to do the reverse conversion -- Pinot segment to Avro/Parquet/ORC/whatever. This should be fairly easy to write, and it stays under Pinot's control. If we do this, we don't need to provide a specialized reader for Spark/Presto/Trino/whatever.
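
A rough sketch of what such a reverse-conversion job could look like, streaming rows out of a segment with the existing PinotSegmentRecordReader and writing Avro; it assumes you already have an Avro schema matching the Pinot table schema, and a real job would also need proper type mapping and multi-value handling:

```java
// Rough sketch only: convert one Pinot segment back to an Avro file by
// streaming rows through PinotSegmentRecordReader. Assumes an Avro schema
// matching the table schema is supplied; a real job needs a complete
// Pinot-to-Avro type mapping and multi-value/raw-bytes handling.
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.pinot.segment.local.segment.readers.PinotSegmentRecordReader;
import org.apache.pinot.spi.data.readers.GenericRow;

public class SegmentToAvroSketch {

  public static void convert(File indexDir, Schema avroSchema, File outFile) throws Exception {
    try (PinotSegmentRecordReader reader = new PinotSegmentRecordReader(indexDir);
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(avroSchema))) {
      writer.create(avroSchema, outFile);
      GenericRow row = new GenericRow();
      while (reader.hasNext()) {
        row.clear();
        reader.next(row);
        GenericRecord record = new GenericData.Record(avroSchema);
        for (Schema.Field field : avroSchema.getFields()) {
          // Naive copy; column values may need conversion (e.g. multi-value
          // columns come back as Object[]) before they match the Avro types.
          record.put(field.name(), row.getValue(field.name()));
        }
        writer.append(record);
      }
    }
  }
}
```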

yupeng9 commented 3 years ago

@mcvsubbu By Pinot control, do you mean a minion job? Or is it possible to make this part of segment sealing and saving to the deep store?

mcvsubbu commented 3 years ago

By Pinot control, I mean we have full control of the source as well as how it is used, APIs, etc.

yupeng9 commented 3 years ago

We saw some other use cases that need this segment reader. Usually an ingestion pipeline can be set up from Kafka to Hive. However, such an ingestion setup is difficult for upsert tables, especially partial upsert, because it's sometimes not straightforward to implement the partial upsert logic in the Hive ETL to derive the same full values from the raw Kafka-ingested Hive dataset. It would be easier to just read the segments directly from Pinot, since Pinot already derives the full values using the partial upsert logic.