apache/pinot

Apache Pinot - A realtime distributed OLAP datastore
https://pinot.apache.org/
Apache License 2.0

Support mass data export #12315

Open egalpin opened 6 months ago

egalpin commented 6 months ago

I know mass export isn’t a primary focus of Pinot (understandably, given the focus on real-time ingestion and low-latency querying), but there are use cases where it would be very useful. Alternative options exist, but when upserts are employed and consistency between exported data and aggregate results matters, serving both from the same source of data (i.e. Pinot) would be ideal.

Latency requirements for this use case would be very different, with minutes/hours (days?) being completely acceptable. The key requirement for an appropriate solution would be minimizing the impact on servers and brokers so that they can continue serving low-latency queries.

One high-level concept: given a SQL query without any aggregations, generate minion tasks, each consisting of a segment name plus an ID_SET of the doc IDs matching the query. Each minion would then download its segment from a server or deep store, pluck out the matching documents using the segment's corresponding ID_SET, apply the transformations from the SQL query's projections, and write the resulting rows back to deep store as CSV (or Parquet, or another configurable format). This would move much of the heavy lifting of mass export (disk seeks, long-running scans) onto minions, work that would otherwise be concerning for servers to take on while also serving other queries. A rough sketch of the per-segment task is below.
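To make the shape of this concrete, here is a hypothetical sketch of what a per-segment export task body could look like. The class and helper names (`SegmentExportTask`, `downloadSegmentFromDeepStore`, `readProjectedRow`, `writeToDeepStore`) are illustrative placeholders, not existing Pinot APIs:

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

/**
 * Hypothetical per-segment export task. Each task would receive one segment
 * plus the ID_SET of doc IDs matched by the export query.
 */
public class SegmentExportTask {

  private final String segmentName;       // segment to export from
  private final int[] matchingDocIds;     // doc IDs selected by the query filter (ID_SET)
  private final List<String> projections; // columns/expressions to emit

  public SegmentExportTask(String segmentName, int[] matchingDocIds, List<String> projections) {
    this.segmentName = segmentName;
    this.matchingDocIds = matchingDocIds;
    this.projections = projections;
  }

  public void run() throws IOException {
    // 1. Pull the immutable segment from deep store (or a server) to local disk,
    //    so the scan happens on the minion instead of a query-serving node.
    File localSegment = downloadSegmentFromDeepStore(segmentName);

    // 2. Read only the matching docs, applying the projection/transform expressions.
    Object[][] rows = new Object[matchingDocIds.length][];
    for (int i = 0; i < matchingDocIds.length; i++) {
      rows[i] = readProjectedRow(localSegment, matchingDocIds[i], projections);
    }

    // 3. Write the result back to deep store in a configurable format (CSV, Parquet, ...).
    writeToDeepStore(segmentName + ".csv", rows);
  }

  // Placeholder helpers: a real implementation would use Pinot's segment readers
  // and deep-store I/O utilities here.
  private File downloadSegmentFromDeepStore(String segment) {
    throw new UnsupportedOperationException("sketch only");
  }

  private Object[] readProjectedRow(File segmentDir, int docId, List<String> projections) {
    throw new UnsupportedOperationException("sketch only");
  }

  private void writeToDeepStore(String outputPath, Object[][] rows) {
    throw new UnsupportedOperationException("sketch only");
  }
}
```

Because each task operates on a single immutable segment, the work should parallelize naturally across minions and individual tasks could be retried independently.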

This issue could serve as a starting point for brainstorming requirements and gauging community interest prior to undertaking a design document.

cc @mcvsubbu @mayankshriv

vmarchaud commented 6 months ago

I've been using Pinot in production for a few years now to store customer data, and this limitation has us looking into alternative DBs for performing aggregations and saving the results into a Pinot table, so this feature is really interesting to me.

I've been wondering if another approach would be to somehow transform segments into Parquet files and perform the aggregations with another tool like Arrow DataFusion?

cbalci commented 5 months ago

@egalpin have you looked into using the Spark Connector for batch-reading data from Pinot? It is more efficient than broker queries for exporting large amounts of data: it connects directly to the servers and uses gRPC streaming for memory efficiency. With Spark you should also be able to export to popular formats easily.
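For reference, a minimal sketch of what that could look like, assuming the `pinot-spark-connector` is on the classpath. The table name, controller address, and output path are placeholders, and the option names should be verified against the connector docs for your version:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PinotBulkExport {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("pinot-bulk-export")
        .getOrCreate();

    // Batch-read the Pinot table via the connector; the scan goes to the
    // servers directly rather than through the broker.
    Dataset<Row> df = spark.read()
        .format("pinot")
        .option("table", "myTable")              // placeholder table name
        .option("tableType", "offline")          // or "realtime"/"hybrid"
        .option("controller", "localhost:9000")  // placeholder controller address
        .load();

    // Export in any Spark-supported format, e.g. Parquet on S3.
    df.write().parquet("s3a://my-bucket/exports/myTable/");

    spark.stop();
  }
}
```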

kishoreg commented 5 months ago

This is a great feature and we should have this on the roadmap. cc @npawar @mayankshriv

egalpin commented 5 months ago

Thanks @cbalci, I'll have a look at the Spark Connector and see if that might suit my needs as well 👍

abhijeetkushe commented 4 months ago

Is there a Spark driver to directly read data from cold storage like S3?