apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Feature] Spark: introduce columnarReaderFactory to support sending ColumnarBatch #825

Open zoucao opened 1 year ago

zoucao commented 1 year ago


Motivation

Now, we use the vectorized reader to read data from Parquet and ORC, but we send InternalRows downstream one by one. For append-only tables, and for primary-key tables with full compaction, we can send a ColumnarBatch instead to accelerate serialization/deserialization.
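To make the cost concrete, here is a tiny self-contained sketch of the contrast (not Paimon code; Row and ColumnBatch are made-up stand-ins for InternalRow and ColumnarBatch): the per-row path crosses the reader/consumer boundary once per record with one wrapper allocation each, while the batched path hands the decoded column over once.

```java
import java.util.ArrayList;
import java.util.List;

public class HandoffCost {
    // Made-up stand-in for one decoded record (InternalRow-like).
    record Row(int value) {}

    // Made-up stand-in for a ColumnarBatch: one contiguous int column.
    record ColumnBatch(int[] values) {}

    // Per-row path: each record is wrapped in a Row object and handed
    // over individually -- N allocations, N handoffs.
    static List<Row> emitRowByRow(int[] decoded) {
        List<Row> out = new ArrayList<>(decoded.length);
        for (int v : decoded) {
            out.add(new Row(v));
        }
        return out;
    }

    // Batched path: the decoded column is handed over once, with no
    // per-record wrapping -- this is the saving the issue is after.
    static ColumnBatch emitBatch(int[] decoded) {
        return new ColumnBatch(decoded);
    }

    public static void main(String[] args) {
        int[] decoded = {10, 20, 30, 40};
        System.out.println(emitRowByRow(decoded).size());       // 4 handoffs
        System.out.println(emitBatch(decoded).values().length); // 1 handoff of 4 values
    }
}
```

The same data reaches the consumer either way; only the number of boundary crossings and per-record allocations differs, which is why batching helps most on wide scans.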

Solution

No response

Anything else?

No response

Are you willing to submit a PR?

JingsongLi commented 1 year ago

Full-compaction: https://paimon.apache.org/docs/master/maintenance/read-performance/

JingsongLi commented 1 year ago

+1 for this feature. Actually, we already return a RecordReader; we can check whether it is a ColumnarRecordReader/ConcatRecordReader and, if so, convert its output to ColumnarBatch.
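That type-dispatch idea can be sketched as follows. The interfaces below are hypothetical stand-ins, not Paimon's or Spark's actual classes: if the reader is columnar, the batch is passed through untouched; otherwise we fall back to materializing rows one by one.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ReaderDispatch {
    // Hypothetical stand-in; Paimon's real RecordReader hierarchy differs.
    interface RecordReader {
        Iterator<Integer> rows(); // per-row fallback path
    }

    // Hypothetical columnar variant that can expose a whole column at once.
    interface ColumnarRecordReader extends RecordReader {
        int[] batch();
    }

    // Stand-in for Spark's ColumnarBatch: wraps one int column.
    record ColumnarBatch(int[] column) {}

    // Columnar readers hand their batch through directly; row readers
    // are drained record by record into a freshly built column.
    static ColumnarBatch toBatch(RecordReader reader) {
        if (reader instanceof ColumnarRecordReader columnar) {
            return new ColumnarBatch(columnar.batch());
        }
        List<Integer> buf = new ArrayList<>();
        Iterator<Integer> it = reader.rows();
        while (it.hasNext()) {
            buf.add(it.next());
        }
        int[] col = new int[buf.size()];
        for (int i = 0; i < col.length; i++) {
            col[i] = buf.get(i);
        }
        return new ColumnarBatch(col);
    }

    public static void main(String[] args) {
        RecordReader rowReader = () -> List.of(1, 2, 3).iterator();
        ColumnarRecordReader colReader = new ColumnarRecordReader() {
            public int[] batch() { return new int[]{4, 5, 6}; }
            public Iterator<Integer> rows() { throw new UnsupportedOperationException(); }
        };
        System.out.println(toBatch(rowReader).column().length); // 3, via fallback
        System.out.println(toBatch(colReader).column().length); // 3, zero-copy passthrough
    }
}
```

In the real integration the dispatch would presumably live in the Spark connector's reader factory, so that only eligible tables (append-only, or primary-key with full compaction) take the columnar path.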

zoucao commented 1 year ago

I am willing to work on it, could you assign it to me? BTW, I suggest dividing the issue into two PRs: one for append-only tables, the other for primary-key tables with full compaction.

JingsongLi commented 1 year ago

> I am willing to work on it, could you assign it to me? BTW, I suggest dividing the issue into two PRs: one for append-only tables, the other for primary-key tables with full compaction.

Let's go~