apache / paimon

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
https://paimon.apache.org/
Apache License 2.0

[Feature] Spark: introduce columnarReaderFactory to support sending ColumnarBatch #825

Open zoucao opened 1 year ago

zoucao commented 1 year ago


Motivation

Now, we use the vectorized reader to read data from Parquet and ORC, but we send InternalRows downstream one by one. For append-only tables, and for primary-key tables with full compaction, we can send a ColumnarBatch instead to accelerate serialization/deserialization.
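To make the cost concrete, here is a tiny self-contained sketch of the contrast (not Paimon code; Row and ColumnBatch are made-up stand-ins for InternalRow and ColumnarBatch): the per-row path crosses the reader/consumer boundary once per record with one wrapper allocation each, while the batched path hands the decoded column over once.

```java
import java.util.ArrayList;
import java.util.List;

public class HandoffCost {
    // Made-up stand-in for one decoded record (InternalRow-like).
    record Row(int value) {}

    // Made-up stand-in for a ColumnarBatch: one contiguous int column.
    record ColumnBatch(int[] values) {}

    // Per-row path: each record is wrapped in a Row object and handed
    // over individually -- N allocations, N handoffs.
    static List<Row> emitRowByRow(int[] decoded) {
        List<Row> out = new ArrayList<>(decoded.length);
        for (int v : decoded) {
            out.add(new Row(v));
        }
        return out;
    }

    // Batched path: the decoded column is handed over once, with no
    // per-record wrapping -- this is the saving the issue is after.
    static ColumnBatch emitBatch(int[] decoded) {
        return new ColumnBatch(decoded);
    }

    public static void main(String[] args) {
        int[] decoded = {10, 20, 30, 40};
        System.out.println(emitRowByRow(decoded).size());       // 4 handoffs
        System.out.println(emitBatch(decoded).values().length); // 1 handoff of 4 values
    }
}
```

The same data reaches the consumer either way; only the number of boundary crossings and per-record allocations differs, which is why batching helps most on wide scans.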

Solution

No response

Anything else?

No response

Are you willing to submit a PR?

JingsongLi commented 1 year ago

Full-compaction: https://paimon.apache.org/docs/master/maintenance/read-performance/

JingsongLi commented 1 year ago

+1 for this feature. Actually, we already return a RecordReader; we can check whether it is a ColumnarRecordReader/ConcatRecordReader and, if so, convert its output to ColumnarBatch.
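That type-dispatch idea can be sketched as follows. The interfaces below are hypothetical stand-ins, not Paimon's or Spark's actual classes: if the reader is columnar, the batch is passed through untouched; otherwise we fall back to materializing rows one by one.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class ReaderDispatch {
    // Hypothetical stand-in; Paimon's real RecordReader hierarchy differs.
    interface RecordReader {
        Iterator<Integer> rows(); // per-row fallback path
    }

    // Hypothetical columnar variant that can expose a whole column at once.
    interface ColumnarRecordReader extends RecordReader {
        int[] batch();
    }

    // Stand-in for Spark's ColumnarBatch: wraps one int column.
    record ColumnarBatch(int[] column) {}

    // Columnar readers hand their batch through directly; row readers
    // are drained record by record into a freshly built column.
    static ColumnarBatch toBatch(RecordReader reader) {
        if (reader instanceof ColumnarRecordReader columnar) {
            return new ColumnarBatch(columnar.batch());
        }
        List<Integer> buf = new ArrayList<>();
        Iterator<Integer> it = reader.rows();
        while (it.hasNext()) {
            buf.add(it.next());
        }
        int[] col = new int[buf.size()];
        for (int i = 0; i < col.length; i++) {
            col[i] = buf.get(i);
        }
        return new ColumnarBatch(col);
    }

    public static void main(String[] args) {
        RecordReader rowReader = () -> List.of(1, 2, 3).iterator();
        ColumnarRecordReader colReader = new ColumnarRecordReader() {
            public int[] batch() { return new int[]{4, 5, 6}; }
            public Iterator<Integer> rows() { throw new UnsupportedOperationException(); }
        };
        System.out.println(toBatch(rowReader).column().length); // 3, via fallback
        System.out.println(toBatch(colReader).column().length); // 3, zero-copy passthrough
    }
}
```

In the real integration the dispatch would presumably live in the Spark connector's reader factory, so that only eligible tables (append-only, or primary-key with full compaction) take the columnar path.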

zoucao commented 1 year ago

I am willing to work on it, could you assign it to me? BTW, I suggest dividing the issue into two PRs: one for append-only tables, the other for primary-key tables with full compaction.

JingsongLi commented 1 year ago

> I am willing to work on it, could you assign it to me? BTW, I suggest dividing the issue into two PRs: one for append-only tables, the other for primary-key tables with full compaction.

Let's go~