baidu / bigflow

Baidu Bigflow is an interface for writing distributed computing programs and provides many simple, flexible, and powerful APIs. Using Bigflow, you can easily handle data of any scale. Bigflow processes over 4PB of data inside Baidu and runs about 10k jobs every day.
http://baidu.github.io/bigflow
Apache License 2.0

Support structured IO formats on Spark #11

Open advancedxy opened 6 years ago

advancedxy commented 6 years ago

Definitions

Structured input formats here specifically mean ORC files and Parquet files.

Current Status

Bigflow on DCE supports ORC files (read only) and Parquet files through its own loaders, since DCE doesn't support reading ORC or Parquet natively.

For ORC files, Bigflow uses ORC's C++ API. At the time ORC support was added, ORC's C++ API only supported reading.

For Parquet files, Bigflow also uses the C++ API. Currently, parquet-cpp only partially supports nested structures.

Bigflow on Spark supports neither ORC nor Parquet for now. This doc lists some details on how we can add support for ORC and Parquet files.

Parquet Support Architecture Overview on DCE

[Figure: parquet_architecture — overview of the Parquet loader architecture on DCE]

The ORC loader follows a similar procedure.

How to add support for the Spark pipeline

Read support

The RecordBatch in the architecture above is an Arrow RecordBatch. Spark already supports transforming a Dataset into RDD[ArrowPayload] (see Dataset.scala), though that API is not public.

It would be straightforward to add Parquet read support to the Spark pipeline, and even ORC or CSV support.

Implementation details for adding read support

  1. Use SparkSession to read Parquet or ORC files (the Spark pipeline currently uses SparkContext)
  2. Implement toArrowPayload in flume-runtime, as Spark doesn't expose it publicly (see the sketch after this list)
  3. Reuse and refactor the current PythonFromRecordBatchProcessor
  4. Modify Bigflow's planner to use PythonFromRecordBatchProcessor for the Spark pipeline's structured input when constructing the Flume task
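
A minimal sketch of steps 1 and 2, assuming a Spark 2.x build where Dataset.toArrowPayload exists but is private[sql]; the ArrowPayloadReader object and readAsArrowPayload method are hypothetical names, and placing the helper under the org.apache.spark.sql package is just one way for flume-runtime code to reach the non-public API:

```scala
// Hypothetical flume-runtime helper; a sketch only, assuming Spark 2.x where
// Dataset.toArrowPayload is private[sql]. Declaring the object inside the
// org.apache.spark.sql package is one way to access that non-public method.
package org.apache.spark.sql

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.execution.arrow.ArrowPayload

object ArrowPayloadReader {
  // Read a structured file with SparkSession (step 1) and expose it as
  // RDD[ArrowPayload] (step 2) so that the downstream processor can consume
  // the data as Arrow RecordBatches.
  def readAsArrowPayload(spark: SparkSession,
                         path: String,
                         format: String): RDD[ArrowPayload] = {
    val df = format match {
      case "parquet" => spark.read.parquet(path)
      case "orc"     => spark.read.orc(path)
      case other     => throw new IllegalArgumentException(s"unsupported format: $other")
    }
    df.toArrowPayload // non-public in Spark 2.x, hence the package placement above
  }
}
```

Each payload's Arrow batch bytes could then be handed to the refactored PythonFromRecordBatchProcessor (steps 3 and 4), mirroring the DCE architecture above.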

Write support

Bigflow uses its own sinker implementations to write a PCollection (or PType) to an external target.

The current implementation on DCE should also work on Spark, although some additional work is needed, namely:

  1. Refactor the current ParquetSinker and the Arrow schema converter (a sketch of the schema mapping follows this list)
  2. Add write support for ORC files (ORC's C++ API is adding write support incrementally)
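
The existing ParquetSinker and schema converter live in Bigflow's C++ code. Purely as an illustration of the kind of mapping such a converter performs, here is a minimal Scala sketch against Arrow's Java API; the object and method names are hypothetical, and only a few flat primitive types are covered:

```scala
// Illustrative sketch of a Spark-to-Arrow schema conversion; not Bigflow's
// actual C++ converter.
import org.apache.arrow.vector.types.FloatingPointPrecision
import org.apache.arrow.vector.types.pojo.{ArrowType, Field, FieldType, Schema}
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._

object ArrowSchemaSketch {
  // Map a few Spark SQL primitive types to their Arrow equivalents.
  private def toArrowType(dt: DataType): ArrowType = dt match {
    case BooleanType => new ArrowType.Bool()
    case IntegerType => new ArrowType.Int(32, true)
    case LongType    => new ArrowType.Int(64, true)
    case DoubleType  => new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)
    case StringType  => new ArrowType.Utf8()
    case other       => throw new UnsupportedOperationException(s"unsupported type: $other")
  }

  // Convert a flat Spark StructType into an Arrow Schema. Nested types would
  // need child fields, which is where the current converter needs refactoring.
  def toArrowSchema(schema: StructType): Schema = {
    val fields = schema.fields.map { f =>
      new Field(f.name,
        new FieldType(f.nullable, toArrowType(f.dataType), null),
        List.empty[Field].asJava)
    }
    new Schema(fields.toSeq.asJava)
  }
}
```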

References

  1. Apache Arrow is a promising in-memory columnar format; we can leverage more of its power. See the Arrow SlideShare

cc @himdd @chunyang-wen @bb7133 @acmol for comments; PRs are appreciated

advancedxy commented 6 years ago

@chunyang-wen https://github.com/apache/orc/pull/188 It looks like ORC has finished write support in its C++ API.