baidu / bigflow

Baidu Bigflow is an interface for writing distributed computing programs and provides many simple, flexible, and powerful APIs. Using Bigflow, you can easily handle data of any scale. Bigflow processes over 4PB of data inside Baidu and runs about 10k jobs every day.
http://baidu.github.io/bigflow
Apache License 2.0

Support structured IO formats on Spark #11

Open advancedxy opened 6 years ago

advancedxy commented 6 years ago

Definitions

Structured input formats here specifically mean ORC files and Parquet files.

Current Status

Bigflow on DCE supports ORC files (read only) and Parquet files through its own loaders, since DCE doesn't support reading ORC or Parquet natively.

For ORC files, Bigflow uses ORC's C++ API. At the time ORC support was added, ORC's C++ API only supported reading.

For Parquet files, Bigflow also uses the C++ API. Currently, parquet-cpp only partially supports nested structures.

Bigflow on Spark supports neither ORC nor Parquet for now. This doc lists some details on how we can add support for ORC and Parquet files.

Parquet Support Architecture Overview on DCE

[Figure: parquet_architecture — overview of the Parquet loader architecture on DCE]

The ORC loader follows a similar procedure.

How to add support for the Spark pipeline

Read support

The RecordBatch in the architecture above is an Arrow RecordBatch. Spark already supports transforming a Dataset into RDD[ArrowPayload] (see Dataset.scala), though that API is not public.

It would be straightforward to add Parquet read support to the Spark pipeline, and even ORC or CSV support.

Implementation details for adding read support

  1. Use SparkSession to read Parquet or ORC files (the Spark pipeline currently uses SparkContext)
  2. Implement toArrowPayload in flume-runtime, as Spark doesn't expose it publicly (see the sketch after this list)
  3. Reuse and refactor the current PythonFromRecordBatchProcessor
  4. Modify Bigflow's planner to use PythonFromRecordBatchProcessor for the Spark pipeline's structured input when constructing the Flume task
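
A minimal sketch of steps 1 and 2, assuming a Spark 2.x build where Dataset.toArrowPayload exists but is private[sql]; the ArrowPayloadReader object and readAsArrowPayload method are hypothetical names, and placing the helper under the org.apache.spark.sql package is just one way for flume-runtime code to reach the non-public API:

```scala
// Hypothetical flume-runtime helper; a sketch only, assuming Spark 2.x where
// Dataset.toArrowPayload is private[sql]. Declaring the object inside the
// org.apache.spark.sql package is one way to access that non-public method.
package org.apache.spark.sql

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.execution.arrow.ArrowPayload

object ArrowPayloadReader {
  // Read a structured file with SparkSession (step 1) and expose it as
  // RDD[ArrowPayload] (step 2) so that the downstream processor can consume
  // the data as Arrow RecordBatches.
  def readAsArrowPayload(spark: SparkSession,
                         path: String,
                         format: String): RDD[ArrowPayload] = {
    val df = format match {
      case "parquet" => spark.read.parquet(path)
      case "orc"     => spark.read.orc(path)
      case other     => throw new IllegalArgumentException(s"unsupported format: $other")
    }
    df.toArrowPayload // non-public in Spark 2.x, hence the package placement above
  }
}
```

Each payload's Arrow batch bytes could then be handed to the refactored PythonFromRecordBatchProcessor (steps 3 and 4), mirroring the DCE architecture above.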

Write support

Bigflow uses its own sinker implementations to write a PCollection (or PType) to an external target.

The current implementation on DCE should also work on Spark, although some additional work is needed, namely:

  1. Refactor the current ParquetSinker and the Arrow schema converter (a sketch of the schema mapping follows this list)
  2. Add write support for ORC files (ORC's C++ API is adding write support incrementally)
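
The existing ParquetSinker and schema converter live in Bigflow's C++ code. Purely as an illustration of the kind of mapping such a converter performs, here is a minimal Scala sketch against Arrow's Java API; the object and method names are hypothetical, and only a few flat primitive types are covered:

```scala
// Illustrative sketch of a Spark-to-Arrow schema conversion; not Bigflow's
// actual C++ converter.
import org.apache.arrow.vector.types.FloatingPointPrecision
import org.apache.arrow.vector.types.pojo.{ArrowType, Field, FieldType, Schema}
import org.apache.spark.sql.types._
import scala.collection.JavaConverters._

object ArrowSchemaSketch {
  // Map a few Spark SQL primitive types to their Arrow equivalents.
  private def toArrowType(dt: DataType): ArrowType = dt match {
    case BooleanType => new ArrowType.Bool()
    case IntegerType => new ArrowType.Int(32, true)
    case LongType    => new ArrowType.Int(64, true)
    case DoubleType  => new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE)
    case StringType  => new ArrowType.Utf8()
    case other       => throw new UnsupportedOperationException(s"unsupported type: $other")
  }

  // Convert a flat Spark StructType into an Arrow Schema. Nested types would
  // need child fields, which is where the current converter needs refactoring.
  def toArrowSchema(schema: StructType): Schema = {
    val fields = schema.fields.map { f =>
      new Field(f.name,
        new FieldType(f.nullable, toArrowType(f.dataType), null),
        List.empty[Field].asJava)
    }
    new Schema(fields.toSeq.asJava)
  }
}
```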

References

  1. Apache Arrow is a promising in-memory columnar format; we can leverage more of its power. See the Arrow SlideShare

cc @himdd @chunyang-wen @bb7133 @acmol for comments; PRs are appreciated

advancedxy commented 6 years ago

@chunyang-wen https://github.com/apache/orc/pull/188 It looks like ORC has finished write support in its C++ API.