baidu / bigflow

Baidu Bigflow is an interface that allows for writing distributed computing programs and provides lots of simple, flexible, powerful APIs. Using Bigflow, you can easily handle data of any scale. Bigflow processes 4P+ data inside Baidu and runs about 10k jobs every day.
http://baidu.github.io/bigflow
Apache License 2.0

API Plan optimizations are needed #4

Open acmol opened 6 years ago

acmol commented 6 years ago

We should add an optimization layer to the current API layer. It could be called the API Plan layer.

At the moment, the API layer transforms the user's code directly into a LogicalPlan, and some information is lost along the way. For example, the LogicalPlan doesn't know what a join is. It only knows that two nodes are cogrouped and that a Processor then processes the two cogrouped results.

For example,

pc1.distinct().join(pc2)

is equivalent to

pc1.cogroup(pc2) \
       .apply_values(lambda p1, p2: p1.distinct().cartesian(p2)) \
       .flatten_values()

But we can't optimize this automatically without the help of an API Plan.
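The equivalence in the example above can be checked with a small plain-Python emulation. This is only a sketch: it does not use the Bigflow API, the helper functions `distinct`, `join`, `cogroup` and `join_via_cogroup` are hypothetical stand-ins operating on lists of `(key, value)` pairs, and the `(key, (left, right))` output shape is an assumption.

```python
from collections import defaultdict
from itertools import product


def distinct(records):
    # Drop duplicate records while keeping first-seen order.
    seen = set()
    out = []
    for record in records:
        if record not in seen:
            seen.add(record)
            out.append(record)
    return out


def join(left, right):
    # Inner join of (key, value) pairs; emits (key, (left_value, right_value)).
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return [(key, (lv, rv)) for key, lv in left for rv in right_by_key.get(key, [])]


def cogroup(left, right):
    # Group both inputs by key: key -> (left_values, right_values).
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    return dict(groups)


def join_via_cogroup(left, right):
    # Emulates cogroup(...).apply_values(distinct + cartesian).flatten_values().
    out = []
    for key, (left_values, right_values) in cogroup(left, right).items():
        for pair in product(distinct(left_values), right_values):
            out.append((key, pair))
    return out


pc1 = [("a", 1), ("a", 1), ("b", 2), ("c", 3)]
pc2 = [("a", "x"), ("a", "y"), ("b", "z"), ("d", "w")]

direct = join(distinct(pc1), pc2)
lowered = join_via_cogroup(pc1, pc2)
print(sorted(direct) == sorted(lowered))
```

Both forms produce the same records, but the second does the deduplication inside each cogrouped key, which is the kind of rewrite the LogicalPlan cannot discover once the `join` has been lowered.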

So, the API Plan is meant to keep all the information we can get from the user's code, and to use that information to optimize the plan.
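One way to picture such a layer is a small tree of high-level operations that is rewritten before being lowered. The following is a minimal sketch and not Bigflow code: `ApiNode`, `fuse_distinct_into_join` and the `cogroup_apply_flatten` operation name are all hypothetical, and the single rule simply encodes the `distinct().join()` equivalence from the example above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ApiNode:
    # Hypothetical API Plan node: keeps the high-level operation name.
    op: str
    inputs: Tuple["ApiNode", ...] = ()
    detail: str = ""


def source(name):
    return ApiNode("source", detail=name)


def distinct(node):
    return ApiNode("distinct", (node,))


def join(left, right):
    return ApiNode("join", (left, right))


def fuse_distinct_into_join(node):
    # Rewrite the inputs first, then try to match join(distinct(x), y).
    inputs = tuple(fuse_distinct_into_join(i) for i in node.inputs)
    node = ApiNode(node.op, inputs, node.detail)
    if node.op == "join" and inputs[0].op == "distinct":
        undeduplicated = inputs[0].inputs[0]
        return ApiNode(
            "cogroup_apply_flatten",
            (undeduplicated, inputs[1]),
            detail="p1.distinct().cartesian(p2)",
        )
    return node


def describe(node):
    # Render a node as a readable expression string.
    if node.op == "source":
        return node.detail
    args = ", ".join(describe(i) for i in node.inputs)
    suffix = " [%s]" % node.detail if node.detail else ""
    return "%s(%s)%s" % (node.op, args, suffix)


plan = join(distinct(source("pc1")), source("pc2"))
optimized = fuse_distinct_into_join(plan)
print(describe(plan))
print(describe(optimized))
```

The rule can only fire because the node still says `join` and `distinct`. After lowering to cogroup nodes and opaque Processors, there would be nothing left to match on.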