Baidu Bigflow is an interface for writing distributed computing programs that provides a set of simple, flexible, and powerful APIs. Using Bigflow, you can easily handle data of any scale. Bigflow processes 4PB+ of data inside Baidu and runs about 10k jobs every day.
We should add an optimization layer on top of the current API layer; it could be called the API Plan layer.
At the moment, the API layer transforms the user's code into a LogicalPlan directly, and some information is lost along the way. For example, the LogicalPlan does not know what a join is; it only knows that two nodes are cogrouped and that a Processor then processes the cogrouped result.
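The lowering described above can be sketched in plain Python (Bigflow's host language). Note that `cogroup` and `join` here are simplified stand-ins written for illustration, not the real Bigflow APIs: the point is that once `join` has been expanded into a cogroup plus a Processor, the plan no longer records that the user asked for a join.

```python
from collections import defaultdict

def cogroup(pc1, pc2):
    """Group two keyed datasets: key -> (values from pc1, values from pc2).
    This is the primitive the LogicalPlan actually sees."""
    groups = defaultdict(lambda: ([], []))
    for k, v in pc1:
        groups[k][0].append(v)
    for k, v in pc2:
        groups[k][1].append(v)
    return dict(groups)

def join(pc1, pc2):
    """A join is just a cogroup followed by a Processor that emits the
    cross product of each group -- after this expansion, the fact that
    it was a "join" is gone from the plan."""
    out = []
    for k, (left, right) in cogroup(pc1, pc2).items():
        for l in left:
            for r in right:
                out.append((k, (l, r)))
    return out
```

For example, `join([("a", 1), ("b", 2)], [("a", 3)])` produces `[("a", (1, 3))]`: the key `"b"` has no match on the right side, so its group emits nothing.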
For example,
pc1.distinct().join(pc2)
expands into a longer chain of cogroups and Processors, but we can't optimize it automatically without the help of the API Plan.
So the API Plan is meant to keep all the information we can get from the user's code and to optimize the plan using that information.
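A minimal sketch of what such a layer could look like, under the assumption that each API Plan node records the high-level operation the user wrote. The node class, the rule name, and the `join_input_dedup` marker are all hypothetical, chosen only to illustrate the kind of rewrite that becomes possible once the plan still knows which node is a `join` and which is a `distinct`:

```python
class PlanNode:
    """One node of a hypothetical API Plan. Unlike the LogicalPlan,
    it remembers the high-level operation name (e.g. "join")."""
    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = list(inputs)

def fuse_distinct_into_join(node):
    """Illustrative optimization rule: a distinct() feeding a join can be
    folded into the join's own shuffle (deduplicating inside the join's
    Processor), saving one full pass over the data."""
    if node.op == "join":
        node.inputs = [
            PlanNode("join_input_dedup", inp.inputs)
            if inp.op == "distinct" else inp
            for inp in node.inputs
        ]
    return node

# Plan for pc1.distinct().join(pc2): the distinct node is visible,
# so the rule can fire.
pc1 = PlanNode("load")
pc2 = PlanNode("load")
plan = fuse_distinct_into_join(
    PlanNode("join", [PlanNode("distinct", [pc1]), pc2]))
```

After the rewrite, `plan.inputs[0].op` is `"join_input_dedup"` and the standalone `distinct` node is gone; a LogicalPlan that only sees cogroups and Processors could never make this decision, because the `distinct`/`join` structure is not visible there.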