apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.52k stars 1.03k forks source link

DataFusion Optimizer framework discussion #1972

Open mingmwang opened 2 years ago

mingmwang commented 2 years ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Current DataFusion's optimizer rules are complex and not easy to implement. I think one of the reason is each optimization rule has to deal with the plan tree traversing, pattern matching and rewriting. It is more a visitor model (like the old Presto optimizer rules). And if we add new logical plan nodes, especially non-leaf plan nodes, it is very possible that all the rules need to be modified to add the related visit logic.

An alternative approach is we can have the Optimizer itself drive the plan tree traversing and apply the rules down/up the tree, it will make the rule logic much clear and can focus on pattern matching and rewriting.

I see there is a new experimental optimizer framework which leverages the egg lib. https://github.com/apache/arrow-datafusion/issues/440 I'm not familiar with egg and not sure how mature the egg lib is.

Please share your thoughts.

Describe the solution you'd like A clear and concise description of what you want to happen.

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.

mingmwang commented 2 years ago

@alamb @andygrove Please share your thoughts and insights.

alamb commented 2 years ago

Thank you for bringing this up @mingmwang. My view is that the current optimizer framework could definitely use improving 👍

t is more a visitor model (like the old Presto optimizer rules).

I think I would say the current optimizer rules "aspire" to be a visitor model; Right now the interface for an optimizer looks like this:

pub trait OptimizerRule {
    /// Rewrite `plan` to an optimized form
    fn optimize(
        &self,
        plan: &LogicalPlan,
        execution_props: &ExecutionProps,
    ) -> Result<LogicalPlan>;

    /// A human readable name for this optimizer rule
    fn name(&self) -> &str;
}

which as you say requires each optimizer rule to handle the recursion itself and intermixes the traversal from whatever rewrites are happening. There are helpers in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/optimizer/utils.rs such as optimize_children and from_plan but I would say they are not particularly ergonomic

So what I had hoped to do at some point was to make an actual visitor / mutator pattern for rewriting LogicalPlan. We have done this for Expr rewriting expr_rewriter.rs and I think it works very well to separate out the structure traversal from the actual changes (see, for example, the code for simplifying expressions here: simplify_expressions.rs

cc @Dandandan and @realno who may also have idea on this

alamb commented 2 years ago

I was imagining an actual mutator like this:

trait LogicalPlanRewriter {
   pub fn rewrite(plan: LogicalPlan) -> Result<LogicalPlan>
}

that would handle the traversal into children so that each optimizer would look something like

impl LogicalPlanRewriter for MyOptimizerPass {
  fn rewrite(plan: LogicalPlan) -> Result<LogicalPlan> {
    match plan {
      // special handling for the types of plans the optimizer cares about
      LogicalPlan::Filter(..) => {...}
      ...
      // default is no rewrite
      _ => Ok(plan)
   }
}

I vaguely remember trying to write something like this up once, but as I recall I got stuck on the fact that LogicalPlans can have shared inputs (there are Arcs ?) but I can't remember exactly and it was a while ago.

A slight variant on the above that looks more like a typical vistitor would be something that handled each type of LogicalPlan and defaults to itself and could save the switch statement:

trait LogicalPlanRewriter {
   // rewrite a LogicalPlan filter node; Default implementation returns the same plan
   pub fn rewrite_filter(filter: Filter) -> Result<LogicalPlan> {
    Ok(LogicalPlan::Filter(filter)
   } 

   // rewrite a LogicalPlan table_scan node; Default implementation returns the same plan
   pub fn rewrite_filter(table_scan: TableScan) -> Result<LogicalPlan> {
    Ok(LogicalPlan::TableScan(table_scan)
   } 

   ...
}

What do you think?

alamb commented 2 years ago

If I were going to do this, I would probably sketch out a PlanRewriter and then try to port one or two of the existing optimizers to use it (without changing the Optimizer trait at first) as a proof of concept. Then as follow on PRs I would port over the remaining optimizers

jackwener commented 2 years ago

which as you say requires each optimizer rule to handle the recursion itself and intermixes the traversal from whatever rewrites are happening. There are helpers in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/optimizer/utils.rs such as optimize_children and from_plan but I would say they are not particularly ergonomic

So what I had hoped to do at some point was to make an actual visitor / mutator pattern for rewriting LogicalPlan. We have done this for Expr rewriting expr_rewriter.rs and I think it works very well to separate out the structure traversal from the actual changes (see, for example, the code for simplifying expressions here: simplify_expressions.rs

I also found these problems when I enhance the optimizer.

It's a significant proposal, I'm focus this part currently.

I'm planning to add the volcano/cascades optimizer framework for datafusion. After I get more familiar with datafusion, I will add more detail and RFC for this part.

jackwener commented 2 years ago

But, It's a big change, we can do it in the long term.

Currently, I'm planning enhance the optimizer gradually,

Dandandan commented 2 years ago

I believe ithat we really should continue experimenting with egg as a optimization framework. This doesn't mean we should move everything to the egg framework, but we can have one or multiple egg-based passes.

Some nice things that make egg (besides written in Rust) a nice candidate for DataFusion:

realno commented 2 years ago

I definitely agree the optimizer framework needs some attention. Specifically, each rule needs to handle tree traversal which is quite tedious and also less efficient due to the repeated full tree traversal. Also it is error prone, if not careful it may introduce dependency on the order of the rules in code.

I am leaning towards deciding which framework to go with asap for the following reasons: 1. we have limited rules for now so it is small effort to change it now. 2. simplifying rule implementation can encourage more people contributing in this area. 3. this may also help benchmarking and optimization work.

I saw a few options were brought up, we can make a list for ones worth investigating.

xudong963 commented 2 years ago

I believe ithat we really should continue experimenting with egg as a optimization framework. This doesn't mean we should move everything to the egg framework, but we can have one or multiple egg-based passes.

I agree, egg is potential. Seems egg will generate DAGs than tree? @Dandandan

xudong963 commented 2 years ago

I'm planning to add the volcano/cascades optimizer framework for datafusion. After I get more familiar with datafusion, I will add more detail and RFC for this part.

Cascades is an optimization framework that uses a cost-based approach to explore possible executable of the search space. So I have some my thoughts:

  1. It requires a solid and as accurate as possible cost-based optimization.
  2. It is likely to lead to misguidance, which produces local convergence of the search and fails to produce an optimal solution.
  3. It will have a relatively large impact on the current codebase. (Of course, if it can be verified that there is a clear benefit in most scenarios, I support its introduction)
  4. I prefer to continue experimenting with egg, @pjmore has done many works https://github.com/apache/arrow-datafusion/pull/1485, fyi https://github.com/apache/arrow-datafusion/issues/440
jackwener commented 2 years ago

I,m sorry that I don,t know more about agg, I also try it to do something for datafusion

jackwener commented 2 years ago

@Dandandan hi, about agg, there is some design doc or some discussion about it? I want to know more about it in order to contirbute for enhancing our optimizer

jackwener commented 2 years ago

I'm planning to add the volcano/cascades optimizer framework for datafusion. After I get more familiar with datafusion, I will add more detail and RFC for this part.

Cascades is an optimization framework that uses a cost-based approach to explore possible executable of the search space. So I have some my thoughts:

  1. It requires a solid and as accurate as possible cost-based optimization.
  2. It is likely to lead to misguidance, which produces local convergence of the search and fails to produce an optimal solution.
  3. It will have a relatively large impact on the current codebase. (Of course, if it can be verified that there is a clear benefit in most scenarios, I support its introduction)
  4. I prefer to continue experimenting with egg, @pjmore has done many works Add experimental Tokomak optimizer #1485

I just mean that add a like top to down optimizer instead speacify optimizer

xudong963 commented 2 years ago

I'm planning to add the volcano/cascades optimizer framework for datafusion. After I get more familiar with datafusion, I will add more detail and RFC for this part.

Cascades is an optimization framework that uses a cost-based approach to explore possible executable of the search space. So I have some my thoughts:

  1. It requires a solid and as accurate as possible cost-based optimization.
  2. It is likely to lead to misguidance, which produces local convergence of the search and fails to produce an optimal solution.
  3. It will have a relatively large impact on the current codebase. (Of course, if it can be verified that there is a clear benefit in most scenarios, I support its introduction)
  4. I prefer to continue experimenting with egg, @pjmore has done many works Add experimental Tokomak optimizer #1485

I just mean that add a like top to down optimizer instead speacify optimizer

I know -- volcano/cascades optimizer framework both are top-down optimizers, my thoughts covered both.

Dandandan commented 2 years ago

I believe ithat we really should continue experimenting with egg as a optimization framework. This doesn't mean we should move everything to the egg framework, but we can have one or multiple egg-based passes.

I agree, egg is potential. Seems egg will generate DAGs than tree? @Dandandan

The current version extended by @pjmore lives here (tokomak directory): https://github.com/pjmore/arrow-datafusion/tree/tokomak-optimizer

I feel like we should move it to datafusion-contrib for better exposure and continue the work there.

alamb commented 2 years ago

Love this discussion ❤️

I really like the idea of being able to use the DataFusion optimizer logic in a general way. It would be awesome to use DataFusion with various different optimization approaches, depending on usecase.

I think if a solution can handle the existing optimizer passes it is likely to be general enough for all future optimization passes.

A required step for any of these solutions, I think, is to more carefully separate the plan traversal from the existing plan logic, so that seems like an obvious first step to take in the DataFusion code base. trying to move egg based optimization to datafusion-contrib as suggested by @Dandandan sounds like a great idea, and we can use that experience to iterate on the best interface in DataFusion

Dandandan commented 2 years ago

I pushed the latest version of @pjmore to https://github.com/datafusion-contrib/datafusion-tokomak

note that this includes the full datafusion source tree, as some changes in there are required too.

matthewmturner commented 2 years ago

I will add that repo to datafusion contrib announcement

pjmore commented 2 years ago

TLDR: egg optimizer struggles with some things which have many representations and suffers due to poor rewrite scheduling. Has some nice properties, likely able to implement cascades based framework on top of egg. Egg might be better suited to planner/physical operating like cascades does, lowering the logical plan into physical operators.

I've been researching cascades a bit recently and here are some general thoughts on optimizer stuff. This is kind of off the top of my head with regards to the egg stuff so it might be disorganized since I haven't thought about it for a bit.

  1. The current egg based optimizer works alright, but suffers heavily in certain situations due to rule scheduling issues/the number of potential queries exploding. Structured rule application like cascade does might help with that.
  2. In a related issue egg does not handle large expressions/trees well due to the large number of potential representations. This is apparent when the filter predicate in TPCH 19 which is a fairly deep and wide expression. In release mode it can take ~5 only focused on optimizing the filter to properly transform the expression from (A OR B) OR (A OR C) OR (A OR D) -> A AND (B OR C OR D). Egg does not record which rules only need to be applied once which can cause a rule like A OR B -> B OR A to be run on already processed nodes doing nothing. This is mostly an issue when rules that are going to match a lot and have to match over a large number of equivalent representations. Extremely commonly a2. In a related issue egg does not handle large expressions/trees well due to the large number of potential representations. This is apparent when the filter predicate in TPCH 19 which is a fairly deep and wide expression. In release mode it can take ~5 only focused on optimizing the filter to properly transform the expression from (A OR B) OR (A OR C) OR (A OR D) -> A AND (B OR C OR D). Egg does not record which rules only need to be applied once which can cause a rule like A OR B -> B OR A to be run on already processed nodes doing nothing. This is mostly an issue when rules that are going to match a lot and have to match over a large number of equivalent representations. Extremely commonly applied rules that are almost always positive such the boolean simplification in TPCH 19 should probably exist as specialized optimization rules as they exist in their current form as they are likely to be faster. pplied rules that are almost always positive such the boolean simplification in TPCH 19 should probably exist as specialized optimization rules as they exist in their current form as they are likely to be faster.
  3. Egg provides a lot of the properties that are desirable for a query optimizer, namely it already has a fast memo data structure and provides cost memoized cost function calculation. I'm unsure how well something like join order could would run within egg though.
  4. Egg based optimizer might be better suited to physical optimization or as a planner where table statistics can be considered.
  5. I think that the egg based optimizer and cascades actually a number of similarities, namely considering the memo datastructure that cascades uses.

    Here is an image from tikdb's article on their cascade optimizer that demonstrates general idea of a memo data structure.

memo explaination

And from egg's website, this image doesn't show up well on dark themed github so check it out on the website if you can't see it properly.

Egg's demo

The dotted lines around expression are effectively the same thing as cascade's groups. I think egg might be slightly better for this as groups are automatically merged together if they share equivalent terms, I'm not sure if this is something that is typically done in cascades based optimizer.

Another example from the CMU advanced database course. The relevant slides are around slide 34.

You can see that an egraph, which egg uses, and the memo that cascade uses have a bunch of similar properties.

I think that it is possible to build a cascade optimizer using egg as a fast memo structure. If we are going to build a more advanced optimization framework I think it makes sense to focus a bit more on the physical optimizations as nearly all of the logical optimizations could be done on the physical expressions instead, and optimizations like join order, which require table statistics/sampling, can cause order of magnitudes of difference in query performance. I think if this is a route that people are interested in pursuing this makes sense to develop a egg based planner which lowers the logical plan into a physical one, but I'd love to hear other people's thoughts on this.

alamb commented 2 years ago

One thing I think is missing in this discussion is that the cost models for plans are notoriously unreliable (especially once a plan has like 4+ joins in it) and suffer from well known, but unsolved problems like predicate correlation, data skew, and others.

So in other words, even if we had some super advanced optimization framework that could find the "lowest cost plan" it may turn out to be a very poor plan in practice (I spent many years working on an industrial query optimizer -- link below for some flavor). Real optimizers tend to have heuristics to constrain the search space and try (very) hard to avoid disaster plans / join orders.

Applying more optimizations at the physical level makes sense to me

https://ieeexplore.ieee.org/document/6816727/

pjmore commented 2 years ago

That's a great point I think that a multi-objective cost model is probably best for something like this. Possibly using histograms for costs or selectivity instead of a single number. E.g. for the classic example

SELECT * from cars where make = 'honda' and model = 'accord'

Instead of estimating selectivity with selectivity(make='honda') x selectivity(model='accord') estimating selectivity with a range of values. For example assuming a selectivity of 0.1 for both predicates the range of selectivities could be something like

   [0.1,                  0.05,                 0.01]
   perfect correlation  - partial correlation - no-correlation  

But this seems like such an obvious solution that I'm positive that there are probably a number of fundamental issues with it beyond the of the selectivity being made into a N-dimensional distribution where N is the number of predicates.

Regardless I think that a good starting point since datafusion is so general is a fairly simple cost model based of potential performance and risk that focuses on not generating disastrous plans over picking the potentially fastest one. Possibly through heavily penalizing physical operators with large variations in performance when the estimated row counts are wrong. So for the hypothetical cost model with two cost objectives

struct cost{
    cost: u64,
   risk: u64
} 

And a hypothetical choice:

NestedLoopJoin: cost{cost: 10, risk: 20}
    <Some filtered table> Estimated number of rows:  5
    <Other filtered table> Estimated number of rows:  10

HashJoin: cost{cost: 20, risk: 5}
    <Some filtered table> Estimated number of rows:  5
    <Other filtered table> Estimated number of rows:  10

So in this case since the risk of the NestedLoopJoin is so much higher it wouldn't be chosen since it has a much higher risk associated with it compared to the HashJoin.

Another good paper about the general quality of query optimizers: https://www.vldb.org/pvldb/vol9/p204-leis.pdf

The more I look at how hard writing a good query optimizer is, let alone making it extendable, the more daunting it gets.

alamb commented 2 years ago

I would love to try and avoid such issues by moving as many choices as possible to runtime. Specifically for Join implementation choice I would love to try to implement something like this

https://www.semanticscholar.org/paper/A-Generalized-Join-Algorithm-Graefe/2df1e7f01621a21061c5ef44e9cca7f0cfdbb332

I don't have a great story for join ordering (though I am working on a potential paper on the topic of more dynamic join orderings)

alamb commented 2 years ago

But this seems like such an obvious solution that I'm positive that there are probably a number of fundamental issues with it beyond the of the selectivity being made into a N-dimensional distribution where N is the number of predicates.

The typical problem here is skewed data (e.g. one value having 50% of the rows) -- and do you pick equi width or equi height histograms, which both have tradeoffs

crepererum commented 2 years ago

Egg

@pjmore great write up, thank you. I have a few notes on your observations regarding egg:

  1. It might be worth sharing them with the egg developers, they have a quite active discussion forum and seem to be quite eager to hear what people build using egg (or where they fail)
  2. I think the explosion of graph sizes can be handled somewhat by tuning the cost function, by match guards (like the ones that are also used to to prevent mis-optimization of division by zero) or by pruning (their math examples uses this trick)
  3. The (A OR B) -> (B OR A) rule you bring up is actually an interesting one. When experimenting with egg I found that these symmetry rules are really helpful to write a maintainable, short, and easy to understand rules set. Imagine the rule (A OR B) AND (A OR C) -> A OR (B AND C). This rule would only work properly if you combine it with the symmetry rule or you end up enumerating all "permutations" of the simplification rule. Now we might conclude that the symmetry rule should only be applied once or not on processed nodes, however for nesting it's quite important to be able to re-apply it to new nodes as well.
  4. I think the egg Analysis trait is extremely powerful, even a bit too complex since it can do multiple things like analysis (which then can be used by rules to prevent optimizations, see this example), complex optimization passes (e.g. constant folding) and pruning.

Multi-objective optimization

Another note multi-objective optimizations:

From experience as well as mid-level understanding of optimization in general, having multiple optimizations targets is always tricky (but depending on the use case not unsolvable).

If you really wanna keep the goals separate, then comparing solutions to your issue can only be done by strict comparison (e.g. plan A is better than B if it has lower cast AND risk). More technically, you stick to the Pareto Front.

However this often steals you quite some potential, since massive improvements in one metric might result in slight disadvantages in another metric (e.g. you might be able shave of 90% cost by increasing the risk by 5%). Now most people would intuitively decide to take the risk here. So how do you balance the two metrics? What's often done is a linear combination (like total = A * risk + B * cost, A and B being tuning parameters). This obviously has its own issue (like all metrics must "scale" in the same linear way, tuning parameters are tricky to get right).

Also see Multiple-criteria decision-making.

liurenjie1024 commented 2 years ago

Also some other problems with egg:

  1. cascades optimizers is not just enumeration of equivalent logical/physical plans, it has some techniques for branch-bounding, e.g. skip alternatives with higher cost
  2. Some algorithms may need to customize memo data structure, such as materialized view matching algorithm: https://dl.acm.org/doi/abs/10.1145/1066157.1066179
mingmwang commented 2 years ago

Nice discussion!

I see many people raised the points to build the volcano/cascades style planner for DataFusion, either on Egg or to build from scratch. To implement a cascades style planner, usually there are three parts: 1: The optimizer framework, including the memo structure which consists of group of expressions, the search and prune tasks, duplicate detection etc. 2: Rule sets consist of the transformation rules and implementation rules. 3: Cost model, stats estimation.

The whole search and prune algorithm is performed by the specific tasks. It defines five different types of optimization tasks: Group Optimization; Group Exploration; Expression Optimization; Input Optimization; Rule Application.

image

Compared to a traditional heuristic planner, a cascades style optimizer is much more complex. But the most difficult and challenge work is in the Part 3, the stats estimation. Even we can collect all the necessary stats like rows counts, cardinality and histogram for all the columns, make them correct and up to date, it is still hard to derive the correct stats from bottom leaf operators(table scan) to the top operators. There are many papers regarding stats estimation, lot of maths and magic.

And in production, the cascade style optimizer is also hard to triage and debug, because of the complexity of the search space, the unreliable stats, or just because there is some bug in the rule.implementation, it is hard to debug and figure out why the planner choose such a plan.

Because of the unreliable stats, it is hard to come up with a real optimal plan before execution. Some other engines take another approach and try to adaptively adjust the execution plan in runtime, SparkSQL's AQE is an example. Another example is Snowflake's adaptive aggregation operator placement: https://patents.justia.com/patent/20210089533 It is not that general but the adaptive execution can solve the major physical plan selection problem.

Today, some open source Query/DB engines had implemented the volcano/cascades optimizer like Apache Calcite, Greenplum Orca. They take many years to make them solid and mature. For Apache calcite, most engines just leverage Calcite as a SQL parser and heuristic planner, Hive try to leverage Calcite to implement an CBO optimizer, but I don't think it is a success story.

My opinion for DataFusion's optimizer framework is that we should continue focus on the heuristic planner approach in current phase, implement an optimizer framework like SparkSQL's catalyst optimizer, make it relatively easy to add new rules. In future, we can go with the adaptive execution approach.

alamb commented 2 years ago

My opinion for DataFusion's optimizer framework is that we should continue focus on the heuristic planner approach in current phase, implement an optimizer framework like SparkSQL's catalyst optimizer, make it relatively easy to add new rules. In future, we can go with the adaptive execution approach.

Thank you @mingmwang for the writeup. I would second your assertion that almost all successful real world (e.g. commerical) query optimizers are not implemented with a cascades like framework, but instead are some combination of heuristics and cost models.

I also think the point that cost models have unsolved error propagation issues -- my experience was that after about 2-3 joins, the output cardinality estimation is basically a guess, even with advanced statistics like histograms.

What I would like to see in DataFusion is:

  1. A solid "classic" heuristic optimizer as a default
  2. Sufficient extension points that anyone who wants to experiment / create / use a different optimizer strategy can easily do so.

In my mind this is like LLVM -- provides "state of the art" foundation and then users can customize as they need.