mingmwang opened 2 years ago
@alamb @andygrove Please share your thoughts and insights.
Thank you for bringing this up @mingmwang. My view is that the current optimizer framework could definitely use improvement 👍
> It is more of a visitor model (like the old Presto optimizer rules).
I think I would say the current optimizer rules "aspire" to be a visitor model; right now the interface for an optimizer rule looks like this:
```rust
pub trait OptimizerRule {
    /// Rewrite `plan` to an optimized form
    fn optimize(
        &self,
        plan: &LogicalPlan,
        execution_props: &ExecutionProps,
    ) -> Result<LogicalPlan>;

    /// A human readable name for this optimizer rule
    fn name(&self) -> &str;
}
```
which, as you say, requires each optimizer rule to handle the recursion itself and intermixes the traversal with whatever rewrites are happening. There are helpers in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/optimizer/utils.rs such as `optimize_children` and `from_plan`, but I would say they are not particularly ergonomic.
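For illustration, here is a hedged sketch (not verbatim DataFusion code; the exact helper signature may differ) of the boilerplate every rule ends up writing today:

```rust
// Each rule drives the recursion itself: re-invoke the optimizer on the
// children via the utils helper, then apply this rule's own rewrite.
impl OptimizerRule for MyRule {
    fn optimize(
        &self,
        plan: &LogicalPlan,
        execution_props: &ExecutionProps,
    ) -> Result<LogicalPlan> {
        // recurse into the children first
        let plan = utils::optimize_children(self, plan, execution_props)?;

        // ...then do this rule's actual pattern matching and rewriting...
        Ok(plan)
    }

    fn name(&self) -> &str {
        "my_rule"
    }
}
```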
So what I had hoped to do at some point was to make an actual visitor / mutator pattern for rewriting `LogicalPlan`. We have done this for `Expr` rewriting in expr_rewriter.rs, and I think it works very well to separate the structural traversal from the actual changes (see, for example, the code for simplifying expressions in simplify_expressions.rs).
cc @Dandandan and @realno who may also have ideas on this
I was imagining an actual mutator like this:
```rust
trait LogicalPlanRewriter {
    fn rewrite(&mut self, plan: LogicalPlan) -> Result<LogicalPlan>;
}
```
that would handle the traversal into children so that each optimizer would look something like
```rust
impl LogicalPlanRewriter for MyOptimizerPass {
    fn rewrite(&mut self, plan: LogicalPlan) -> Result<LogicalPlan> {
        match plan {
            // special handling for the types of plans the optimizer cares about
            LogicalPlan::Filter(..) => { /* ...rewrite... */ }
            // ...
            // default is no rewrite
            _ => Ok(plan),
        }
    }
}
```
I vaguely remember trying to write something like this up once, but as I recall I got stuck on the fact that `LogicalPlan`s can have shared inputs (there are `Arc`s?), but I can't remember exactly and it was a while ago.
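To make the obstacle concrete, here is a rough sketch of the copy-on-write dance that shared `Arc` inputs force on a consuming rewriter (`rewrite` is a hypothetical stand-in for the actual rewrite logic):

```rust
use std::sync::Arc;

// A rewriter that consumes its input cannot simply take ownership of a
// child behind an Arc if another plan node also holds it.
fn rewrite_input(input: Arc<LogicalPlan>) -> Result<Arc<LogicalPlan>> {
    // take ownership if we are the only holder, otherwise clone the subtree
    let owned = Arc::try_unwrap(input).unwrap_or_else(|shared| shared.as_ref().clone());
    Ok(Arc::new(rewrite(owned)?))
}
```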
A slight variant on the above that looks more like a typical visitor would be one that handles each type of `LogicalPlan`, defaults to returning the node unchanged, and saves the match statement:
```rust
trait LogicalPlanRewriter {
    // rewrite a LogicalPlan Filter node; the default implementation returns the same plan
    fn rewrite_filter(&mut self, filter: Filter) -> Result<LogicalPlan> {
        Ok(LogicalPlan::Filter(filter))
    }

    // rewrite a LogicalPlan TableScan node; the default implementation returns the same plan
    fn rewrite_table_scan(&mut self, table_scan: TableScan) -> Result<LogicalPlan> {
        Ok(LogicalPlan::TableScan(table_scan))
    }

    // ... one method per LogicalPlan variant ...
}
```
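A hypothetical pass using such a trait would override only the methods it cares about. In this sketch, `RemoveTrivialFilter`, `is_always_true`, and the `Filter` field names are assumptions for illustration, not existing DataFusion APIs:

```rust
// Hypothetical pass that removes filters whose predicate is always true;
// every other node type falls through to the default "unchanged" methods.
struct RemoveTrivialFilter;

impl LogicalPlanRewriter for RemoveTrivialFilter {
    fn rewrite_filter(&mut self, filter: Filter) -> Result<LogicalPlan> {
        if is_always_true(&filter.predicate) {
            // drop the filter node and return its input directly
            Ok(filter.input.as_ref().clone())
        } else {
            Ok(LogicalPlan::Filter(filter))
        }
    }
}
```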
What do you think?
If I were going to do this, I would probably sketch out a `PlanRewriter` and then try to port one or two of the existing optimizers to use it (without changing the Optimizer trait at first) as a proof of concept. Then, as follow-on PRs, I would port over the remaining optimizers.
> which, as you say, requires each optimizer rule to handle the recursion itself and intermixes the traversal with whatever rewrites are happening. There are helpers in https://github.com/apache/arrow-datafusion/blob/master/datafusion/src/optimizer/utils.rs such as `optimize_children` and `from_plan`, but I would say they are not particularly ergonomic.
>
> So what I had hoped to do at some point was to make an actual visitor / mutator pattern for rewriting `LogicalPlan`. We have done this for `Expr` rewriting in expr_rewriter.rs, and I think it works very well to separate the structural traversal from the actual changes (see, for example, the code for simplifying expressions in simplify_expressions.rs).
I also found these problems when I enhanced the optimizer. It's a significant proposal, and this is the part I'm currently focused on. I'm planning to add a volcano/cascades optimizer framework for datafusion; after I get more familiar with datafusion, I will add more detail and an RFC for this part. But it's a big change, so we can do it over the long term. For now, I'm planning to enhance the optimizer gradually.
I believe that we really should continue experimenting with egg as an optimization framework. This doesn't mean we should move everything to the egg framework, but we can have one or multiple egg-based passes.

Some things that make egg (besides being written in Rust) a nice candidate for DataFusion:
I definitely agree the optimizer framework needs some attention. Specifically, each rule needs to handle tree traversal, which is quite tedious and also less efficient due to the repeated full tree traversals. It is also error prone: if we are not careful, it may introduce dependencies on the order of the rules in code.

I am leaning towards deciding which framework to go with ASAP, for the following reasons: 1. we have limited rules for now, so it is a small effort to change now; 2. simplifying rule implementation can encourage more people to contribute in this area; 3. it may also help benchmarking and optimization work.

I saw a few options brought up; we can make a list of the ones worth investigating.
> I believe that we really should continue experimenting with egg as an optimization framework. This doesn't mean we should move everything to the egg framework, but we can have one or multiple egg-based passes.
I agree, egg has potential. It seems egg generates DAGs rather than trees? @Dandandan
I'm sorry that I don't know much about egg; I will also try to use it to do something for datafusion.

@Dandandan hi, about egg, is there a design doc or some discussion about it? I want to know more about it in order to contribute to enhancing our optimizer.
> I'm planning to add the volcano/cascades optimizer framework for datafusion. After I get more familiar with datafusion, I will add more detail and RFC for this part.

Cascades is an optimization framework that uses a cost-based approach to explore the search space of possible execution plans. I have some thoughts on it:

- It requires a solid and as-accurate-as-possible cost model.
- It is prone to misguidance: bad estimates can make the search converge to a local optimum and fail to produce the best plan.
- It will have a relatively large impact on the current codebase. (Of course, if a clear benefit can be demonstrated in most scenarios, I support introducing it.)
- I would prefer to continue experimenting with egg; @pjmore has already done a lot of work in Add experimental Tokomak optimizer #1485.
I just mean adding something like a top-down optimizer framework, rather than a specific optimizer.
I know -- the volcano and cascades optimizer frameworks are both top-down; my thoughts covered both.
> I agree, egg has potential. It seems egg generates DAGs rather than trees? @Dandandan
The current version extended by @pjmore lives here (tokomak directory): https://github.com/pjmore/arrow-datafusion/tree/tokomak-optimizer
I feel like we should move it to datafusion-contrib for better exposure and continue the work there.
Love this discussion ❤️
I really like the idea of being able to use the DataFusion optimizer logic in a general way. It would be awesome to use DataFusion with various different optimization approaches, depending on the use case.
I think if a solution can handle the existing optimizer passes it is likely to be general enough for all future optimization passes.
A required step for any of these solutions, I think, is to more carefully separate the plan traversal from the existing plan logic, so that seems like an obvious first step to take in the DataFusion codebase. Trying to move egg-based optimization to datafusion-contrib, as suggested by @Dandandan, sounds like a great idea, and we can use that experience to iterate on the best interface in DataFusion.
I pushed the latest version of @pjmore's work to https://github.com/datafusion-contrib/datafusion-tokomak

Note that this includes the full datafusion source tree, as some changes in there are required too.

I will add that repo to the datafusion-contrib announcement.
TL;DR: the egg optimizer struggles with things that have many representations and suffers from poor rewrite scheduling. It has some nice properties, though, and it is likely possible to implement a cascades-based framework on top of egg. Egg might be better suited to planning/physical optimization, as cascades does, lowering the logical plan into physical operators.

I've been researching cascades a bit recently, so here are some general thoughts on optimizer stuff. This is kind of off the top of my head with regards to the egg stuff, so it might be disorganized since I haven't thought about it for a bit.
I think that the egg-based optimizer and cascades actually share a number of similarities, particularly around the memo data structure that cascades uses.
Here is an image from TiDB's article on their cascades optimizer that demonstrates the general idea of a memo data structure.
And here is one from egg's website; this image doesn't show up well on dark-themed GitHub, so check it out on the website if you can't see it properly.
The dotted lines around expressions are effectively the same thing as cascades' groups. I think egg might be slightly better here, as groups are automatically merged together if they share equivalent terms; I'm not sure if that is something typically done in cascades-based optimizers.
Another example from the CMU advanced database course. The relevant slides are around slide 34.
You can see that an e-graph, which egg uses, and the memo that cascades uses have a bunch of similar properties.
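To make the parallel concrete, here is a heavily simplified, hypothetical sketch of a cascades-style memo; none of these types come from DataFusion or egg:

```rust
// Each group collects logically equivalent expressions whose children point
// at other groups, much like e-classes and e-nodes in an egraph.
type GroupId = usize;

enum Operator {
    Scan(String),
    Filter,
    Join,
}

struct MemoExpr {
    op: Operator,
    // children reference groups, not concrete sub-plans, so a rewrite of a
    // shared subtree is immediately visible to every parent expression
    children: Vec<GroupId>,
}

struct Group {
    equivalent_exprs: Vec<MemoExpr>,
}

struct Memo {
    groups: Vec<Group>,
}
```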
I think that it is possible to build a cascades optimizer using egg as a fast memo structure. If we are going to build a more advanced optimization framework, I think it makes sense to focus a bit more on the physical optimizations, since nearly all of the logical optimizations could be done on the physical expressions instead, and optimizations like join ordering, which require table statistics/sampling, can cause orders of magnitude of difference in query performance. If this is a route that people are interested in pursuing, I think it makes sense to develop an egg-based planner which lowers the logical plan into a physical one, but I'd love to hear other people's thoughts on this.
One thing I think is missing in this discussion is that cost models for plans are notoriously unreliable (especially once a plan has 4+ joins in it) and suffer from well-known but unsolved problems like predicate correlation, data skew, and others.
So in other words, even if we had some super advanced optimization framework that could find the "lowest cost plan", it may turn out to be a very poor plan in practice (I spent many years working on an industrial query optimizer -- link below for some flavor). Real optimizers tend to have heuristics to constrain the search space and try (very) hard to avoid disaster plans / join orders.
Applying more optimizations at the physical level makes sense to me
That's a great point. I think that a multi-objective cost model is probably best for something like this, possibly using histograms for costs or selectivity instead of a single number. E.g. for the classic example:
```sql
SELECT * FROM cars WHERE make = 'honda' AND model = 'accord'
```
instead of estimating the combined selectivity as selectivity(make='honda') × selectivity(model='accord'), estimate it with a range of values. For example, assuming a selectivity of 0.1 for both predicates, the range of combined selectivities could be something like [0.1, 0.05, 0.01], corresponding to perfect correlation, partial correlation, and no correlation.
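A minimal sketch of that idea (the function name and the midpoint choice for "partial correlation" are my assumptions, not anything from this thread):

```rust
// Represent the combined selectivity of two predicates as a
// [perfect, partial, independent] range instead of a single point estimate.
fn selectivity_range(s1: f64, s2: f64) -> [f64; 3] {
    let perfect = s1.min(s2); // predicates fully correlated
    let independent = s1 * s2; // classic independence assumption
    let partial = (perfect + independent) / 2.0; // one arbitrary in-between pick
    [perfect, partial, independent]
}

// selectivity_range(0.1, 0.1) ≈ [0.1, 0.055, 0.01]
```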
But this seems like such an obvious solution that I'm positive there are a number of fundamental issues with it, beyond the selectivity becoming an N-dimensional distribution where N is the number of predicates.
Regardless, I think a good starting point, since datafusion is so general, is a fairly simple cost model based on potential performance and risk that focuses on not generating disastrous plans rather than on picking the potentially fastest one, possibly by heavily penalizing physical operators with large variations in performance when the estimated row counts are wrong. So, for a hypothetical cost model with two cost objectives:
```rust
struct Cost {
    cost: u64,
    risk: u64,
}
```
And a hypothetical choice:

```text
NestedLoopJoin: Cost { cost: 10, risk: 20 }
    <Some filtered table>  Estimated number of rows: 5
    <Other filtered table> Estimated number of rows: 10

HashJoin: Cost { cost: 20, risk: 5 }
    <Some filtered table>  Estimated number of rows: 5
    <Other filtered table> Estimated number of rows: 10
```

So in this case, since the NestedLoopJoin carries a much higher risk than the HashJoin, it wouldn't be chosen.
Another good paper about the general quality of query optimizers: https://www.vldb.org/pvldb/vol9/p204-leis.pdf
The more I look at how hard it is to write a good query optimizer, let alone make it extensible, the more daunting it gets.
I would love to try to avoid such issues by moving as many choices as possible to runtime. Specifically, for the join implementation choice, I would love to try to implement something like this
I don't have a great story for join ordering (though I am working on a potential paper on the topic of more dynamic join orderings)
> But this seems like such an obvious solution that I'm positive there are a number of fundamental issues with it, beyond the selectivity becoming an N-dimensional distribution where N is the number of predicates.
The typical problem here is skewed data (e.g. one value having 50% of the rows) -- and do you pick equi-width or equi-height histograms, both of which have tradeoffs?
@pjmore great write up, thank you. I have a few notes on your observations regarding egg:
The `(A OR B) -> (B OR A)` rule you bring up is actually an interesting one. When experimenting with egg I found that these symmetry rules are really helpful for writing a maintainable, short, and easy-to-understand rule set. Imagine the rule `(A OR B) AND (A OR C) -> A OR (B AND C)`. This rule only works properly if you combine it with the symmetry rule; otherwise you end up enumerating all "permutations" of the simplification rule. Now we might conclude that the symmetry rule should only be applied once, or not on already-processed nodes; however, for nested expressions it's quite important to be able to re-apply it to new nodes as well.
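For illustration, here is a minimal, self-contained egg sketch of exactly that pair of rules; the toy boolean language and the input expression are my own, and this is not DataFusion's Tokomak code:

```rust
use egg::{rewrite as rw, *};

define_language! {
    enum Bool {
        "and" = And([Id; 2]),
        "or" = Or([Id; 2]),
        Symbol(Symbol),
    }
}

fn main() {
    let rules: Vec<Rewrite<Bool, ()>> = vec![
        // symmetry rule: (A OR B) -> (B OR A)
        rw!("commute-or"; "(or ?a ?b)" => "(or ?b ?a)"),
        // (A OR B) AND (A OR C) -> A OR (B AND C)
        rw!("distribute"; "(and (or ?a ?b) (or ?a ?c))" => "(or ?a (and ?b ?c))"),
    ];

    // only matches "distribute" after the second disjunction is commuted
    let expr: RecExpr<Bool> = "(and (or a b) (or c a))".parse().unwrap();
    let runner = Runner::default().with_expr(&expr).run(&rules);
    let (_, best) = Extractor::new(&runner.egraph, AstSize).find_best(runner.roots[0]);
    println!("{best}"); // expected: (or a (and b c))
}
```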
The `Analysis` trait is extremely powerful, even a bit too complex, since it can do multiple things: analyses (which rules can then use to prevent optimizations, see this example), complex optimization passes (e.g. constant folding), and pruning.

Another note on multi-objective optimizations:
From experience, as well as a mid-level understanding of optimization in general, having multiple optimization targets is always tricky (but depending on the use case, not unsolvable).
If you really want to keep the goals separate, then comparing solutions can only be done by strict comparison (e.g. plan A is better than plan B if it has lower cost AND lower risk). More technically, you stick to the Pareto front.
However, this often costs you quite some potential, since massive improvements in one metric might come with slight disadvantages in another (e.g. you might be able to shave off 90% of the cost by increasing the risk by 5%, and most people would intuitively take that risk). So how do you balance the two metrics? What's often done is a linear combination (like `total = A * risk + B * cost`, with `A` and `B` being tuning parameters). This obviously has its own issues (all metrics must "scale" in the same linear way, and tuning parameters are tricky to get right).
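The two comparison strategies side by side, as a hypothetical sketch (the `Cost` struct mirrors the one earlier in the thread; nothing here is from an actual implementation):

```rust
#[derive(Clone, Copy)]
struct Cost {
    cost: f64,
    risk: f64,
}

// Strict (Pareto) dominance: `a` beats `b` only if it is no worse on both
// metrics and strictly better on at least one.
fn dominates(a: Cost, b: Cost) -> bool {
    a.cost <= b.cost && a.risk <= b.risk && (a.cost < b.cost || a.risk < b.risk)
}

// Linear scalarization: collapse both metrics into one number using the
// tuning parameters A and B mentioned above.
fn total(c: Cost, a: f64, b: f64) -> f64 {
    a * c.risk + b * c.cost
}
```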
Also see Multiple-criteria decision-making.
Also some other problems with egg:
Nice discussion!
I see many people have raised the idea of building a volcano/cascades-style planner for DataFusion, either on top of egg or from scratch. To implement a cascades-style planner there are usually three parts: 1. the optimizer framework, including the memo structure (which consists of groups of expressions), the search and prune tasks, duplicate detection, etc.; 2. the rule sets, consisting of transformation rules and implementation rules; 3. the cost model and stats estimation.
The whole search-and-prune algorithm is performed by specific tasks. Cascades defines five different types of optimization task: group optimization, group exploration, expression optimization, input optimization, and rule application.
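As a hypothetical sketch, those five task types could be modeled as a plain Rust enum; the id types and payloads are my simplification, and real implementations carry extra state (required properties, cost bounds, etc.):

```rust
type GroupId = usize;
type ExprId = usize;
type RuleId = usize;

// One variant per cascades task type listed above.
enum OptimizerTask {
    OptimizeGroup(GroupId),
    ExploreGroup(GroupId),
    OptimizeExpression(ExprId),
    OptimizeInputs(ExprId),
    ApplyRule(RuleId, ExprId),
}
```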
Compared to a traditional heuristic planner, a cascades-style optimizer is much more complex, but the most difficult and challenging work is part 3, the stats estimation. Even if we could collect all the necessary stats (row counts, cardinalities, and histograms for all the columns) and keep them correct and up to date, it would still be hard to derive correct stats from the bottom leaf operators (table scans) up through the top operators. There are many papers on stats estimation, full of math and magic.
And in production, a cascades-style optimizer is also hard to triage and debug: because of the complexity of the search space, the unreliable stats, or simply a bug in a rule implementation, it is hard to figure out why the planner chose a particular plan.
Because of the unreliable stats, it is hard to come up with a truly optimal plan before execution. Some other engines take another approach and try to adaptively adjust the execution plan at runtime; SparkSQL's AQE is an example. Another example is Snowflake's adaptive aggregation operator placement: https://patents.justia.com/patent/20210089533. It is not that general, but adaptive execution can solve the major physical plan selection problems.
Today, some open source query/DB engines have implemented volcano/cascades optimizers, such as Apache Calcite and Greenplum Orca, and they took many years to become solid and mature. Most engines leverage Apache Calcite only as a SQL parser and heuristic planner; Hive tried to use Calcite to implement a cost-based optimizer, but I don't think it is a success story.
My opinion on DataFusion's optimizer framework is that we should continue to focus on the heuristic planner approach in the current phase, implement an optimizer framework like SparkSQL's Catalyst, and make it relatively easy to add new rules. In the future, we can pursue the adaptive execution approach.
Thank you @mingmwang for the writeup. I would second your assertion that almost all successful real-world (e.g. commercial) query optimizers are not implemented with a cascades-like framework, but are instead some combination of heuristics and cost models.

I also agree with the point that cost models have unsolved error-propagation issues -- my experience was that after about 2-3 joins, the output cardinality estimate is basically a guess, even with advanced statistics like histograms.
What I would like to see in DataFusion is:
In my mind this is like LLVM -- it provides a "state of the art" foundation, and then users can customize it as they need.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
DataFusion's current optimizer rules are complex and not easy to implement. I think one of the reasons is that each optimization rule has to deal with plan tree traversal, pattern matching, and rewriting; it is more of a visitor model (like the old Presto optimizer rules). And if we add new logical plan nodes, especially non-leaf plan nodes, it is very likely that all the rules will need to be modified to add the related visit logic.
An alternative approach is to have the Optimizer itself drive the plan tree traversal and apply the rules down/up the tree. That would make the rule logic much clearer and let each rule focus on pattern matching and rewriting, as in the sketch below.
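A hedged sketch of that idea; `PatternRule`, `try_rewrite`, and `with_new_inputs` are hypothetical names for illustration, not DataFusion APIs:

```rust
// The framework owns the recursion; each rule sees exactly one node.
trait PatternRule {
    fn try_rewrite(&self, plan: LogicalPlan) -> Result<LogicalPlan>;
}

fn optimize_bottom_up(rule: &dyn PatternRule, plan: LogicalPlan) -> Result<LogicalPlan> {
    // rewrite all children first...
    let new_inputs = plan
        .inputs()
        .into_iter()
        .map(|child| optimize_bottom_up(rule, child.clone()))
        .collect::<Result<Vec<_>>>()?;
    let plan = plan.with_new_inputs(new_inputs)?;

    // ...then give the rule one shot at the current node
    rule.try_rewrite(plan)
}
```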
I see there is a new experimental optimizer framework which leverages the egg lib: https://github.com/apache/arrow-datafusion/issues/440. I'm not familiar with egg and not sure how mature the egg lib is.
Please share your thoughts.