Remove Arc<LogicalPlan> from LogicalPlan, stop copying LogicalPlans

tustvold commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Related to #4627, the current representation of LogicalPlan contains Arc<LogicalPlan> at various points, whilst this does reduce the cost of copying a LogicalPlan tree, it:

Complicates rewrites by necessitating clones
Results in double-boxing - e.g. Vec<Arc<LogicalPlan>>
Permits cycles and all the excitement that would entail
Marginal overhead from additional atomics
Unidiomatic is perhaps too strong, but it is strange for a tree datastructure to have shared ownership

Describe the solution you'd like

I would like to remove the Arc, replacing with Box where necessary. Methods that currently take Arc<LogicalPlan> should be updated to take LogicalPlan.

Describe alternatives you've considered

Additional context

This likely wants to wait until we are cloning LogicalPlan less frequently

alamb commented 1 year ago

I believe @mingmwang and @jackwener have noted this in the past -- the idea is good to me

jackwener commented 1 year ago

Hope to this change😍, it is very meaningful to optimizer.

jackwener commented 1 year ago

If we can remove Arc<LogicalPlan> and use LogicalPlan, wen can use pattern-match to match subtree pattern like:

match projection-filter

LogicalPlan::Projection(Projection{LogicalPlan::Filter, ..})

like this:

mslapek commented 1 year ago

How would OptimizerRule look after the change?

Currently we have:

/// Try and rewrite `plan` to an optimized form, returning None if the plan cannot be
/// optimized by this rule.
fn try_optimize(
    &self,
    plan: &LogicalPlan,
    config: &dyn OptimizerConfig,
) -> Result<Option<LogicalPlan>>;

With the current try_optimize signature, lack of Arc<LogicalPlan> would require to clone whole non-optimised subbranches.

alamb commented 1 year ago

How would OptimizerRule look after the change?

Maybe we could change the signature to something like

enum OptimizedPlan {
  // Optimizer did not make any changes to the original input pla
  NoChange(LogicalPlan),
  /// Optimizer rewrote the original plan
  Rewritten(LogicalPlan),
}

/// Try and rewrite `plan` to an optimized form
fn try_optimize(
    &self,
    plan: &LogicalPlan,
    config: &dyn OptimizerConfig,
) -> Result<OptimizedPlan>;

tustvold commented 1 year ago

I think you would want to make try_optimize take ownership not a borrow, and then return it

alamb commented 1 year ago

I think you would want to make try_optimize take ownership not a borrow, and then return it

I think there are cases (like deciding when a fixed point is reached) where the caller wants to distinguish between no more optimization and a new plan.

However, now that LogicalPlan supports PartialEq I think maybe we could just use that to check if any changes were made

https://github.com/apache/arrow-datafusion/blob/c37ddf72ec539bd39cce0dd4ff38db2e36ddb55f/datafusion/expr/src/logical_plan/plan.rs#L52

So the signature maybe could be

/// Try and rewrite `plan` to an optimized form
fn try_optimize(
    &self,
    plan: LogicalPlan,
    config: &dyn OptimizerConfig,
) -> Result<LogicalPlan>;

🤔

mslapek commented 1 year ago

To compare a new logical plan with the old one, we need to have both plans in the memory.

Without Arc<...>, it requires to have always doubled RAM usage for plans. With Arc<...> a slightly less? 🤔

tustvold commented 1 year ago

However, now that LogicalPlan supports PartialEq I think maybe we could just use that to check if any changes were made

This is not a particularly cheap operation, involving a lot of string comparisons. I think having the return value indicate if changes have been made as in your original example makes sense to me.

My point was if we remove the Arc we need to be careful to move LogicalPlan and avoid cloning, as any clone is then a deep clone of the entire tree. We therefore need to pass in an owned value so that it can be moved into the return type

mslapek commented 1 year ago

I think having the return value indicate if changes have been made as in your original example makes sense to me.

At the first sight the idea seems to be compelling...

Nevertheless I'm quite pessimistic about the idea. 😵‍💫

The main reason are optimisations, which UNDO other optimisations!

(sorry for the long post! but it's core architecture stuff...)

Examples of cancelling optimisations

Example 1. Commutative optimisations

Some operations are commutative, like projection or limit - it is used in both PushDownProjection and PushDownLimit - but in a different directions (grep commute in these files).

So for some plans they will always give Some(new_plan). 😕

Example 2. Undoing inside of an optimisation

Inside PushDownLimit we have:

Limit A -> Projection -> Limit B
Added a limit after the projection: Limit A -> Projection -> Limit C -> Limit B
Merge of the Limit C and Limit B in ANOTHER invocation of try_optimize: Limit A -> Projection -> Limit B

We have many invocations of try_optimize due to apply_order option.

The point is that for a fixed point (merge of Limit C + Limit B is Limit B) PushDownLimit will always yield Some(...).

Theoretical pain

Let's look at the issue formally... 😎

Consider small changes Δ_1, Δ_2, ... Δ_n - some signed integers.

The total change is:

Δ_all = Δ_1 + Δ_2 + ... + Δ_n

So Δ_all ≠ 0 implies Δ_i ≠ 0 for some i.

The premise of Option<Plan> is that the implication holds in the another direction - which is NOT the case!

Just give Δ_1 = 1, Δ_2 = -1, Δ_3 = 0... (Δ_2 cancels changes from Δ_1!)

Alternatives?

Let's look again at the implication - how can we use it?

Δ_all ≠ 0 implies Δ_i ≠ 0 for some i.

I would consider to use some stats about the tree - as a kind of heuristics. For example: number of nodes in the tree.

Each optimization could return (LogicalPlan, Δ of number of nodes). If sum of all Δ of number of nodes is nonzero - then we know the tree has changed.

I guess there could be other useful stats, like ∑ (node_type * node depth).

alamb commented 1 year ago

https://github.com/apache/arrow-datafusion/pull/5623 from @mslapek is a nice step towards this goal, I think

sadboy commented 7 months ago

To add a counter point to this change -- without sub-tree sharing, and in the presence of CTEs, LogicalPlan trees would be exponential in the size of the SQL query:

WITH
   A as (select 1 x),
   B as (select * from A, A)
   C as (select * from B, B)
   D as (select * from C, C)
select * from D;

tustvold commented 7 months ago

without sub-tree sharing

Is this something that is actually practicable? I would have thought the optimizer would simply ruin any effort to do this?

sadboy commented 7 months ago

I'm not familiar with the current optimizer implementation details, but this is a problem that manifests way before the optimizer comes into play -- if we take away sub-tree sharing in LogicalPlan, then the SQL compiler would be forced to generate exponential trees right from the start. Whereas in the current setup, (properly) generated LP trees would always be linear in the size of the input query, and if it blows up in some later optimizer stage, I assume it shouldn't be too hard to optimize the optimizer.

tustvold commented 7 months ago

Is this a problem? Is the memory usage of the plan representation a concern? This feels like a relatively niche optimisation for plans with repeated CTEs, that may perhaps be a touch premature? I would be very surprised if the optimizer won't blow this away when it rewrites the plans anyway.

sadboy commented 7 months ago

Yes this is a very real problem. We see this kind of pattern in production warehouse queries fairly often. They're usually the result of some automated query composition, and can get quite big by themselves. Tacking on an exponential factor on top means the system will be completely unusable (i.e. upwards of an hour just to compile one query, without even invoking the optimizer).

It's not just about the memory footprint -- if your datastructure itself is exponential then that's basically your lower bound for performance, as a simple operation like clone() would take exponential time. In general, exponential blow ups in production systems are deal breakers IMO, and removing them should not be considered premature optimization.

All that is to say, if you plan to remove Arc<LogicalPlan> (which I'm neutral), then you'll have to replace it with some other mechanism for common subtree sharing.

tustvold commented 7 months ago

:+1: I agree we should not regress any extant functionality in this space. That being said Arc is probably a poor way to go about sub-tree sharing, if it is used for this at all, as shared mutation is not possible. Some sort of mutable interner would likely be a better approach, and would facilitate optimising the given plan only once, as opposed to for every appearance

alamb commented 7 months ago

if we take away sub-tree sharing in LogicalPlan, then the SQL compiler would be forced to generate exponential trees right from the start. Whereas in the current setup, (properly) generated LP trees would always be linear in the size of the input query, and if it blows up in some later optimizer stage, I assume it shouldn't be too hard to optimize the optimizer.

For what it is worth, @mustafasrepo and I are working on something similar https://github.com/apache/arrow-datafusion/issues/8582 (in the physical plans now, any CTEs used more than once will be expanded out and the results not shared)

jayzhan211 commented 3 weeks ago

Note about clone to remove

ScalarSubqueryToJoin

https://github.com/apache/datafusion/blob/e65c3e919855c9977cf4d80c0630ee26b7fd03ee/datafusion/optimizer/src/scalar_subquery_to_join.rs#L153

apache / datafusion