Keep track of common sub-expression across logical plan nodes

mustafasrepo commented 8 months ago

Is your feature request related to a problem or challenge?

No response

Describe the solution you'd like

Currently, common CommonSubexprEliminate LogicalPlan optimizer rule analyzes common sub-expressions in a query. Then caches, common sub-expression by adding a LogicalPlan::Projection if it thinks this is beneficial. As an example, following query

SELECT c3+c4, SUM(c3+c4) OVER(order by c3+c4)
FROM t

generates following LogicalPlan:

Projection: t.c3 + t.c4, SUM(t.c3 + t.c4) ORDER BY [t.c3 + t.c4 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
--WindowAggr: windowExpr=[[SUM(CAST(t.c3 + t.c4t.c4t.c3 AS t.c3 + t.c4 AS Int64)) ORDER BY [t.c3 + t.c4t.c4t.c3 AS t.c3 + t.c4 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW AS SUM(t.c3 + t.c4) ORDER BY [t.c3 + t.c4 ASC NULLS LAST] RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW]]
----Projection: t.c3 + t.c4 AS t.c3 + t.c4t.c4t.c3, t.c3, t.c4
------TableScan: t projection=[c3, c4]

where t.c3+t.c4 is calculated once in the Projection then referred by subsequent WindowAggr as a column.

However, following query:

SELECT c3+c4, SUM(c3+c4) OVER()
FROM t

generates following LogicalPlan:

Projection: t.c3 + t.c4, SUM(t.c3 + t.c4) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
--WindowAggr: windowExpr=[[SUM(CAST(t.c3 + t.c4 AS Int64)) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]]
----TableScan: t projection=[c3, c4]

instead we could generate following plan:

Projection: col(t.c3 + t.c4), SUM(t.c3 + t.c4) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
--WindowAggr: windowExpr=[[SUM(CAST(col(t.c3 + t.c4) AS t.c3 + t.c4 AS Int64)) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]]
----Projection: t.c3 + t.c4 AS col(t.c3 + t.c4)
------TableScan: t projection=[c3, c4]

If were to keep track of common sub expression counts globally across different nodes in the LogicalPlan. This will enable us to generate better LogicalPlans.

Describe alternatives you've considered

No response

Additional context

No response

Lordworms commented 8 months ago

I can do this one

mustafasrepo commented 8 months ago

One can use following query to generate table t

CREATE EXTERNAL TABLE t (
  c1  VARCHAR NOT NULL,
  c2  TINYINT NOT NULL,
  c3  SMALLINT NOT NULL,
  c4  SMALLINT,
  c5  INT,
  c6  BIGINT NOT NULL,
  c7  SMALLINT NOT NULL,
  c8  INT NOT NULL,
  c9  BIGINT UNSIGNED NOT NULL,
  c10 VARCHAR NOT NULL,
  c11 FLOAT NOT NULL,
  c12 DOUBLE NOT NULL,
  c13 VARCHAR NOT NULL
)
STORED AS CSV
WITH HEADER ROW
LOCATION '../../testing/data/csv/aggregate_test_100.csv'

Lordworms commented 8 months ago

Hey @mustafasrepo, the reason that the first plan added a new Projection is that in the Rewriter it would mark the c3+c4 twice so that it judges the expressions needed to add an extra Projection layer. However, here I got two Problem and wish you could give me an answer. Currently, it seems like I have two ways to implement this feature

I can directly go through all the expressions of a single plan and if I find a BinaryOp then I just add another Projection upon the current plan with sub query (easy solution)
The second one is that we should track a DAG over the expression and if it is referenced in another plan, we add an extra projection. (which I have no idea how to properly trace them in treenode recursion), I don't want to do another recursion. Which one do you think is better?

mustafasrepo commented 8 months ago

Hey @mustafasrepo, the reason that the first plan added a new Projection is that in the Rewriter it would mark the c3+c4 twice so that it judges the expressions needed to add an extra Projection layer. However, here I got two Problem and wish you could give me an answer. Currently, it seems like I have two ways to implement this feature

I can directly go through all the expressions of a single plan and if I find a BinaryOp then I just add another Projection upon the current plan with sub query (easy solution)

The second one is that we should track a DAG over the expression and if it is referenced in another plan, we add an extra projection. (which I have no idea how to properly trace them in treenode recursion), I don't want to do another recursion. Which one do you think is better?

The current approach is we use projection to calculate a complex expression if it is used at least twice (Otherwise projection deemed unnecessary). Hence, first option wouldn't work in this case. The other approach may work, however, it may place the projection in a sub-optimal spot. As an example, consider following plan,

Projection(a+b)
--Filter (a+b=0),
----Sort(a+b ASC),
------TableScan(a,b)

with second approach you might produce plan below (still better than current behaviour. However, sub-optimal)

Projection(`a+b`)
--Filter (`a+b`=0),
----Projection (a+b as `a+b`)
------Sort(a+b ASC),
--------TableScan(a,b)

where I used `a+b` to distinguish it from binary expression a+b. However, instead we could have generated following plan

Projection(`a+b`)
--Filter (`a+b`=0),
----Sort(`a+b` ASC),
------Projection (a+b as `a+b`)
--------TableScan(a,b)

Hence, I think best approach is to traverse plan from top to bottom and keeping the cumulative complex expression counts in the plan. For plan below

Projection(a+b)
--Filter (a+b=0),
----Sort(a+b ASC),
------TableScan(a,b)

This would produce

Projection(a+b), ("a+b", count: 1)
--Filter (a+b=0), ("a+b", count: 2)
----Sort(a+b ASC),  ("a+b", count: 2)
------TableScan(a,b)

Then after constructing, above tree. With a bottom-up traversal we can generate following plan

Projection(`a+b`)
--Filter (`a+b`=0),
----Sort(`a+b` ASC),
------Projection (a+b as `a+b`)
--------TableScan(a,b)

by implacing projections to calculate common expression that are used more than once by subsequent stages. However, I presume this would involve a lot of work.

Lordworms commented 8 months ago

Hey @mustafasrepo, the reason that the first plan added a new Projection is that in the Rewriter it would mark the c3+c4 twice so that it judges the expressions needed to add an extra Projection layer. However, here I got two Problem and wish you could give me an answer. Currently, it seems like I have two ways to implement this feature

I can directly go through all the expressions of a single plan and if I find a BinaryOp then I just add another Projection upon the current plan with sub query (easy solution)

The second one is that we should track a DAG over the expression and if it is referenced in another plan, we add an extra projection. (which I have no idea how to properly trace them in treenode recursion), I don't want to do another recursion. Which one do you think is better?

The current approach is we use projection to calculate a complex expression if it is used at least twice (Otherwise projection deemed unnecessary). Hence, first option wouldn't work in this case. The other approach may work, however, it may place the projection in a sub-optimal spot. As an example, consider following plan,
Projection(a+b)
--Filter (a+b=0),
----Sort(a+b ASC),
------TableScan(a,b)
with second approach you might produce plan below (still better than current behaviour. However, sub-optimal)
Projection(`a+b`)
--Filter (`a+b`=0),
----Projection (a+b as `a+b`)
------Sort(a+b ASC),
--------TableScan(a,b)
where I used a+b to distinguish it from binary expression a+b. However, instead we could have generated following plan
Projection(`a+b`)
--Filter (`a+b`=0),
----Sort(`a+b` ASC),
------Projection (a+b as `a+b`)
--------TableScan(a,b)
Hence, I think best approach is to traverse plan from top to bottom and keeping the cumulative complex expression counts in the plan. For plan below
Projection(a+b)
--Filter (a+b=0),
----Sort(a+b ASC),
------TableScan(a,b)
This would produce
Projection(a+b), ("a+b", count: 1)
--Filter (a+b=0), ("a+b", count: 2)
----Sort(a+b ASC),  ("a+b", count: 2)
------TableScan(a,b)
Then after constructing, above tree. With a bottom-up traversal we can generate following plan
Projection(`a+b`)
--Filter (`a+b`=0),
----Sort(`a+b` ASC),
------Projection (a+b as `a+b`)
--------TableScan(a,b)
by implacing projections to calculate common expression that are used more than once by subsequent stages. However, I presume this would involve a lot of work.

I got it, Thanks for your solutions. I plan to implement this today.

apache / datafusion