Open mustafasrepo opened 8 months ago
I can do this one
One can use the following query to generate table t:
CREATE EXTERNAL TABLE t (
c1 VARCHAR NOT NULL,
c2 TINYINT NOT NULL,
c3 SMALLINT NOT NULL,
c4 SMALLINT,
c5 INT,
c6 BIGINT NOT NULL,
c7 SMALLINT NOT NULL,
c8 INT NOT NULL,
c9 BIGINT UNSIGNED NOT NULL,
c10 VARCHAR NOT NULL,
c11 FLOAT NOT NULL,
c12 DOUBLE NOT NULL,
c13 VARCHAR NOT NULL
)
STORED AS CSV
WITH HEADER ROW
LOCATION '../../testing/data/csv/aggregate_test_100.csv'
Hey @mustafasrepo, the reason the first plan adds a new Projection is that the rewriter marks c3+c4 twice, so it decides the expression needs an extra Projection layer. However, I ran into a problem here and hope you can give me an answer. Currently, it seems I have two ways to implement this feature:
- I can go directly through all the expressions of a single plan and, if I find a BinaryOp, add another Projection on top of the current plan with a subquery (the easy solution).
- Alternatively, we track a DAG over the expressions and, if an expression is referenced in another plan node, add an extra projection. I have no idea how to properly trace this in the TreeNode recursion, and I don't want to do another recursion.
Which one do you think is better?
The current approach uses a projection to compute a complex expression only if it is used at least twice (otherwise the projection is deemed unnecessary). Hence, the first option wouldn't work in this case. The other approach may work; however, it may place the projection in a sub-optimal spot. As an example, consider the following plan:
Projection(a+b)
--Filter (a+b=0),
----Sort(a+b ASC),
------TableScan(a,b)
With the second approach you might produce the plan below (still better than the current behaviour, but sub-optimal):
Projection(`a+b`)
--Filter (`a+b`=0),
----Projection (a+b as `a+b`)
------Sort(a+b ASC),
--------TableScan(a,b)
where I used `a+b` to distinguish the column from the binary expression a+b.
However, we could instead have generated the following plan:
Projection(`a+b`)
--Filter (`a+b`=0),
----Sort(`a+b` ASC),
------Projection (a+b as `a+b`)
--------TableScan(a,b)
Hence, I think the best approach is to traverse the plan from top to bottom, keeping cumulative complex-expression counts in the plan. For the plan below
Projection(a+b)
--Filter (a+b=0),
----Sort(a+b ASC),
------TableScan(a,b)
This would produce
Projection(a+b), ("a+b", count: 1)
--Filter (a+b=0), ("a+b", count: 2)
----Sort(a+b ASC), ("a+b", count: 3)
------TableScan(a,b)
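The top-down counting pass can be sketched in a few lines. This is only a toy model, not DataFusion code: `Node` and `annotate_counts` are hypothetical names, plans are assumed to be a single chain of nodes, and expressions are compared as plain strings. Every use is counted strictly cumulatively from the root downwards.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Toy plan node (hypothetical): a name, the complex expressions it uses, one input."""
    name: str
    exprs: list = field(default_factory=list)
    child: Optional["Node"] = None


def annotate_counts(root):
    """Walk from the root towards the scan and return (node name, cumulative counts) pairs."""
    seen = Counter()
    annotated = []
    node = root
    while node is not None:
        seen.update(node.exprs)                    # add this node's uses to the running total
        annotated.append((node.name, dict(seen)))  # snapshot of the counts at this depth
        node = node.child
    return annotated


plan = Node("Projection", ["a+b"],
            Node("Filter", ["a+b"],
                 Node("Sort", ["a+b"],
                      Node("TableScan"))))

for name, counts in annotate_counts(plan):
    print(name, counts)
```

The deepest node that still uses an expression ends up holding the total number of uses above and including itself, which is the information a later bottom-up pass needs.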
Then, after constructing the tree above, a bottom-up traversal can generate the following plan
Projection(`a+b`)
--Filter (`a+b`=0),
----Sort(`a+b` ASC),
------Projection (a+b as `a+b`)
--------TableScan(a,b)
by placing projections that compute the common expressions used more than once by subsequent stages. However, I presume this would involve a lot of work.
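To make the two passes concrete, here is a minimal sketch under strong assumptions: the plan is a simple chain given as a root-to-leaf list of `(name, expressions)` pairs, expressions are matched as strings, and a column produced by the new projection is assumed to pass through any intermediate node unchanged. `place_projections` is a hypothetical helper, not part of DataFusion.

```python
from collections import Counter


def place_projections(plan):
    """plan: list of (node name, [complex expressions]) ordered from root to leaf.

    Returns a new plan in the same order in which every expression used more than
    once is computed in a Projection directly below its deepest user.
    """
    # Pass 1 (top-down): cumulative use counts at each node.
    seen = Counter()
    cumulative = []
    for _, exprs in plan:
        seen.update(exprs)
        cumulative.append(dict(seen))

    # Pass 2 (bottom-up): the first user met from the leaf side is the deepest one.
    materialized = set()
    rewritten = []  # built leaf-first, reversed before returning
    for (name, exprs), counts in zip(reversed(plan), reversed(cumulative)):
        new = list(dict.fromkeys(
            e for e in exprs if counts[e] > 1 and e not in materialized))
        if new:
            # Compute the shared expressions once, just below this node.
            rewritten.append(("Projection", [f"{e} as `{e}`" for e in new]))
            materialized.update(new)
        # Nodes above a materialized expression refer to it as a column.
        rewritten.append(
            (name, [f"`{e}`" if e in materialized else e for e in exprs]))
    return list(reversed(rewritten))


plan = [("Projection", ["a+b"]), ("Filter", ["a+b"]),
        ("Sort", ["a+b"]), ("TableScan", [])]
optimized = place_projections(plan)
for depth, (name, exprs) in enumerate(optimized):
    print("--" * depth + f"{name}({', '.join(exprs)})")
```

An expression that appears only once keeps a count of 1 at its single user, so no projection is added for it, which matches the "used at least twice" rule described earlier in the thread.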
I got it, thanks for your suggestions. I plan to implement this today.
Is your feature request related to a problem or challenge?
No response
Describe the solution you'd like
Currently, the `CommonSubexprEliminate` `LogicalPlan` optimizer rule analyzes common sub-expressions in a query, then caches a common sub-expression by adding a `LogicalPlan::Projection` if it thinks this is beneficial. As an example, the following query
generates the following `LogicalPlan`:
where `t.c3+t.c4` is calculated once in the `Projection` and then referred to by the subsequent `WindowAggr` as a column.
However, the following query:
generates the following `LogicalPlan`:
instead we could generate the following plan:
If we were to keep track of common sub-expression counts globally across different nodes in the `LogicalPlan`, this would enable us to generate better `LogicalPlan`s.
Describe alternatives you've considered
No response
Additional context
No response