Closed peter-toth closed 1 week ago
cc @alamb, @erratic-pattern and @MohamedAbdeen21 as this PR is related to your recent comments/PRs.
Possibly related https://github.com/apache/datafusion/pull/10333
This seems to be a major conflict with #10333
The source of the issue is that an identifier needs to represent an expression subtree and the newly chosen "stringified expr" as identifier doesn't seem to fulfill that purpose. E.g. an identifier shouldn't belong to 2 different expressions
I agree that an identifier shouldn't belong to 2 different expressions, but why does it have to represent a subtree? The expr IS the subtree itself. If we use an identifier like #{expr} that should be good enough.
And actually it turned out that the reasons are already explained in the comments
AFAIK, we only need the identifier to be unique (no collision) for correctness, I don't see why we require the other traits.
Edit:
Just took a look at the tests, we are basically trying to do the same thing, although your PR is probably more efficient. We only differ on the subtree part I mentioned above. I tried to do #{expr}, you're trying {expr|subtree}, which I think is unnecessary and unreadable. I would even argue that something like #1 is enough.
I will review this one carefully tomorrow morning.
cc @waynexia and @wiedld for your information
The source of the issue is that an identifier needs to represent an expression subtree and the newly chosen "stringified expr" as identifier doesn't seem to fulfill that purpose. E.g. an identifier shouldn't belong to 2 different expressions
I agree that an identifier shouldn't belong to 2 different expressions, but why does it have to represent a subtree? The expr IS the subtree itself. If we use an identifier like #{expr} that should be good enough.
In the first traversal we need to count how many times we encounter each Expr (expression subtree). We either need to use Expr as the key of the map that stores the counts, or use an identifier that uniquely identifies an Expr.
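To make the counting step concrete, here is a minimal sketch of the first option (using the expression itself as the map key). The Expr enum, col and count_subtrees are hypothetical stand-ins written for this illustration, not DataFusion's types or the rule's actual code.

```rust
use std::collections::HashMap;

// Toy expression tree; a stand-in for DataFusion's `Expr`, not the real type.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
enum Expr {
    Column(String),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
}

fn col(name: &str) -> Expr {
    Expr::Column(name.to_string())
}

// First traversal: visit the children, then count the current subtree.
// Keying the map by `&Expr` is correct, but every insert hashes the whole
// subtree again, which is the cost the identifiers are meant to avoid.
fn count_subtrees<'a>(expr: &'a Expr, counts: &mut HashMap<&'a Expr, usize>) {
    if let Expr::Add(l, r) | Expr::Mul(l, r) = expr {
        count_subtrees(l, counts);
        count_subtrees(r, counts);
    }
    *counts.entry(expr).or_insert(0) += 1;
}

// (a + b) * (a + b): returns how often the subtree `a + b` was seen.
fn example_count() -> usize {
    let a_plus_b = Expr::Add(Box::new(col("a")), Box::new(col("b")));
    let expr = Expr::Mul(Box::new(a_plus_b.clone()), Box::new(a_plus_b.clone()));
    let mut counts = HashMap::new();
    count_subtrees(&expr, &mut counts);
    let n = counts.get(&a_plus_b).copied().unwrap_or(0);
    n
}
```

In this toy model example_count() returns 2, so a + b would be a candidate for extraction. The derived Hash walks the full subtree on every lookup, which is why a cheaper identifier is attractive.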
Edit:
Just took a look at the tests, we are basically trying to do the same thing, although your PR is probably more efficient. We only differ on the subtree part I mentioned above. I tried to do #{expr}, you're trying {expr|subtree}; which I think is unnecessary and unreadable. I would even argue that something like #1 is enough.
The issue is not about the aliases we assign to the extracted common expressions; it is about the key of the map where we store the counts. Your PR can be a good improvement to the aliases after this PR, but we need to fix the key of the map first.
Ok yeah, I see what you mean.
But the example you mention with a + b. Doesn't that go away if we fix the side note you mentioned? col("a + b") should be interpreted as table."a + b" and not test.a + test.b (and vice versa for the SQL example), meaning that expr would never collide in the map, right or am I missing something?
But the example you mention with a + b. Doesn't that go away if you fix the side note you mentioned? col("a + b") should be interpreted as table."a + b" and not test.a + test.b, meaning that expr would never collide in the map, right or am I missing something?
There are multiple questions here that I don't have the answers for.
Expr can be fixed, using Strings as the key of the map is not a good choice, as building a string representation will very likely require traversing the whole expression. Consider that we need to get an identifier for an expression, and for all its subtrees, and for the subtrees' subtrees... So a good identifier would be something that we can build up incrementally from the identifiers of the subtrees (in the first traversal's bottom-up phase). I think the best we can do now is to revert https://github.com/apache/datafusion/pull/9871 and return to the old chained string representation. (This PR improves the identifier readability a bit by adding {} around it and | as separator of elements.) And then in a follow-up PR replace the string representation with an alternative identifier like the one I mentioned in the PR description.
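The idea of building an identifier incrementally from the subtrees' identifiers can be sketched as follows. This is only an illustration under assumptions: the toy Expr type, the build_ids helper and the exact {op|left|right} layout are made up here and are not the format the rule actually emits.

```rust
// Toy expression tree (hypothetical; not DataFusion's `Expr`).
enum Expr {
    Column(String),
    Binary { op: char, left: Box<Expr>, right: Box<Expr> },
}

// Bottom-up phase: the identifier of a node is assembled from the already
// computed identifiers of its children, so no subtree is visited twice.
// Every identifier is also appended to `ids` in post-order.
fn build_ids(expr: &Expr, ids: &mut Vec<String>) -> String {
    let id = match expr {
        // Assumed layout: `{name}` for a column.
        Expr::Column(name) => format!("{{{}}}", name),
        // Assumed layout: `{op|left_id|right_id}` for a binary node.
        Expr::Binary { op, left, right } => {
            let l = build_ids(left, ids);
            let r = build_ids(right, ids);
            format!("{{{}|{}|{}}}", op, l, r)
        }
    };
    ids.push(id.clone());
    id
}
```

For (a + b) * c this yields the root identifier {*|{+|{a}|{b}}|{c}}. Each child string is still copied into its parent, so the identifiers grow with the tree, which is the reason a non-string identifier is proposed as the follow-up.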
Although I'd like to find answers to these questions before giving more opinions, I don't mind merging this for now.
I also filed #10413 to track the bug you found (🦅 👁️ ). However, this PR doesn't seem to fix it yet 🤔 . I pushed a test to show this and also tried it manually:
No, this PR doesn't fix that issue at all. That issue is a resolution issue (https://github.com/apache/datafusion/issues/10413) and has nothing to do with CSE. The example I gave in the description doesn't contain any subexpressions to eliminate and CommonSubexprEliminate has no effect on the query plan.
I mentioned the resolution issue because it prevented me from adding a test case to this PR that would illustrate the issue of conflicting identifiers in CommonSubexprEliminate after https://github.com/apache/datafusion/pull/9871.
Once https://github.com/apache/datafusion/issues/10413 is solved I can add a test case here.
I believe @MohamedAbdeen21 used #{expr} in https://github.com/apache/datafusion/pull/10333 to follow what is done by DuckDB -- perhaps we could do so too in this PR (I also think #{} is slightly easier to notice visually than {})
I fully agree that the current alias is very hard to read, and this is because the identifiers are used for aliases as well. But there are 2 different things here:
1. The key of the ExprStats map that stores the counts.
2. The alias of the extracted common expressions.
Currently for both 1. and 2. we use the identifier and I'm sure that in 1. we have to use the identifier. In 2. I'm not sure and @MohamedAbdeen21's PR can be a good follow-up improvement.
No, this PR doesn't fix that issue at all. That issue is a resolution issue (https://github.com/apache/datafusion/issues/10413) and has nothing to do with CSE. The example I gave in the description doesn't contain any subexpressions to eliminate and CommonSubexprEliminate has no effect on the query plan.
Sorry -- I missed that -- updated https://github.com/apache/datafusion/issues/10413 to match
Thanks for the benchmarks @alamb! Maybe the longer identifiers can explain that gap.
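@alamb, IMO if this PR can be merged then the next steps should be:
1. Fix https://github.com/apache/datafusion/issues/10413 as that is a correctness bug.
2. Continue the #10067 efforts to refactor the CommonSubexprEliminate rule to rewrite as that could improve a lot on https://github.com/apache/datafusion/issues/9873.
3. Rebase @MohamedAbdeen21's https://github.com/apache/datafusion/pull/10333 on top of this PR, as we probably don't need to use the current string identifiers in aliases and we could improve readability.
4. Revisit the identifiers, as using these string identifiers as the keys of ExprStats was not the best choice. (Please note that this was not my choice but this is how CSE has been working since the feature was added initially.) See my comment on this in the PR description.
I'm happy to take 4 as I already worked on it a bit, but unfortunately I have very little time to work on this project lately so I can't take 1. and 2.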
I'll rebase my PR this weekend.
I do have other changes in mind regarding plan readability. If 1 is still available by the time I'm done, I'll be happy to take a look at it.
@alamb, IMO if this PR can be merged then the next steps should be:
- Fix "Incorrect results with expression resolution" #10413 as that is a correctness bug.
Agree -- this is now tracked as its own issue and we can deal with it separately
- Continue the "Rewrite CommonSubexprEliminate to avoid copies using TreeNode" #10067 efforts to refactor the CommonSubexprEliminate rule to rewrite as that could improve a lot on "Stop copying Exprs and LogicalPlans so much during Common Subexpression Elimination" #9873.
I will do this
- Rebase @MohamedAbdeen21's "make common expression alias human-readable" #10333 on top of this PR, as we probably don't need to use the current string identifiers in aliases and we could improve readability.
Sounds like @MohamedAbdeen21 is going to do this maybe this weekend
- Revisit the identifiers, as using these string identifiers as the keys of ExprStats was not the best choice. (Please note that this was not my choice but this is how CSE has been working since the feature was added initially.) See my comment on this in the PR description.
👍
I'm happy to take 4 as I already worked on it a bit, but unfortunately I have very little time to work on this project lately so I can't take 1. and 2.
That would be amazing -- thank you @peter-toth -- I filed https://github.com/apache/datafusion/issues/10426 to track
All right, I think we have our next steps outlined and tracked with tickets. 🚀 !
Thanks again @peter-toth and @MohamedAbdeen21
Thanks for the review!
Which issue does this PR close?
Part of https://github.com/apache/datafusion/issues/9873.
Rationale for this change
This PR started as part of https://github.com/apache/datafusion/issues/9873 to reduce the number of Expr clones, but after some investigation it shifted to being a fix for the rule's correctness issues.
The current CommonSubexprEliminate was refactored in https://github.com/apache/datafusion/pull/9871 to remove the IdArray cache and simplify the identifier of expressions. Unfortunately that change doesn't seem to be correct. The source of the issue is that an identifier needs to represent an expression subtree and the newly chosen "stringified expr" as identifier doesn't seem to fulfill that purpose. E.g. an identifier shouldn't belong to 2 different expressions.
Sidenote: Actually I wanted to show that correctness issue of the current CommonSubexprEliminate in a test, but when I wrote a test with colliding column names I ran into a different issue: DataFusion resolves the col("a") + col("b") expression as if it was col("a + b") if an a + b field exists in the schema. This is a different issue (not related to CommonSubexprEliminate at all) and can be easily reproduced.
So in the first commit of this PR I revert https://github.com/apache/datafusion/pull/9871.
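The identifier collision described in the rationale can be illustrated with a toy model. Everything below is an assumption made for the sketch: the Expr enum, its Display output, and the stringified_id and structural_id helpers are hypothetical and only mimic the idea of a "stringified expr" identifier; they are not DataFusion's implementation.

```rust
use std::fmt;

// Toy expression tree (hypothetical; not DataFusion's `Expr`).
enum Expr {
    Column(String),
    Add(Box<Expr>, Box<Expr>),
}

// Assumed display form: column names are printed without quoting.
impl fmt::Display for Expr {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Expr::Column(name) => write!(f, "{}", name),
            Expr::Add(l, r) => write!(f, "{} + {}", l, r),
        }
    }
}

// "Stringified expr" identifier: just the display form of the expression.
fn stringified_id(expr: &Expr) -> String {
    expr.to_string()
}

// Identifier that also encodes the tree structure, so a single column
// named "a + b" and the sum of columns "a" and "b" stay distinct.
fn structural_id(expr: &Expr) -> String {
    match expr {
        Expr::Column(name) => format!("{{col|{}}}", name),
        Expr::Add(l, r) => format!("{{add|{}|{}}}", structural_id(l), structural_id(r)),
    }
}
```

With these definitions, a column named a + b and the expression a + b both get the stringified identifier a + b, so a count map keyed by it would merge two different expressions. Their structural identifiers, {col|a + b} and {add|{col|a}|{col|b}}, differ.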
Then I investigated what the actual purpose of Identifiers is: why don't we use a simple HashMap<Expr, (usize, DataType, Identifier)> as ExprSet? It is clear that we need to generate a unique alias for the extracted common expressions, but why is the key of the map an Identifier and not &Expr or Expr itself? And actually it turned out that the reasons are already explained in the comments.
If we used Expr as the key of the map, computing the hash() of the keys would require traversing the whole Expr, which can be very costly as Exprs contain indirections to subexpressions as Box<Expr> or Vec<Expr>.
Using special identifiers to represent Expr trees and caching those identifiers by the preorder visit indexes in IdArray should significantly speed up the second top-down traversal that does the actual expression rewrite.
Sidenote: the current long String identifiers are also not a good choice. We need to revisit this in a follow-up PR and choose something like a (usize, &Expr) tuple as identifiers. The first element of the tuple is a pre-calculated hash() of an expression tree that is built up during the first bottom-up traversal, and the reference to the expression is there to implement eq().
The second commit is a refactor and fix of the algorithm, as reverting https://github.com/apache/datafusion/issues/9873 caused the https://github.com/apache/datafusion/issues/9870 issue to resurface. This is a major refactor but I think the code of ExprIdentifierVisitor and CommonSubexprRewriter became much cleaner.
The 3rd commit eliminates some Expr clones in ExprSets.
The 4th and 5th commits contain only renames and docs fixes. I think ExprStats is a better name for ExprSet as the purpose of that data structure is to store the counts. Also, IMO CommonExpr / common_exprs is a better name for affected_id to store the common expressions that got extracted.
What changes are included in this PR?
Please see above.
Are these changes tested?
Yes, with existing UTs.
Are there any user-facing changes?
No.
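As an illustration of the follow-up idea from the rationale (a pre-calculated hash plus a reference to the expression), a rough sketch could look like the code below. All names here (Expr, Identifier, identify, example_count) are hypothetical, and u64 is used instead of usize simply because that is what Hasher::finish returns; this is not the design that was eventually implemented.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Toy expression tree (hypothetical; not DataFusion's `Expr`).
#[derive(Debug, PartialEq, Eq)]
enum Expr {
    Column(String),
    Add(Box<Expr>, Box<Expr>),
}

// Identifier: pre-calculated subtree hash plus a reference used for `eq()`.
#[derive(Debug, Clone, Copy)]
struct Identifier<'a> {
    subtree_hash: u64,
    expr: &'a Expr,
}

impl PartialEq for Identifier<'_> {
    fn eq(&self, other: &Self) -> bool {
        // The expression is only compared when the hashes already match.
        self.subtree_hash == other.subtree_hash && self.expr == other.expr
    }
}
impl Eq for Identifier<'_> {}

impl Hash for Identifier<'_> {
    // Hashing the key is O(1): it reuses the pre-calculated value.
    fn hash<H: Hasher>(&self, state: &mut H) {
        state.write_u64(self.subtree_hash);
    }
}

// Bottom-up traversal: a node's hash is derived from its own data and the
// already computed hashes of its children, and the subtree is counted.
fn identify<'a>(expr: &'a Expr, counts: &mut HashMap<Identifier<'a>, usize>) -> Identifier<'a> {
    let mut hasher = DefaultHasher::new();
    match expr {
        Expr::Column(name) => {
            0u8.hash(&mut hasher);
            name.hash(&mut hasher);
        }
        Expr::Add(l, r) => {
            1u8.hash(&mut hasher);
            identify(l, counts).subtree_hash.hash(&mut hasher);
            identify(r, counts).subtree_hash.hash(&mut hasher);
        }
    }
    let id = Identifier { subtree_hash: hasher.finish(), expr };
    *counts.entry(id).or_insert(0) += 1;
    id
}

fn a_plus_b() -> Expr {
    Expr::Add(
        Box::new(Expr::Column("a".to_string())),
        Box::new(Expr::Column("b".to_string())),
    )
}

// (a + b) + (a + b): returns how often the subtree `a + b` was counted.
fn example_count() -> usize {
    let expr = Expr::Add(Box::new(a_plus_b()), Box::new(a_plus_b()));
    let probe = a_plus_b();
    let mut counts = HashMap::new();
    identify(&expr, &mut counts);
    let probe_id = identify(&probe, &mut HashMap::new());
    let n = counts.get(&probe_id).copied().unwrap_or(0);
    n
}
```

In this sketch example_count() returns 2. Compared with a String key, nothing is copied from child to parent, and the full tree comparison in eq() runs only for entries whose hashes are equal.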