Fix and improve `CommonSubexprEliminate` rule

peter-toth commented 1 week ago

Which issue does this PR close?

Part of https://github.com/apache/datafusion/issues/9873.

Rationale for this change

This PR started as part of https://github.com/apache/datafusion/issues/9873 to reduce number of Expr clones but after some investigation it shifted to be a fix for the rule's correctness issues.

The current CommonSubexprEliminate was refactored in https://github.com/apache/datafusion/pull/9871 to remove the IdArray cache and simplify the identifier of expresions. Unfortunately that change doesn't seem to be correct. The source of the issue is that an idenfifier needs to represent an expression subtreee and the newly chosen "stringified expr" as identifier doesn't seem to fulfill that purpose. E.g. an identifier shouldn't belong to 2 different expressions:
```
println!("{}", ExprSet::expr_identifier(&(col("a") + col("b"))));    // a + b
println!("{}", ExprSet::expr_identifier(&(col("a + b"))));           // a + b
```
Sidenote: Actually I wanted to show that correctness issue of the current CommonSubexprEliminate in a test, but when I wrote a test with colliding column names I run into a different issue, that DataFusion resolves the col("a") + col("b") expression as if it was col("a + b") if an a + b field exists in the schema. This is a different issue (not related to CommonSubexprEliminate at all) and can be easily reproduced:
```
DataFusion CLI v37.1.0
> select a + b from (select 1 as a, 2 as b, 1 as "a + b");
+-------+
| a + b |
+-------+
| 1     | <- Should be 3
+-------+
1 row(s) fetched.
Elapsed 0.009 seconds.
```
So in this the first commit of PR I revert https://github.com/apache/datafusion/pull/9871.
Then I investigated what is the actual purpose of Identifiers, why don't we use a simple HashMap<Expr, (usize, DataType, Identifier)> as ExprSet? It is clear that we need to generate a unique alias for the extracted common expressions, but why is the key of the map is an Identifier and not &Expr or Expr itself. And actually it turned out that the reasons are already explained in the comments.
```
/// An identifier should (ideally) be able to "hash", "accumulate", "equal" and "have no
/// collision (as low as possible)"
```
If we used Expr as the key of the map computing the hash() of the keys would require traversing on the whole Expr, which can be very costly as Exprs contain indirections to subexpressions as Box<Expr> or Vec<Expr>.

Using special identifiers to represent Expr trees and caching those identifiers by the preorder visit indexes in IdArray should significantly speed up the second top-down traversal that does the actual expression rewrite.

Sidenote: the current long String identifiers are also not a good choice. We need to revisit this in a follow-up PR and choose something like (usize, &Expr) tuple as identifiers. The first element of a tuple is a pre-calculated hash() of an expression tree, that is built-up during the first bottom-up traversal. And the referece to expression is there to implement the eq().
The second commit is a refactor and fix of the algorithm as reverting https://github.com/apache/datafusion/issues/9873 caused the https://github.com/apache/datafusion/issues/9870 issue to resurface. This is a major refactor but I think the code of ExprIdentifierVisitor and CommonSubexprRewriter became much cleaner.
The 3rd commit eliminates some Expr clones in ExprSets.
The 4th and 5th commit contain only renames and docs fixes. I think ExprStats is a better name for ExprSet as the purpose of that data structure is store the counts. Also, IMO CommonExpr / common_exprs is a better name for affected_id to store the common expressions that got extracted.

What changes are included in this PR?

Please see above.

Are these changes tested?

Yes, with existing UTs.

Are there any user-facing changes?

No.

peter-toth commented 1 week ago

cc @alamb, @erratic-pattern and @MohamedAbdeen21 as this PR is related to your recent comments/PRs.

alamb commented 1 week ago

MohamedAbdeen21 commented 1 week ago

This seems to be a major conflict with #10333

The source of the issue is that an idenfifier needs to represent an expression subtreee and the newly chosen "stringified expr" as identifier doesn't seem to fulfill that purpose. E.g. an identifier shouldn't belong to 2 different expressions

I agree that an identifier shouldn't belong to 2 different expressions, but why does it have to represent a subtree? The expr IS the subtree itself. If we use an identifier like #{expr} that should be good enough.

And actually it turned out that the reasons are already explained in the comments

AFAIK, we only need the identifier to be unique (no collision) for correctness, I don't see why we require the other traits.

Edit:

Just took a look at the tests, we are basically trying to do the same thing, although your PR is probably more efficient. We only differ on the subtree part I mentioned above. I tried to do #{expr}, you're trying {expr|subtree}; which I think is unnecessary and unreadable. I would even argue that something like #1 is enough.

alamb commented 1 week ago

I will review this one carefully tomorrow morning.

alamb commented 1 week ago

cc @waynexia and @wiedld for your information

peter-toth commented 1 week ago

The source of the issue is that an idenfifier needs to represent an expression subtreee and the newly chosen "stringified expr" as identifier doesn't seem to fulfill that purpose. E.g. an identifier shouldn't belong to 2 different expressions

I agree that an identifier shouldn't belong to 2 different expressions, but why does it have to represent a subtree? The expr IS the subtree itself. If we use an identifier like #{expr} that should be good enough.

In the first traversal we need to count how many times we encountered with an Expr (expression subtree). We either need to use Expr as the key of the map to store the counts or an use an identifier that uniquely identifies an Expr.

Edit:

Just took a look at the tests, we are basically trying to do the same thing, although your PR is probably more efficient. We only differ on the subtree part I mentioned above. I tried to do #{expr}, you're trying {expr|subtree}; which I think is unnecessary and unreadable. I would even argue that something like #1 is enough.

The issue is not about the aliases we assign to the extracted common expressions, it is about the key of the map where we store the counts. Your PR can be a good improvement to the aliases after this PR, but we need need to fix the key of map first.

MohamedAbdeen21 commented 1 week ago

Ok yeah, I see what you mean.

But the example you mention with a + b. Doesn't that go away if we fix the side note you mentioned?

col("a + b") should be interpreted as table."a + b" and not test.a + test.b (and vice versa for the SQL example), meaning that expr would never collide in the map, right or am I missing something?

peter-toth commented 1 week ago

But the example you mention with a + b. Doesn't that go away if you fix the side note you mentioned?

col("a + b") should be interpreted as table."a + b" and not test.a + test.b, meaning that expr would never collide in the map, right or am I missing something?

There are multiple questions here and I don't have the answers for.

Can we fix the string representation? Do we always have a table reference to prefix the columns with? And actually, is the resolution bug related to the string representation of expressions at all?
And even if the string representation of Expr can be fixed, using a Strings as a key of the map is not a good choice as building a string represatation will very likely require traversing the whole expression. Think of that we need to get an identifier for an expression, and for all its subtrees, and for the subtrees' subtrees... So a good identifier would be something that we can build-up incrementally from the identifiers of the subtrees (in the first traversal's bottom-up phase).

I think the best we can do now is to revert https://github.com/apache/datafusion/pull/9871 and return to the old chained string representation. (This PR improves the identifier readability a bit by adding {} around it and | as separator of elements.) And then in a follow-up PR replace the string representation to an alternative identifier like the one I mentioned in the PR description.

MohamedAbdeen21 commented 1 week ago

Although I'd like to find answers to these questions before giving more opinions, I don't mind merging this for now.

peter-toth commented 1 week ago

I also filed #10413 to track the bug you found (🦅 👁️ ). However, this PR doesn't seem to fix it yet 🤔 . I pushed a test to show this and also tried it manually:

No, this PR doesn't fix that issue at all. That issue is a resolution issue (https://github.com/apache/datafusion/issues/10413) and has nothing to do with CSE. The example I gave in the description doesn't contain any subexpressions to eliminate and CommonSubexprEliminate has no effect on the query plan.

The reason I mentioned the resolution issue is because of that issue I couldn't add a test case to this PR which would illustrate the issue of colflicting identifiers in CommonSubexprEliminate after https://github.com/apache/datafusion/pull/9871.

Once https://github.com/apache/datafusion/issues/10413 is solved I can add a test case here.

I believe @MohamedAbdeen21 used #{expr} in https://github.com/apache/datafusion/pull/10333 to follow what is done by DuckDB -- perhaps we could do so too in this PR (I also think #{} is slightly easier to notice visually than {})

I fully aggree that the current alias is very hard to read and this is because the identifiers are used for aliases as well. But there are 2 different things here:

The identifier must be unique for each unique expression when it is used as the key of the ExprStats map that stores the counts.
The aliases we gave to extracted common expressions shouldn't collide.

Currently for both 1. and 2. we use the identifier and I'm sure that in 1. we have touse the identifier. In 2. I'm not sure and @MohamedAbdeen21's PR can be a good follow-up improvement.

alamb commented 1 week ago

No, this PR doesn't fix that issue at all. That issue is a resolution issue (https://github.com/apache/datafusion/issues/10413) and has nothing to do with CSE. The example I gave in the description doesn't contain any subexpressions to eliminate and CommonSubexprEliminate has no effect on the query plan.

Sorry -- I missed that -- updated https://github.com/apache/datafusion/issues/10413 to match

peter-toth commented 1 week ago

Thanks for the benchmarks @alamb! Maybe the longer identifiers can explain that gap.

peter-toth commented 1 week ago

@alamb, IMO if this PR can be merged then the next steps should be:

Fix https://github.com/apache/datafusion/issues/10413 as that is correctness bug.
Continue https://github.com/apache/datafusion/pull/10067 efforts to refactor the CommonSubexprEliminate rule to rewrite as that could improve a lot on https://github.com/apache/datafusion/issues/9873.
Rebase @MohamedAbdeen21's https://github.com/apache/datafusion/pull/10333 on the top of this PR as probably we don't need to use the current string identifiers in aliases and we could improve readablity.
Revisit the identifiers as using these string identifiers as the keys of ExprStats was not the best choice. (Please note that this was not my choice but this is how CSE has been working since the feature was added initially.) See my comment on this in the PR description.

I'm happy to take 4 as I already worked on it a bit, but unfortunately I have very little time to work on this project lately so I can't take 1. and 2.

MohamedAbdeen21 commented 1 week ago

I'll rebase my PR this weekend.

I do have other changes in mind regarding plan readability. If 1 is still available by the time I'm done, I'll be happy to take a look at it.

alamb commented 1 week ago

@alamb, IMO if this PR can be merged then the next steps should be:

Fix Incorrect results with expression resolution #10413 as that is correctness bug.

Agree -- this is now tracked as its own issue and we can deal with it separately

Continue Rewrite CommonSubexprEliminate to avoid copies using TreeNode #10067 efforts to refactor the CommonSubexprEliminate rule to rewrite as that could improve a lot on Stop copying Exprs and LogicalPlans so much during Common Subexpression Elimination #9873.

I will do this

Rebase @MohamedAbdeen21's make common expression alias human-readable #10333 on the top of this PR as probably we don't need to use the current string identifiers in aliases and we could improve readablity.

Sounds like @MohamedAbdeen21 is going to do this maybe this weekend

Revisit the identifiers as using these string identifiers as the keys of ExprStats was not the best choice. (Please note that this was not my choice but this is how CSE has been working since the feature was added initially.) See my comment on this in the PR description.

👍

I'm happy to take 4 as I already worked on it a bit, but unfortunately I have very little time to work on this project lately so I can't take 1. and 2.

That would be amazing -- thank you @peter-toth -- I filed https://github.com/apache/datafusion/issues/10426 to track

alamb commented 1 week ago

All right, I think we have our next steps outlined and tracked with tickets. 🚀 !

alamb commented 1 week ago

Thanks again @peter-toth and @MohamedAbdeen21

peter-toth commented 1 week ago

Thanks for the review!

apache / datafusion