apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.87k stars 1.11k forks source link

Allow custom planning behavior for selecting wildcard expression #11639

Closed goldmedal closed 3 weeks ago

goldmedal commented 1 month ago

Is your feature request related to a problem or challenge?

Currently, DataFusion always expands the wildcard expression to the list of columns when planning a SELECT statement. This causes problems for those who want to check the wildcard expression in the logical plan layer.

For example, given a table "A" with three columns: c1, c2, and c3, it is difficult to distinguish between SELECT * FROM A and SELECT c1, c2, c3 FROM A in the logical plan layer.

If we provide a way for users to customize the behavior of wildcard expression planning, it will be easier to distinguish between these two cases.

Describe the solution you'd like

I plan to implement this feature based on ExprPlanner. I guess we can add a method called plan_select_wildcard and putting the default expression expansion there. Then, invoke this method in SqlToRel::sql_select_to_rex. https://github.com/apache/datafusion/blob/main/datafusion/sql/src/select.rs#L594

Describe alternatives you've considered

No response

Additional context

My current project, Wren engine, is a semantic engine based on the DataFusion Logical Plan. We have a type of column called a "calculated field" that can be selected specifically but won't be included when querying using the wildcard expression. Since it could trigger some joins for its parent table, we want to ensure that it is only used when the user specifically selects it. This feature helps us align with the native behavior of DataFusion.

goldmedal commented 1 month ago

take

goldmedal commented 1 month ago

Hi @alamb, do you know the purpose of this comment? https://github.com/apache/datafusion/blob/7db4213b71ed9e914c5a4f16954abfa20b091ae3/datafusion/expr/src/logical_plan/builder.rs#L1443-L1458

I believe this expansion for wildcard duplicates what the planner does. https://github.com/apache/datafusion/blob/cf9da768306a1e103bfeae68f4f2ed3dfe87df7b/datafusion/sql/src/select.rs#L594-L595

I found the wildcard expression is still expanded even when I disable this behavior in the planner. I think the reason is that the planner invokes the LogicalPlanner to build the projection, which will expand the wildcard expression. https://github.com/apache/datafusion/blob/7db4213b71ed9e914c5a4f16954abfa20b091ae3/datafusion/sql/src/select.rs#L693

Could we remove the expansion behavior in the builder? (I guess that's what the comment purposes.)

alamb commented 1 month ago

Could we remove the expansion behavior in the builder? (I guess that's what the comment purposes.)

I think that is my understanding as well

I believe @jayzhan211 and @tshauck have had some recent discussion about this as well, so maybe they remember more than do I

tshauck commented 1 month ago

We were working on the related COUNT(*). At least in that case the * is just intercepted as the argument and replaced with 1, so I don't think there'd be a conflict between that work and changing the expansion behavior in the projection itself.

goldmedal commented 1 month ago

We were working on the related COUNT(*). At least in that case the * is just intercepted as the argument and replaced with 1, so I don't think there'd be a conflict between that work and changing the expansion behavior in the projection itself.

Thanks, @tshauck. I agree there is no conflict between them.

I'm curious, which PR did you do for this? There are duplicate behaviors for the expansion (in the select all case, we will expand the expression twice). I want to check if this behavior can be removed. If the expansion in the builder is required for another case, I will also need to modify it. If not, I think we can just remove it and address the TODO comment.

tshauck commented 1 month ago

It was part of the work to support COUNT() (note no *): https://github.com/apache/datafusion/pull/11229

jayzhan211 commented 1 month ago

@goldmedal If we move the expression expansion to analyzer as the comment said, I guess you can still tell the difference from SELECT * FROM A and SELECT c1, c2, c3 FROM A before optimizer. I think expand expression in builder is quite early, if there are 1000 columns, we need to carry those columns around in many places. It is better to delay expansion lazily as long as possible.

goldmedal commented 1 month ago

@goldmedal If we move the expression expansion to analyzer as the comment said, I guess you can still tell the difference from SELECT * FROM A and SELECT c1, c2, c3 FROM A before optimizer. I think expand expression in builder is quite early, if there are 1000 columns, we need to carry those columns around in many places. It is better to delay expansion lazily as long as possible.

I think I got it. So, ideally, we don't do any expansions in the planner or builder. We need to have an analyze rule or optimize rule to expand the expression. If we move this to analyze or optimizer, I can get the wildcard expression before optimizing, even if we don't add a customized behavior in ExprPlanner, right?

jayzhan211 commented 1 month ago

Correct

On Sat, Jul 27, 2024, 2:16 PM Jax Liu @.***> wrote:

@goldmedal https://github.com/goldmedal If we move the expression expansion to analyzer as the comment said, I guess you can still tell the difference from SELECT FROM A and SELECT c1, c2, c3 FROM A before optimizer. I think expand expression in builder is quite early, if there are 1000 columns, we need to carry those columns around in many places. It is better to delay expansion lazily* as long as possible.

I think I got it. So, ideally, we don't do any expansions in the planner or builder. We need to have an analyze rule or optimize rule to expand the expression. If we move this to analyze or optimizer, I can get the wildcard expression before optimizing, even if we don't add a customized behavior in ExprPlanner, right?

— Reply to this email directly, view it on GitHub https://github.com/apache/datafusion/issues/11639#issuecomment-2253813569, or unsubscribe https://github.com/notifications/unsubscribe-auth/ADZCLR3QAJ4XRXA7VKFYJITZOM3KBAVCNFSM6AAAAABLM6DYKKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDENJTHAYTGNJWHE . You are receiving this because you were mentioned.Message ID: @.***>

goldmedal commented 1 month ago

I have drafted a version https://github.com/apache/datafusion/pull/11681 for moving wildcard expansions to the analyzer. I think it's better than #11673. I might change the purpose of this PR. WDYT @alamb, @jayzhan211 ?

By the way, I'm thinking about handling replace syntax. We can discuss this issue in the PR if you have some ideas.

jayzhan211 commented 1 month ago

I think handling replace item in options is a good idea. We store the replace item in Expr::Wildcard in planning stage and convert it to expanded columns with expected replace items in analyzer.