apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.34k stars 1.2k forks source link

`PushDownProjection` doesn't push down the empty projection to the child plans of `Union` #4772

Open jackwener opened 1 year ago

jackwener commented 1 year ago

_Originally posted by @HaoYang670 in https://github.com/apache/arrow-datafusion/pull/4753#discussion_r1058696036_

Jefffrey commented 1 year ago

Could you elaborate on the expected behaviour here? From the original discussion, regarding this test:

https://github.com/apache/arrow-datafusion/blob/5d4038a8463a575328bedbc22b32456f5dcd562c/datafusion/core/tests/sql/union.rs#L84-L98

I did an explain plan:

+------------------------------------------------------------+-----------------------------------------------------------------+
| plan_type                                                  | plan                                                            |
+------------------------------------------------------------+-----------------------------------------------------------------+
| initial_logical_plan                                       | Projection: COUNT(UInt8(1))                                     |
|                                                            |   Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]             |
|                                                            |     Union                                                       |
|                                                            |       Projection: t.a                                           |
|                                                            |         TableScan: t                                            |
|                                                            |       Projection: t.a                                           |
|                                                            |         TableScan: t                                            |
| logical_plan after inline_table_scan                       | SAME TEXT AS ABOVE                                              |
...
| logical_plan after common_sub_expression_eliminate         | SAME TEXT AS ABOVE                                              |
| logical_plan after push_down_projection                    | Projection: COUNT(UInt8(1))                                     |
|                                                            |   Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]             |
|                                                            |     Union                                                       |
|                                                            |       TableScan: t projection=[a]                               |
|                                                            |       TableScan: t projection=[a]                               |
| logical_plan after inline_table_scan                       | SAME TEXT AS ABOVE                                              |
...
| logical_plan after push_down_projection                    | SAME TEXT AS ABOVE                                              |
| logical_plan                                               | Projection: COUNT(UInt8(1))                                     |
|                                                            |   Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]             |
|                                                            |     Union                                                       |
|                                                            |       TableScan: t projection=[a]                               |
|                                                            |       TableScan: t projection=[a]                               |

What would be the expected logical plan, if it's not this?

HaoYang670 commented 1 year ago

Hi @Jefffrey, actually I am not sure whether we should do this. It is just something appearing in my mind one day.

(sorry it should be the aggregation functions but not empty projection in the title)

For union all, we can push down the aggregation functions to the child plans. For example, for a query select count(*) from (select * from t1 union select * from t2), we can optimize the logical plan to count each child plan and then sum them together.

Projection: Sum(col1) as COUNT(UInt8(1))                                 
  Aggregate: groupBy=[[]], aggr=[[Sum(col1)]]           
    Union                   
      Projection: COUNT(UInt8(1)) as  col1                                
        Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]                                     
          TableScan: t projection=[a]        
      Projection: COUNT(UInt8(1)) as col1                                
        Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]]                        
          TableScan: t projection=[a]

And it is similar for min, max, avg ...

Jefffrey commented 1 year ago

Ok I see what you mean now @HaoYang670

I suppose this could be a new logical optimizer rule push_down_aggregate, could be worth experimenting with I suppose.