apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.33k stars 1.2k forks source link

feat: Wrap `TableScan` with `Filter` in Join Unparsing #13496

Open jonathanc-n opened 2 days ago

jonathanc-n commented 2 days ago

Which issue does this PR close?

Closes #13156 .

Rationale for this change

What changes are included in this PR?

Pushes down filter to tablescan instead of having it apply on the join in unparsing.

Are these changes tested?

Changed the previous test

Are there any user-facing changes?

sgrebnov commented 4 hours ago

@jonathanc-n, @goldmedal - thank you, I've reviewed this change and it seems it brings back the following issue (there is additional context of why filtering added this way produces incorrect result) https://github.com/apache/datafusion/pull/13132 . I really like the change but can we improve this to see if we can wrap TableScan with Filter as a subquery when it is required

Example query.

Original query / LogicalPlan / Result

select
    c_custkey,
    count(o_orderkey)
from
    customer left join orders on c_custkey = o_custkey and o_comment not like '%special%requests%'
group by
    c_custkey
|               |  Projection: customer.c_custkey, count(orders.o_orderkey)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
|               |   Aggregate: groupBy=[[customer.c_custkey]], aggr=[[count(orders.o_orderkey)]]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                |
|               |     Left Join:  Filter: customer.c_custkey = orders.o_custkey                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
|               |       TableScan: customer projection=[c_custkey]                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
|               |       TableScan: orders projection=[o_orderkey, o_custkey], full_filters=[orders.o_comment NOT LIKE Utf8("%special%requests%")]  

Result:

1489    29
1269    0
652 24
273 0
51  0
...

Existing unparser (main)

select
    "customer"."c_custkey",
    count("orders"."o_orderkey")
from
    "customer"
left join "orders" on
    (("customer"."c_custkey" = "orders"."o_custkey")
        and "orders"."o_comment" not like '%special%requests%')
group by
    "customer"."c_custkey"

Result:

1489    29
1269    0
652 24
273 0
51  0
...

Proposed change

select
    "customer"."c_custkey",
    count("orders"."o_orderkey")
from
    "customer"
left join "orders" on
    ("customer"."c_custkey" = "orders"."o_custkey")
where
    "orders"."o_comment" not like '%special%requests%'
group by
    "customer"."c_custkey"

Result

1489    29
652 24
1091    1
70  15
839 14

If intent for filter to be moved it must be wrapped as subquery in this case:

select
    c_custkey,
    count(o_orderkey)
from
    customer left join (select * from orders where o_comment not like '%special%requests%') on c_custkey = o_custkey
group by
    c_custkey

Result

1489    29
1269    0
652 24
273 0
51  0