This PR is a minor performance improvement on top of the changes in https://github.com/apache/datafusion/pull/13132.
It applies when rewriting plans that have aggregates whose lhs / rhs contains a filter and a scan carrying the same filter.
For the query
select
c_custkey,
count(o_orderkey)
from
customer left outer join orders on
c_custkey = o_custkey
and o_comment not like '%special%requests%'
group by
c_custkey
The logical plan is
+---------------+----------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------------------------------------------+
| logical_plan | BytesProcessedNode |
| | Federated |
| | Projection: customer.c_custkey, count(orders.o_orderkey) |
| | Aggregate: groupBy=[[customer.c_custkey]], aggr=[[count(orders.o_orderkey)]] |
| | Left Join: Filter: customer.c_custkey = orders.o_custkey |
| | TableScan: customer |
| | Filter: orders.o_comment NOT LIKE Utf8("%special%requests%") |
| | TableScan: orders, partial_filters=[orders.o_comment NOT LIKE Utf8("%special%requests%")] |
+---------------+----------------------------------------------------------------------------------------------+
The rewritten query will be:
SELECT customer.c_custkey, count(orders.o_orderkey) FROM customer LEFT JOIN orders ON ((customer.c_custkey = orders.o_custkey) AND (orders.o_comment NOT LIKE '%special%requests%' AND orders.o_comment NOT LIKE '%special%requests%')) GROUP BY customer.c_custkey
Under the current approach, the filter orders.o_comment NOT LIKE Utf8("%special%requests%") occurs twice in the final query. This has no effect on the correctness of the query result, but the duplicated condition adds performance overhead.
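The idea of dropping the repeated condition can be illustrated with a minimal sketch. It is not the code in this PR: it treats predicates as plain strings rather than DataFusion expressions, and the function names `dedup_conjuncts` and `build_on_clause` are hypothetical.

```rust
use std::collections::HashSet;

/// Return the predicates with exact duplicates removed, keeping the
/// first occurrence of each and preserving the original order.
/// Assumption: predicates are compared as plain strings, which is a
/// simplification of comparing real expression trees.
fn dedup_conjuncts(predicates: &[&str]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut unique = Vec::new();
    for p in predicates {
        // `insert` returns false when the predicate was already seen.
        if seen.insert(*p) {
            unique.push(p.to_string());
        }
    }
    unique
}

/// Build the text of an ON clause from the join condition plus the
/// filters collected from the plan, without repeating any condition.
fn build_on_clause(join_condition: &str, filters: &[&str]) -> String {
    let mut parts = vec![join_condition];
    parts.extend_from_slice(filters);
    dedup_conjuncts(&parts).join(" AND ")
}
```

Called with the join condition `customer.c_custkey = orders.o_custkey` and the `NOT LIKE` filter listed twice (once from the Filter node and once from the scan's partial_filters), `build_on_clause` emits the filter only once.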
Which issue does this PR close?
N/A
Rationale for this change
For the query above, the rewritten query contains the filter orders.o_comment NOT LIKE '%special%requests%' twice. This has no effect on the correctness of the query result, but the duplicated condition adds performance overhead.
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
No