Open avamingli opened 4 weeks ago
Many plan diffs, will fix later.
For query which has Aggregation but without Group by clause, the DISTINCT/DISTINCT ON/ORDER BY clause could be removed as there would be one row returned at most.
SRF will break the assumption.
select count(*), generate_series(1, 4) from t1;
count | generate_series
-------+-----------------
3 | 1
3 | 2
3 | 3
3 | 4
(4 rows)
Fix it and Postgres' WITH ORDINALITY
as well.
I took a look at orca, it has already optimized distinct
function.
explain select distinct(count(a)) from foo;
QUERY PLAN
------------------------------------------------------------------------------------
Finalize Aggregate (cost=0.00..526.96 rows=1 width=8)
-> Gather Motion 3:1 (slice1; segments: 3) (cost=0.00..526.96 rows=1 width=8)
-> Partial Aggregate (cost=0.00..526.96 rows=1 width=8)
-> Seq Scan on foo (cost=0.00..500.67 rows=3333334 width=4)
Optimizer: Pivotal Optimizer (GPORCA)
(5 rows)
Even if with group by
, the distinct
also can be removed
explain select distinct(count(a)) from foo group by a ;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3) (cost=0.00..1395.69 rows=1000 width=8)
-> HashAggregate (cost=0.00..1395.66 rows=334 width=8)
Group Key: (count(a))
-> Redistribute Motion 3:3 (slice2; segments: 3) (cost=0.00..1395.62 rows=334 width=8)
Hash Key: (count(a))
-> Streaming HashAggregate (cost=0.00..1395.61 rows=334 width=8)
Group Key: count(a)
-> HashAggregate (cost=0.00..985.15 rows=3333334 width=8)
Group Key: a
Planned Partitions: 16
-> Redistribute Motion 3:3 (slice3; segments: 3) (cost=0.00..567.20 rows=3333334 width=4)
Hash Key: a
-> Seq Scan on foo (cost=0.00..500.67 rows=3333334 width=4)
Optimizer: Pivotal Optimizer (GPORCA)
(14 rows)
as distinct is a function which only works in a group.
The function called PexprRemoveSuperfluousDistinctInDQA
in orca.
I took a look at orca, it has already optimized
distinct
function.explain select distinct(count(a)) from foo; QUERY PLAN ------------------------------------------------------------------------------------ Finalize Aggregate (cost=0.00..526.96 rows=1 width=8) -> Gather Motion 3:1 (slice1; segments: 3) (cost=0.00..526.96 rows=1 width=8) -> Partial Aggregate (cost=0.00..526.96 rows=1 width=8) -> Seq Scan on foo (cost=0.00..500.67 rows=3333334 width=4) Optimizer: Pivotal Optimizer (GPORCA) (5 rows)
Even if with
group by
, thedistinct
also can be removedexplain select distinct(count(a)) from foo group by a ; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------ Gather Motion 3:1 (slice1; segments: 3) (cost=0.00..1395.69 rows=1000 width=8) -> HashAggregate (cost=0.00..1395.66 rows=334 width=8) Group Key: (count(a)) -> Redistribute Motion 3:3 (slice2; segments: 3) (cost=0.00..1395.62 rows=334 width=8) Hash Key: (count(a)) -> Streaming HashAggregate (cost=0.00..1395.61 rows=334 width=8) Group Key: count(a) -> HashAggregate (cost=0.00..985.15 rows=3333334 width=8) Group Key: a Planned Partitions: 16 -> Redistribute Motion 3:3 (slice3; segments: 3) (cost=0.00..567.20 rows=3333334 width=4) Hash Key: a -> Seq Scan on foo (cost=0.00..500.67 rows=3333334 width=4) Optimizer: Pivotal Optimizer (GPORCA) (14 rows)
as distinct is a function which only works in a group.
The function called
PexprRemoveSuperfluousDistinctInDQA
in orca.
Yeah, see https://github.com/apache/cloudberry/discussions/677#discussioncomment-10966471
For query which has Aggregation but without Group by clause, the DISTINCT/DISTINCT ON/ORDER BY clause could be removed as there would be one row returned at most. And there is no necessary to do unique or sort. This can simply the plan, and process less expressions like: Aggref nodes during planner.
DISTINCT
After this commit:
DISTINCT ON and ORDER BY
After this commit:
ORDER BY
After this commit:
DISTINCT and ORDER BY
After this commit:
Authored-by: Zhang Mingli avamingli@gmail.com
fix #ISSUE_Number
Change logs
Describe your change clearly, including what problem is being solved or what feature is being added.
If it has some breaking backward or forward compatibility, please clary.
Why are the changes needed?
Describe why the changes are necessary.
Does this PR introduce any user-facing change?
If yes, please clarify the previous behavior and the change this PR proposes.
How was this patch tested?
Please detail how the changes were tested, including manual tests and any relevant unit or integration tests.
Contributor's Checklist
Here are some reminders and checklists before/when submitting your pull request, please check them:
make installcheck
make -C src/test installcheck-cbdb-parallel
cloudberrydb/dev
team for review and approval when your PR is ready🥳