apache / cloudberry

An advanced and mature open-source MPP (Massively Parallel Processing) database, and an open-source alternative to Greenplum Database.
https://cloudberry.apache.org
Apache License 2.0

[Bug] Optimizer could choose HashAggregate to dedup based on cost. #659

Open my-ship-it opened 1 month ago

my-ship-it commented 1 month ago

Cloudberry Database version

main

What happened

The optimizer could produce a better plan for deduplication based on cost.

What you think should happen instead

No response

How to reproduce

For the planner, we need to create a better plan that uses HashAggregate to deduplicate columns. For example:

ebi_dwh=# explain analyze WITH UniqueIDs AS (
    SELECT DISTINCT gsp_id
    FROM eft_com_call_statistic
)
SELECT COUNT(*)
FROM UniqueIDs;
                                                                       QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=0.00..1419.47 rows=1 width=8) (actual time=2510.079..2510.081 rows=1 loops=1)
   ->  Gather Motion 4:1  (slice1; segments: 4)  (cost=0.00..1419.47 rows=1 width=8) (actual time=1998.786..2510.054 rows=4 loops=1)
         ->  Partial Aggregate  (cost=0.00..1419.47 rows=1 width=8) (actual time=1993.802..1993.803 rows=1 loops=1)
               ->  HashAggregate  (cost=0.00..1419.47 rows=2315005 width=1) (actual time=1294.980..1855.013 rows=2339996 loops=1)
                     Group Key: gsp_id
                     ->  Seq Scan on eft_com_call_statistic  (cost=0.00..969.33 rows=2375680 width=37) (actual time=0.343..339.551 rows=2377251 loops=1)
 Planning Time: 5.357 ms
   (slice0)    Executor memory: 98354K bytes.
   (slice1)    Executor memory: 244665K bytes avg x 4x(0) workers, 244755K bytes max (seg2).  Work_mem: 286737K bytes max.
 Memory used:  2097152kB
 Optimizer: Pivotal Optimizer (GPORCA)
 Execution Time: 2584.618 ms
(12 rows)

This plan uses HashAggregate to perform gsp_id deduplication, while the following query does not:

 ebi_dwh=# explain analyze select count(distinct gsp_id) from eft_com_call_statistic;
                                                                    QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=0.00..1142.67 rows=1 width=8) (actual time=10573.416..10573.418 rows=1 loops=1)
   ->  Gather Motion 4:1  (slice1; segments: 4)  (cost=0.00..1142.67 rows=1 width=8) (actual time=10205.645..10573.387 rows=4 loops=1)
         ->  Partial Aggregate  (cost=0.00..1142.67 rows=1 width=8) (actual time=10437.005..10437.007 rows=1 loops=1)
               ->  Seq Scan on eft_com_call_statistic  (cost=0.00..969.33 rows=2375680 width=37) (actual time=0.269..548.673 rows=2377251 loops=1)
 Planning Time: 4.608 ms
   (slice0)    Executor memory: 37K bytes.
   (slice1)    Executor memory: 148480K bytes avg x 4x(0) workers, 148579K bytes max (seg2).
 Memory used:  2097152kB
 Optimizer: Pivotal Optimizer (GPORCA)
 Execution Time: 10573.853 ms
(10 rows)

It seems the cost estimation is not accurate: ORCA costs the count(distinct) plan lower (1142.67 vs 1419.47), yet that plan runs about four times slower (10.57 s vs 2.58 s).

Operating System

No specific

Anything else

No


my-ship-it commented 1 month ago

For the Postgres planner, in the function add_single_dqa_hash_agg_path, the code doesn't match the comments. We need to investigate that:

        /*
         * 1. If the input's locus matches the DISTINCT, but not GROUP BY:
         *
         *  HashAggregate
         *     -> Redistribute (according to GROUP BY)
         *         -> HashAggregate (to eliminate duplicates)
         *             -> input (hashed by GROUP BY + DISTINCT)
         *
         * 2. If the input's locus matches the GROUP BY:
         *
         *  HashAggregate (to aggregate)
         *     -> HashAggregate (to eliminate duplicates)
         *           -> input (hashed by GROUP BY)
         *
         * The main planner should already have created the single-stage
         * Group Agg path.
         *
         * XXX: not sure if this makes sense. If hash distinct is a good
         * idea, why doesn't PostgreSQL's agg node implement that?
         */
        path = (Path *) create_agg_path(root,
                                        output_rel,
                                        path,
                                        input_target,
                                        AGG_HASHED,
                                        AGGSPLIT_SIMPLE,
                                        false, /* streaming */
                                        dqa_group_clause,
                                        NIL,
                                        ctx->agg_partial_costs, /* FIXME */
                                        clamp_row_est(dNumDistinctGroups / (double) num_input_segments));

        if (group_need_redistribute)
            path = cdbpath_create_motion_path(root, path, NIL, false,
                                              group_locus);

        path = (Path *) create_agg_path(root,
                                        output_rel,
                                        path,
                                        ctx->target,
                                        ctx->groupClause ? AGG_HASHED : AGG_PLAIN,
                                        AGGSPLIT_DEDUPLICATED,
                                        false, /* streaming */
                                        ctx->groupClause,
                                        /* ... snippet truncated ... */

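For reference, the query shape this function plans is a single DISTINCT-qualified aggregate (DQA), optionally with a GROUP BY; illustrative examples (table and column names are made up):

-- single DQA without GROUP BY
select count(distinct v2) from t1;

-- single DQA with GROUP BY
select v1, count(distinct v2) from t1 group by v1;
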
avamingli commented 1 month ago

Do we have SQL steps to reproduce this?

jianlirong commented 1 month ago

Please check the description of this issue above. You can create a very simple table, and insert some data. Then, you can use the above SQL statements to reproduce this issue.
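
For instance, a minimal setup could look like this (a sketch; the table mirrors the one in the description, and the data distribution is illustrative):

create table eft_com_call_statistic(gsp_id text);
insert into eft_com_call_statistic
select 'gsp_' || (i % 100000)
from generate_series(1, 2000000) i;
analyze eft_com_call_statistic;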

avamingli commented 1 month ago

Hi all, I tried to reproduce this with some data, but got the opposite result: the CTE query is slower than the normal aggregate. Here is my case; I use an AO table to make sure there is no buffer cache when comparing results each time.

create table ao(id int) using ao_row;
insert into ao select i from generate_series(1, 1000000) i;
analyze;

Normal AGG

 explain(analyze) select count(distinct id) from ao ;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=4268.72..4268.73 rows=1 width=8) (actual time=235.218..235.221 rows=1 loops=1)
   ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=4268.67..4268.72 rows=3 width=8) (actual time=188.897..235.179 rows=3 loops=1)
         ->  Partial Aggregate  (cost=4268.67..4268.68 rows=1 width=8) (actual time=234.069..234.071 rows=1 loops=1)
               ->  Seq Scan on ao  (cost=0.00..3435.33 rows=333333 width=4) (actual time=0.366..98.303 rows=334042 loops=1)
 Planning Time: 0.716 ms
   (slice0)    Executor memory: 114K bytes.
   (slice1)    Executor memory: 12290K bytes avg x 3x(0) workers, 12290K bytes max (seg0).
 Memory used:  128000kB
 Optimizer: Postgres query optimizer
 Execution Time: 236.235 ms
(10 rows)

Use a CTE SQL:

explain(analyze) WITH UniqueIDs AS (
    SELECT DISTINCT id
    FROM ao
)
SELECT COUNT(*)
FROM UniqueIDs;
                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=8435.39..8435.40 rows=1 width=8) (actual time=674.920..674.923 rows=1 loops=1)
   ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=8435.33..8435.38 rows=3 width=8) (actual time=633.443..674.883 rows=3 loops=1)
         ->  Partial Aggregate  (cost=8435.33..8435.34 rows=1 width=8) (actual time=623.181..623.184 rows=1 loops=1)
               ->  HashAggregate  (cost=4268.67..7602.00 rows=333333 width=4) (actual time=419.478..583.788 rows=334042 loops=1)
                     Group Key: ao.id
                     ->  Seq Scan on ao  (cost=0.00..3435.33 rows=333333 width=4) (actual time=0.534..117.273 rows=334042 loops=1)
 Planning Time: 0.888 ms
   (slice0)    Executor memory: 12333K bytes.
   (slice1)    Executor memory: 22960K bytes avg x 3x(0) workers, 22960K bytes max (seg0).  Work_mem: 30737K bytes max.
 Memory used:  128000kB
 Optimizer: Postgres query optimizer
 Execution Time: 696.665 ms
(12 rows)

avamingli commented 1 month ago

Same for ORCA:

explain(analyze) select count(distinct id) from ao ;
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=0.00..440.60 rows=1 width=8) (actual time=326.314..326.317 rows=1 loops=1)
   ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..440.60 rows=1 width=8) (actual time=236.042..326.277 rows=3 loops=1)
         ->  Partial Aggregate  (cost=0.00..440.60 rows=1 width=8) (actual time=234.961..234.964 rows=1 loops=1)
               ->  Seq Scan on ao  (cost=0.00..437.97 rows=333334 width=4) (actual time=0.354..105.905 rows=334042 loops=1)
 Planning Time: 29.959 ms
   (slice0)    Executor memory: 114K bytes.
   (slice1)    Executor memory: 12290K bytes avg x 3x(0) workers, 12290K bytes max (seg0).
 Memory used:  128000kB
 Optimizer: Pivotal Optimizer (GPORCA)
 Execution Time: 327.857 ms
(10 rows)

Use CTE SQL:

explain(analyze) WITH UniqueIDs AS (
    SELECT DISTINCT id
    FROM ao
)
SELECT COUNT(*)
FROM UniqueIDs;
                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=0.00..480.67 rows=1 width=8) (actual time=1068.652..1068.655 rows=1 loops=1)
   ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..480.67 rows=1 width=8) (actual time=789.968..1068.613 rows=3 loops=1)
         ->  Partial Aggregate  (cost=0.00..480.67 rows=1 width=8) (actual time=785.682..785.685 rows=1 loops=1)
               ->  HashAggregate  (cost=0.00..480.67 rows=333334 width=1) (actual time=435.946..748.697 rows=334042 loops=1)
                     Group Key: id
                     ->  Seq Scan on ao  (cost=0.00..437.97 rows=333334 width=4) (actual time=0.534..120.923 rows=334042 loops=1)
 Planning Time: 41.497 ms
   (slice0)    Executor memory: 12333K bytes.
   (slice1)    Executor memory: 22960K bytes avg x 3x(0) workers, 22960K bytes max (seg0).  Work_mem: 30737K bytes max.
 Memory used:  128000kB
 Optimizer: Pivotal Optimizer (GPORCA)
 Execution Time: 1089.742 ms
(12 rows)

jianlirong commented 1 month ago

I think you have reproduced the problem: we generated two different plans for those two SQL statements, which are functionally equivalent. One uses the GroupAggregate node to perform deduplication, while the other uses HashAggregate.

You can easily create a case where HashAggregate for deduplication is more efficient than GroupAggregate.

Based on the table you created, try this:

insert into ao select i from generate_series(1, 1000) i;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;
insert into ao select * from ao;

And then, run the tests again.
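
If you prefer not to repeat the statement by hand, the doubling can be scripted (a sketch, assuming a PL/pgSQL DO block; each pass doubles the row count):

do $$
begin
  for i in 1..12 loop
    insert into ao select * from ao;  -- doubles the table each iteration
  end loop;
end $$;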

avamingli commented 1 month ago

I think you have reproduced the problem: we generated two different plans for those two SQL statements, which are functionally equivalent. [...] And then, run the tests again.

Got it, in this case the CTE works better.

avamingli commented 1 month ago

we generated two different plans for those two SQL statements which are functionally equivalent.

I debugged this for a while:

For the CTE SQL, the HashAgg on SeqScan came from the CTE path: the planner does a count agg on the CTE subquery, so the plan has an additional dedup HashAgg node, and it must be that one.

For the SQL select count(distinct id) from ao; I found we already have the code to add an additional dedup HashAgg path in the PG planner. Yes, there was a path like that (HashAgg on SeqScan to dedup), but the planner drops it due to cost, compared to a normal two-stage aggregation path. Either plan is correct for select count(distinct id) from ao;.

The cost of HashAggregate to dedup may need to be adjusted.

my-ship-it commented 1 month ago

The cost of HashAggregate to dedup may need to be adjusted.

Yes, in fact, dedup + count is much faster than count(distinct) in some cases. We need to adjust the cost estimation very carefully.
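
Until then, a hand rewrite forces the dedup-then-count shape; the subquery form below is equivalent to the CTE used earlier (a sketch):

select count(*) from (select distinct id from ao) dedup;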

avamingli commented 1 month ago

For the SQL select count(distinct id) from ao; I found we already have the code to add an additional dedup HashAgg path in the PG planner. Yes, there was a path like that (HashAgg on SeqScan to dedup), but the planner drops it due to cost, compared to a normal two-stage aggregation path.

Correction: after some debugging, I found there is no such path; the dedup HashAgg we currently have is much like a one-phase agg:

explain(analyze)  select count(distinct id) from ao ;
                                                                   QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=20917.17..20917.18 rows=1 width=8) (actual time=1237.827..1237.830 rows=1 loops=1)
   ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=20898.00..20914.67 rows=1000 width=4) (actual time=1131.395..1237.586 rows=1000 loops=1)
         ->  HashAggregate  (cost=20898.00..20901.33 rows=333 width=4) (actual time=1212.114..1212.211 rows=340 loops=1)
               Group Key: id
               ->  Seq Scan on ao  (cost=0.00..14071.33 rows=1365333 width=4) (actual time=1.418..646.267 rows=1392640 loops=1)
 Planning Time: 0.761 ms
   (slice0)    Executor memory: 114K bytes.
   (slice1)    Executor memory: 286K bytes avg x 3x(0) workers, 286K bytes max (seg2).  Work_mem: 61K bytes max.
 Memory used:  128000kB
 Optimizer: Postgres query optimizer
 Execution Time: 1238.826 ms
(11 rows)

That's not what we expected: it gathers all live tuples to do the agg on the master. We need to generate a plan that handles DISTINCT the way the CTE plan does, and let it win according to the user's case. That is much like the 3-phase AGG in GPDB, and I'm working on it.

avamingli commented 1 month ago

Hi all, fixed in #676

explain(costs off)
select count(distinct a) from t_issue_659;
                   QUERY PLAN
-------------------------------------------------
 Finalize Aggregate
   ->  Gather Motion 3:1  (slice1; segments: 3)
         ->  Partial Aggregate
               ->  HashAggregate
                     Group Key: a
                     ->  Seq Scan on t_issue_659
 Optimizer: Postgres query optimizer
(7 rows)

The planner may choose the new plan by cost. And as discussed, we provide a GUC, gp_eager_distinct_dedup, for users.

set gp_eager_distinct_dedup = on;

Setting it to true will make the planner use the new plan.
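
For example (a sketch; t_issue_659 is the table from the plan above):

set gp_eager_distinct_dedup = off;
explain (costs off) select count(distinct a) from t_issue_659;  -- planner picks by cost

set gp_eager_distinct_dedup = on;
explain (costs off) select count(distinct a) from t_issue_659;  -- planner prefers the dedup plan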

jianlirong commented 1 month ago

Does it only work when the distinct key is the same as the distribution key, or does it not matter?

avamingli commented 1 month ago

Does it only work when the distinct key is the same as the distribution key, or does it not matter?

GPDB already has a 3-phase agg plan for the case where the distinct key is not the same as the distribution key or group key.

After #676, it should work for both cases, whether the distinct key is the same as the distribution key or not.
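
A quick way to exercise both cases (a sketch; the table and data are illustrative):

create table t_dist(a int, b int) distributed by (a);
insert into t_dist select i % 1000, i % 2000 from generate_series(1, 1000000) i;
analyze t_dist;

explain (costs off) select count(distinct a) from t_dist;  -- distinct key = distribution key
explain (costs off) select count(distinct b) from t_dist;  -- distinct key <> distribution key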

my-ship-it commented 1 month ago

Closing, as this is already fixed.

aeschbacherEUF commented 2 weeks ago

Hi Team,

Eddie provided us with a test RPM package containing the single fix from #676 (already cherry-picked and built into v1.6.0): Implement 3-phase aggregation with DEDUP HashAgg for DISTINCT. #676

We conducted some real-data testing, hoping for better performance with optimizer=on (needed for our typical workload) and gp_eager_distinct_dedup=on. Unfortunately, the PostgreSQL optimizer still outperforms ORCA by a factor of two. We've attached the EXPLAIN ANALYZE output for your reference. fix_659_distinct_cdb161.txt

Best regards,

Andreas

edespino commented 2 weeks ago

I am reopening this so there can be a discussion in response to aeschbacherEUF's comment.

aeschbacherEUF commented 2 weeks ago

set gp_eager_distinct_dedup = on;
set optimizer = on;
set optimizer_enable_multiple_distinct_aggs = off;

explain (analyze, verbose, costs, timing, buffers) select count(distinct gsp_id) from eft_com_call_statistic;
                                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=0.00..1252.01 rows=1 width=8) (actual time=9613.853..9613.854 rows=1 loops=1)
   Output: count(DISTINCT gsp_id)
   ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=0.00..1252.01 rows=1 width=8) (actual time=9408.391..9613.841 rows=2 loops=1)
         Output: (PARTIAL count(DISTINCT gsp_id))
         ->  Partial Aggregate  (cost=0.00..1252.01 rows=1 width=8) (actual time=9612.691..9612.693 rows=1 loops=1)
               Output: PARTIAL count(DISTINCT gsp_id)
               ->  Seq Scan on ebi_data.eft_com_call_statistic  (cost=0.00..1103.70 rows=3958250 width=19) (actual time=0.222..940.759 rows=3958373 loops=1)
                     Output: gsp_id
 Settings: effective_cache_size = '8GB', optimizer = 'on'
 Planning:
   Buffers: shared hit=2
 Planning Time: 5.153 ms
   (slice0)    Executor memory: 37K bytes.
   (slice1)    Executor memory: 130705K bytes avg x 2x(0) workers, 130705K bytes max (seg0).
 Memory used:  131072kB
 Optimizer: Pivotal Optimizer (GPORCA)
 Execution Time: 9614.687 ms
(17 rows)

Time: 9620,598 ms (00:09,621)

==================================================================================

set gp_eager_distinct_dedup = on;
set optimizer = on;
set optimizer_enable_multiple_distinct_aggs = off;

explain (analyze, verbose, costs, timing, buffers)
WITH cte_gsp_counts AS (
    SELECT DISTINCT gsp_id
    FROM eft_com_call_statistic
)
SELECT COUNT(*)
FROM cte_gsp_counts;
                                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=0.00..1721.24 rows=1 width=8) (actual time=3438.953..3438.955 rows=1 loops=1)
   Output: count(*)
   ->  Gather Motion 2:1  (slice1; segments: 2)  (cost=0.00..1721.24 rows=1 width=8) (actual time=3423.515..3438.943 rows=2 loops=1)
         Output: (PARTIAL count(*))
         ->  Partial Aggregate  (cost=0.00..1721.24 rows=1 width=8) (actual time=3433.644..3433.645 rows=1 loops=1)
               Output: PARTIAL count(*)
               ->  HashAggregate  (cost=0.00..1721.24 rows=3958250 width=1) (actual time=1993.244..3310.569 rows=3958365 loops=1)
                     Output: gsp_id
                     Group Key: eft_com_call_statistic.gsp_id
                     Planned Partitions: 4
                     work_mem: 262433kB  Segments: 2  Max: 131217kB (segment 0)  Workfile: (2 spilling)
                     ->  Seq Scan on ebi_data.eft_com_call_statistic  (cost=0.00..1103.70 rows=3958250 width=19) (actual time=0.376..915.290 rows=3958373 loops=1)
                           Output: gsp_id
 Settings: effective_cache_size = '8GB', optimizer = 'on'
 Planning:
   Buffers: shared hit=6
 Planning Time: 4.713 ms
   (slice0)    Executor memory: 24626K bytes.
   (slice1)    Executor memory: 123097K bytes avg x 2x(0) workers, 123117K bytes max (seg0).  Work_mem: 131217K bytes max.
 Memory used:  131072kB
 Optimizer: Pivotal Optimizer (GPORCA)
 Execution Time: 3448.772 ms
(22 rows)

Time: 3454,306 ms (00:03,454)

my-ship-it commented 2 weeks ago

Hi @aeschbacherEUF, please try it with optimizer = off;
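
That is, something like this (a sketch):

set optimizer = off;
set gp_eager_distinct_dedup = on;
explain analyze select count(distinct gsp_id) from eft_com_call_statistic;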

aeschbacherEUF commented 2 weeks ago

That will speed up the count distinct, but for a lot of other queries we would need to have optimizer set to on.

my-ship-it commented 2 weeks ago

That will speed up the count distinct, but for a lot of other queries we would need to have optimizer set to on.

Yeah. We also need to enable it for ORCA.

my-ship-it commented 1 week ago

We found ORCA already has the path. Maybe we need to adjust the cost.

jiaqizho commented 1 week ago

My test case is:

create table t1(v1 int, v2 int, v3 int);
insert into t1 values(generate_series(0, 100000), generate_series(100000, 200000), generate_series(200000, 300000));
insert into t1 select * from t1; -- do n times

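The memo dump below was captured via ORCA's trace output (a sketch of the session setup; it assumes the Greenplum-style debug GUC optimizer_print_memo_after_optimization is available in Cloudberry):

set client_min_messages = log;
set optimizer_print_memo_after_optimization = on;
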
and the ORCA memo is:

postgres=# explain analyze WITH UniqueIDs AS (
    SELECT DISTINCT v2
    FROM t1
)
SELECT COUNT(*)
FROM UniqueIDs;
LOG:  statement: explain analyze WITH UniqueIDs AS (
    SELECT DISTINCT v2
    FROM t1
)
SELECT COUNT(*)
FROM UniqueIDs;
LOG:  2024-11-18 14:15:17:640135 CST,THD000,TRACE,"MEMO after optimization (stage 0):
",
2024-11-18 14:15:17:642637 CST,THD000,TRACE,"
Group 19 ():
  0: CScalarConst (1) [ ]

Group 18 ():
  0: CScalarProjectList [ 17 ]

Group 17 ():
  0: CScalarProjectElement "count" (22) [ 16 ]

Group 16 ():
  0: CScalarAggFunc (count , Distinct: false , Aggregate Stage: Global) [ 15 1 1 1 ]

Group 15 ():
  0: CScalarValuesList [ 14 ]

Group 14 ():
  0: CScalarIdent "ColRef_0023" (23) [ ]

Group 13 (#GExprs: 3):
  0: CLogicalGbAgg( Local ) Grp Cols: [][Local], Minimal Grp Cols: [], Generates Duplicates :[ 1 ]  [ 0 12 ]
  1: CPhysicalScalarAgg( Local, multi-stage ) Grp Cols: [], Minimal Grp Cols:[], Generates Duplicates :[ 1 ]  [ 0 12 ]
    Cost Ctxts:
      main ctxt (stage 0)1.0, child ctxts:[1], rows:1.000000 (group), cost: 1093.902454
      main ctxt (stage 0)1.1, child ctxts:[2], rows:1.000000 (group), cost: 1094.007472
  2: CPhysicalMotionGather(master) [ 13 ]
    Cost Ctxts:
      main ctxt (stage 0)0.0, child ctxts:[1], rows:1.000000 (group), cost: 1093.902484
  Grp OptCtxts:
    0 (stage 0): (req CTEs: [0:p(opt) ], req order: [<empty> match: satisfy ], req dist: [SINGLETON (master) match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:2
    1 (stage 0): (req CTEs: [0:p(opt) ], req order: [<empty> match: satisfy ], req dist: [ANY  EOperatorId: 129  match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:1

Group 12 ():
  0: CScalarProjectList [ 11 ]

Group 11 ():
  0: CScalarProjectElement "ColRef_0023" (23) [ 10 ]

Group 10 ():
  0: CScalarAggFunc (count , Distinct: false , Aggregate Stage: Local) [ 1 1 1 1 ]

Group 9 (#GExprs: 5):
  0: CLogicalGbAgg( Local ) Grp Cols: ["v2" (11)][Local], Minimal Grp Cols: ["v2" (11)], Generates Duplicates :[ 1 ]  [ 7 8 ]
  1: CPhysicalStreamAgg( Local, multi-stage ) Grp Cols: ["v2" (11)], Minimal Grp Cols:["v2" (11)], Generates Duplicates :[ 1 ]  [ 7 8 ]
    Cost Ctxts:
      main ctxt (stage 0)1.0, child ctxts:[4], rows:100656.000000 (group), cost: 2719.140939
      main ctxt (stage 0)1.1, child ctxts:[3], rows:100656.000000 (group), cost: 2772.560140
  2: CPhysicalHashAgg( Local, multi-stage ) Grp Cols: ["v2" (11)], Minimal Grp Cols:["v2" (11)], Generates Duplicates :[ 1 ]  (High) [ 7 8 ]
    Cost Ctxts:
      main ctxt (stage 0)1.0, child ctxts:[1], rows:100656.000000 (group), cost: 1089.433563
      main ctxt (stage 0)1.1, child ctxts:[2], rows:100656.000000 (group), cost: 1142.852763
  3: CPhysicalMotionHashDistribute HASHED: [ CScalarIdent "v2" (11), nulls colocated ], opfamilies: (1977,1.0), [ 9 ]
    Cost Ctxts:
      main ctxt (stage 0)0.0, child ctxts:[1], rows:100656.000000 (group), cost: 1089.853634
  4: CPhysicalSort  ( (97,1.0), "v2" (11), NULLsLast )  [ 9 ]
    Cost Ctxts:
      main ctxt (stage 0)2.0, child ctxts:[0], rows:100656.000000 (group), cost: 1101.293981
  Grp OptCtxts:
    0 (stage 0): (req CTEs: [0:p(opt) ], req order: [<empty> match: satisfy ], req dist: [HASHED: [ CScalarIdent "v2" (11), nulls colocated ], opfamilies: (1977,1.0), match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:3
    1 (stage 0): (req CTEs: [0:p(opt) ], req order: [<empty> match: satisfy ], req dist: [ANY  EOperatorId: 131  match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:2
    2 (stage 0): (req CTEs: [0:p(opt) ], req order: [( (97,1.0), "v2" (11), NULLsLast )  match: satisfy ], req dist: [HASHED: [ CScalarIdent "v2" (11), nulls colocated ], opfamilies: (1977,1.0), match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:4

Group 8 ():
  0: CScalarProjectList [ ]

Group 7 (#GExprs: 5):
  0: CLogicalGet "t1" ("t1"), Columns: ["v1" (12), "v2" (11), "v3" (13), "ctid" (14), "xmin" (15), "cmin" (16), "xmax" (17), "cmax" (18), "tableoid" (19), "gp_segment_id" (20), "gp_foreign_server" (21)] Key sets: {[3,9]} [ ]
  1: CPhysicalTableScan "t1" ("t1") [ ]
    Cost Ctxts:
      main ctxt (stage 0)1.0, child ctxts:[], rows:12800128.000000 (group), cost: 538.947746
  2: CPhysicalMotionHashDistribute HASHED: [ CScalarIdent "v2" (11), nulls colocated ], opfamilies: (1977,1.0), [ 7 ]
    Cost Ctxts:
      main ctxt (stage 0)0.0, child ctxts:[1], rows:12800128.000000 (group), cost: 624.111264
  3: CPhysicalMotionRandom [ 7 ]
    Cost Ctxts:
      main ctxt (stage 0)2.0, child ctxts:[1], rows:12800128.000000 (group), cost: 624.111264
  4: CPhysicalSort  ( (97,1.0), "v2" (11), NULLsLast )  [ 7 ]
    Cost Ctxts:
      main ctxt (stage 0)4.0, child ctxts:[1], rows:12800128.000000 (group), cost: 2701.998811
      main ctxt (stage 0)5.0, child ctxts:[0], rows:12800128.000000 (group), cost: 2755.418012
      main ctxt (stage 0)3.0, child ctxts:[2], rows:12800128.000000 (group), cost: 2755.418012
  Grp OptCtxts:
    0 (stage 0): (req CTEs: [0:p(opt) ], req order: [<empty> match: satisfy ], req dist: [HASHED: [ CScalarIdent "v2" (11), nulls colocated ], opfamilies: (1977,1.0), match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:2
    1 (stage 0): (req CTEs: [0:p(opt) ], req order: [<empty> match: satisfy ], req dist: [ANY  EOperatorId: 131  match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:1
    2 (stage 0): (req CTEs: [0:p(opt) ], req order: [<empty> match: satisfy ], req dist: [RANDOM match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:3
    3 (stage 0): (req CTEs: [0:p(opt) ], req order: [( (97,1.0), "v2" (11), NULLsLast )  match: satisfy ], req dist: [RANDOM match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:4
    4 (stage 0): (req CTEs: [0:p(opt) ], req order: [( (97,1.0), "v2" (11), NULLsLast )  match: satisfy ], req dist: [ANY  EOperatorId: 136  match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:4
    5 (stage 0): (req CTEs: [0:p(opt) ], req order: [( (97,1.0), "v2" (11), NULLsLast )  match: satisfy ], req dist: [HASHED: [ CScalarIdent "v2" (11), nulls colocated ], opfamilies: (1977,1.0), match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:4

ROOT
Group 6 (#GExprs: 3):
  0: CLogicalCTEAnchor (0) [ 5 ]
  1: CLogicalSelect [ 5 19 ]
  2: CPhysicalFilter [ 5 19 ]
    Cost Ctxts:
      main ctxt (stage 0)0.1, child ctxts:[0], rows:1.000000 (group), cost: 1093.902485 -- group 5 outer 
  Grp OptCtxts:
    0 (stage 0): (req CTEs: [0:p(opt) ], req order: [<empty> match: satisfy ], req dist: [SINGLETON (master) match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:2

Group 5 (#GExprs: 4):
  0: CLogicalGbAgg( Global ) Grp Cols: [][Global], Minimal Grp Cols: [], Generates Duplicates :[ 0 ]  [ 0 4 ]
  1: CLogicalGbAgg( Global ) Grp Cols: [][Global], Minimal Grp Cols: [], Generates Duplicates :[ 1 ]  [ 13 18 ]
  2: CPhysicalScalarAgg( Global, multi-stage ) Grp Cols: [], Minimal Grp Cols:[], Generates Duplicates :[ 1 ]  [ 13 18 ]
    Cost Ctxts:
      main ctxt (stage 0)0.1, child ctxts:[0], rows:1.000000 (group), cost: 1093.902485
  3: CPhysicalScalarAgg( Global ) Grp Cols: [], Minimal Grp Cols:[], Generates Duplicates :[ 0 ]  [ 0 4 ]
    Cost Ctxts:
      main ctxt (stage 0)0.1, child ctxts:[0], rows:1.000000 (group), cost: 1094.277565
  Grp OptCtxts:
    0 (stage 0): (req CTEs: [0:p(opt) ], req order: [<empty> match: satisfy ], req dist: [ANY  EOperatorId: 101  match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:2

Group 4 ():
  0: CScalarProjectList [ 3 ]

Group 3 ():
  0: CScalarProjectElement "count" (22) [ 2 ]

Group 2 ():
  0: CScalarAggFunc (count , Distinct: false , Aggregate Stage: Global) [ 1 1 1 1 ]

Group 1 ():
  0: CScalarValuesList [ ]

Group 0 (#GExprs: 10):
  0: CLogicalCTEConsumer (0), Columns: ["v2" (11)] [ ]
  1: CLogicalGbAgg( Global ) Grp Cols: ["v2" (11)][Global], Minimal Grp Cols: ["v2" (11)], Generates Duplicates :[ 0 ]  [ 7 8 ]
  2: CLogicalGbAgg( Global ) Grp Cols: ["v2" (11)][Global], Minimal Grp Cols: ["v2" (11)], Generates Duplicates :[ 1 ]  [ 9 8 ]
  3: CPhysicalStreamAgg( Global, multi-stage ) Grp Cols: ["v2" (11)], Minimal Grp Cols:["v2" (11)], Generates Duplicates :[ 1 ]  [ 9 8 ]
    Cost Ctxts:
      main ctxt (stage 0)1.1, child ctxts:[2], rows:100656.000000 (group), cost: 1101.447012
  4: CPhysicalHashAgg( Global, multi-stage ) Grp Cols: ["v2" (11)], Minimal Grp Cols:["v2" (11)], Generates Duplicates :[ 1 ]  (High) [ 9 8 ]
    Cost Ctxts:
      main ctxt (stage 0)1.1, child ctxts:[0], rows:100656.000000 (group), cost: 1093.902454
  5: CPhysicalStreamAgg( Global ) Grp Cols: ["v2" (11)], Minimal Grp Cols:["v2" (11)], Generates Duplicates :[ 0 ]  [ 7 8 ]
    Cost Ctxts:
      main ctxt (stage 0)1.1, child ctxts:[5], rows:100656.000000 (group), cost: 2772.503672
  6: CPhysicalHashAgg( Global ) Grp Cols: ["v2" (11)], Minimal Grp Cols:["v2" (11)], Generates Duplicates :[ 0 ]  (High) [ 7 8 ]
    Cost Ctxts:
      main ctxt (stage 0)1.1, child ctxts:[0], rows:100656.000000 (group), cost: 1136.613079
  7: CPhysicalCTEConsumer (0), Columns: ["v2" (11)] [ ]
    Cost Ctxts:
  8: CPhysicalMotionGather(master) [ 0 ]
    Cost Ctxts:
      main ctxt (stage 0)0.0, child ctxts:[1], rows:100656.000000 (group), cost: 1094.277565
  9: CPhysicalMotionRandom [ 0 ]
    Cost Ctxts:
      main ctxt (stage 0)2.0, child ctxts:[1], rows:100656.000000 (group), cost: 1094.007472
  Grp OptCtxts:
    0 (stage 0): (req CTEs: [0:p(opt) ], req order: [<empty> match: satisfy ], req dist: [SINGLETON (master) match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:8
    1 (stage 0): (req CTEs: [0:p(opt) ], req order: [<empty> match: satisfy ], req dist: [ANY  EOperatorId: 129  match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:4
    2 (stage 0): (req CTEs: [0:p(opt) ], req order: [<empty> match: satisfy ], req dist: [RANDOM match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:9
",
                                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=0.00..1093.90 rows=1 width=8) (actual time=3749.832..3749.845 rows=1 loops=1)
   ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=0.00..1093.90 rows=1 width=8) (actual time=3744.143..3749.815 rows=3 loops=1)
         ->  Partial Aggregate  (cost=0.00..1093.90 rows=1 width=8) (actual time=3748.275..3748.276 rows=1 loops=1)
               ->  HashAggregate  (cost=0.00..1093.90 rows=33552 width=1) (actual time=3728.835..3738.960 rows=33501 loops=1)
                     Group Key: v2
                     ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=0.00..1089.85 rows=33552 width=4) (actual time=3277.019..3706.719 rows=33501 loops=1)
                           Hash Key: v2
                           ->  Streaming HashAggregate  (cost=0.00..1089.43 rows=33552 width=4) (actual time=3694.054..3706.532 rows=33462 loops=1)
                                 Group Key: v2
                                 ->  Seq Scan on t1  (cost=0.00..538.95 rows=4266710 width=4) (actual time=0.175..1717.717 rows=4283136 loops=1)
 Planning Time: 60.859 ms
   (slice0)    Executor memory: 3139K bytes.
   (slice1)    Executor memory: 2617K bytes avg x 3x(0) workers, 2623K bytes max (seg2).  Work_mem: 3601K bytes max.
   (slice2)    Executor memory: 2615K bytes avg x 3x(0) workers, 2619K bytes max (seg0).  Work_mem: 3601K bytes max.
 Memory used:  128000kB
 Optimizer: Pivotal Optimizer (GPORCA)
 Execution Time: 3762.223 ms
(17 rows)

Then the lowest-cost path is:

ROOT -> 5 9 * [0] (CPhysicalFilter)
    -> 5 * 0  (CPhysicalScalarAgg)
        -> 13 * 0  (CPhysicalMotionGather)
            -> 0 * 1 (CPhysicalHashAgg)
                -> 9 * 0 (CPhysicalMotionHashDistribute)
                  -> 9 * 1 (CPhysicalHashAgg)
                    -> 7 * 1 (CPhysicalTableScan)

Let us look at group 7 (there is only one path to do the seq scan):

Group 7 (#GExprs: 5):
  ...
  1: CPhysicalTableScan "t1" ("t1") [ ]
    Cost Ctxts:
      main ctxt (stage 0)1.0, child ctxts:[], rows:12800128.000000 (group), cost: 538.947746
  ...
  1 (stage 0): (req CTEs: [0:p(opt) ], req order: [<empty> match: satisfy ], req dist: [ANY  EOperatorId: 131  match: satisfy], req rewind: [], req rewind: [NONE NO-MOTION match: satisfy], req partition propagation: [<empty> match: satisfy ]) => Best Expr:1

Then the paths below can add group 7, expr 1 as their child:

  • group 7, expr 2 - CPhysicalMotionHashDistribute
  • group 7, expr 3 - CPhysicalMotionRandom
  • group 7, expr 4 - Sort (high cost)
  • group 9, expr 1 - CPhysicalHashAgg

The parent paths of group 9, expr 1 are the same as for group 7, expr 1, and only the CPhysicalMotionHashDistribute/CPhysicalMotionRandom/Sort paths exist.

So I guess we need to add a path in ORCA...

my-ship-it commented 6 days ago

So I guess we need to add a path in ORCA...

Good analysis, yes, I think so...