StarRocks / starrocks

StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries.
https://starrocks.io
Apache License 2.0

multi_distinct_count function performance issues #36695

Open zenoyang opened 9 months ago

zenoyang commented 9 months ago

Enhancement

We have a query with multiple count distinct metrics and a LIMIT. Because of the limit, the optimizer did not rewrite it into a group by + count query plan and instead used the multi_distinct_count function for deduplication, but the performance is so poor that the query times out.

For example:

select
entrance, count(distinct event_id) as event_num
from db.tbl
where dt = '20231129'
group by 1

This query is fast, with the BEs returning results in tens of seconds.

select
entrance, multi_distinct_count(event_id) as event_num
from db.tbl
where dt = '20231129'
group by 1

This query is very slow and times out after 50 minutes.

Note: event_id is a varchar column with a cardinality of about 300 million.
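For context on why the two plans behave so differently: the group by + count rewrite deduplicates (entrance, event_id) pairs first and then counts rows per entrance, while multi_distinct_count accumulates one hash set of event_ids per group and must merge whole sets in the final stage. A minimal Python sketch of the two strategies (toy data, not StarRocks code):

```python
from collections import defaultdict

# Toy rows of (entrance, event_id), standing in for db.tbl.
rows = [("home", "e1"), ("home", "e2"), ("home", "e1"), ("search", "e3")]

# Strategy 1: the group by + count rewrite -- dedup the
# (entrance, event_id) pairs first (cheap, shuffles well),
# then count deduplicated rows per entrance.
deduped = set(rows)
rewrite_counts = defaultdict(int)
for entrance, _event_id in deduped:
    rewrite_counts[entrance] += 1

# Strategy 2: multi_distinct_count -- each group keeps its own
# hash set of event_ids; the final stage must merge entire sets.
per_group_sets = defaultdict(set)
for entrance, event_id in rows:
    per_group_sets[entrance].add(event_id)
distinct_counts = {k: len(v) for k, v in per_group_sets.items()}

# Both strategies agree on the answer; they differ only in cost.
assert dict(rewrite_counts) == distinct_counts
```

With a low-cardinality group key (entrance) and a very high-cardinality distinct key (event_id), strategy 2 concentrates the work into a few huge hash sets, which matches the profile below.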

profile is as follows:

| Fragment 1                                                                                                                                        |
| │   BackendNum: 1                                                                                                                                 |
| │   InstancePeakMemoryUsage: 20.176 GB, InstanceAllocatedMemoryUsage: 21.414 GB                                                                   |
| │   PrepareTime: 234.070us                                                                                                                        |
| └──DATA_STREAM_SINK (id=5)                                                                                                                        |
|    │   PartitionType: UNPARTITIONED                                                                                                               |
|    └──AGGREGATION[finalize, merge]                                                                                                           |
|       │   Estimates: [row: 236946618, cpu: ?, memory: ?, network: ?, cost: 1.9903515936600003E11]                                            |
|       │   TotalTime: 44m31s (84.87%) [CPUTime: 44m31s]                                                                                       |
|       │   OutputRows: 0                                                                                                                      |
|       │   PeakMemory: 6.125 GB, AllocatedMemory: 10.785 GB                                                                                   |
|       │   AggExprs: [multi_distinct_count(56: multi_distinct_count)]                                                                         |
|       │   GroupingExprs: [47: entrance]                                                                                          |
|       │   Detail Timers:                                                                                                                     |
|       │       AggComputeTime: 5m33s [min=0ns, max=44m31s]                                                                                    |
|       │       AggFuncComputeTime: 5m33s [min=0ns, max=44m31s]                                                                                |
|       └──EXCHANGE[SHUFFLE]                                                                                                                        |
|              Estimates: [row: 473893236, cpu: ?, memory: ?, network: ?, cost: 1.8197500285200003E11]                                              |
|              TotalTime: 2s313ms (0.07%) [CPUTime: 2s312ms, NetworkTime: 655.616us]                                                                |
|              OutputRows: 2                                                                                                                        |
|              PeakMemory: 3.500 GB, AllocatedMemory: 28.001 GB                                                                                     |
|                                                                                                                                                       |
| Fragment 2                                                                                                                                        |
| │   BackendNum: 1                                                                                                                                 |
| │   InstancePeakMemoryUsage: 46.686 GB, InstanceAllocatedMemoryUsage: 444.283 GB                                                                  |
| │   PrepareTime: 10.081ms                                                                                                                         |
| └──DATA_STREAM_SINK (id=3)                                                                                                                        |
|    │   PartitionType: HASH_PARTITIONED                                                                                                            |
|    │   PartitionExprs: [47: entrance]                                                                                                 |
|    └──AGGREGATION[serialize, update]                                                                                                              |
|       │   Estimates: [row: 473893236, cpu: ?, memory: ?, network: ?, cost: 1.5922812750000003E11]                                                 |
|       │   TotalTime: 7m22s (14.06%) [CPUTime: 7m22s]                                                                                              |
|       │   OutputRows: 8                                                                                                                           |
|       │   PeakMemory: 1.985 GB, AllocatedMemory: 55.153 GB                                                                                        |
|       │   AggExprs: [multi_distinct_count(1: event_id)]                                                                                           |
|       │   GroupingExprs: [47: entrance]                                                                                               |
|       │   SubordinateOperators:                                                                                                                   |
|       │       LOCAL_EXCHANGE [Passthrough]                                                                                                        |
|       └──PROJECT                                                                                                                                  |
|          │   Estimates: [row: ?, cpu: ?, memory: ?, network: ?, cost: ?]                                                                          |
|          │   TotalTime: 46.289ms (0.00%) [CPUTime: 46.289ms]                                                                                      |
|          │   OutputRows: 338.743M (338742764)                                                                                                     |
|          │   Expression: [1: event_id, 47: entrance]                                                                                  |
|          └──HDFS_SCAN                                                                                                                             |
|                 Estimates: [row: 947786473, cpu: 30329167144.00, memory: 30329167144.00, network: 30329167144.00, cost: 121316668576.00]          |
|                 TotalTime: 31s94ms (0.99%) [CPUTime: 227.344ms, ScanTime: 30s867ms]                                                               |
|                 OutputRows: 338.743M (338742764)                                                                                                  |
|                 PeakMemory: 6.103 GB, AllocatedMemory: 373.265 GB                                                                                 |
|                 SubordinateOperators:                                                                                                             |
|                     CHUNK_ACCUMULATE

The perf analysis results are as follows (perf profile screenshot):

The main bottleneck is the phmap::priv::raw_hash_set<phmap::priv::FlatHashSetPolicy<starrocks::SliceWithHash>, starrocks::HashOnSliceWithHash, starrocks::EqualOnSliceWithHash, std::allocator<starrocks::SliceWithHash>>::prepare_insert method, called from the starrocks::DistinctAggregateState<(starrocks::LogicalType)13, (starrocks::LogicalType)13, int>::deserialize_and_merge method.
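This matches the 44m31s finalize/merge stage in Fragment 1: the update stage serializes each instance's per-group distinct set, and the merge stage must re-hash and re-insert every element into the destination set, so with ~300 million distinct varchar values the final stage performs hundreds of millions of hash-table inserts (the prepare_insert hot path). A rough Python sketch of that merge path (conceptual only, not the phmap implementation):

```python
# Conceptual sketch of the serialize/update -> merge/finalize flow
# for multi_distinct_count; function names here are illustrative.
def serialize(distinct_set):
    # update stage: emit the group's set as a flat payload
    return list(distinct_set)

def deserialize_and_merge(dest, payload):
    # merge stage: every element is re-hashed and re-inserted into
    # dest -- with ~300M varchar values this dominates the runtime
    for item in payload:
        dest.add(item)

# Two upstream instances each send a partial set for the same group.
partials = [serialize({"e1", "e2"}), serialize({"e2", "e3"})]
merged = set()
for payload in partials:
    deserialize_and_merge(merged, payload)

assert len(merged) == 3  # the group's multi_distinct_count result
```

Note the merge work is proportional to the total number of serialized elements, not to the number of groups, which is why a single backend in Fragment 1 spends 44 minutes of CPU time there.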

Youngwb commented 9 months ago

Try this: set new_planner_agg_stage = 3

zenoyang commented 9 months ago

Try this: set new_planner_agg_stage = 3

Still very slow; in our production environment new_planner_agg_stage is already set to 4.

github-actions[bot] commented 3 months ago

We have marked this issue as stale because it has been inactive for 6 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to StarRocks!

839224346 commented 3 months ago

Mark

lipenglin commented 3 months ago

@imay PTAL

LiShuMing commented 1 week ago

Which version are you using? Can you try this PR (https://github.com/StarRocks/starrocks/pull/40549), which improves the count distinct rewrite strategy?

zenoyang commented 5 days ago

Which version are you using? Can you try this PR (#40549), which improves the count distinct rewrite strategy?

Version 3.2.8. Yes, we have tried this PR. After enabling the prefer_cte_rewrite parameter, most count distinct queries are rewritten to group by + count execution, and the performance is good enough.

However, there are still a few cases where the plan cannot be rewritten, for example when the distinct column contains a complex CASE WHEN expression. @LiShuMing

zenoyang commented 5 days ago

Additional note: if the rewrite is not possible (so multi_distinct_count is still executed), it can easily cause a BE OOM. For now we rely on a large-query fuse (circuit breaker) to avoid this.
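A back-of-the-envelope check of why the non-rewritten plan risks OOM: ~300 million distinct varchar values, each stored as a slice plus hash-set slot overhead, already lands in the tens of GB for a single in-memory set, consistent with the ~46 GB instance peak in the profile. A rough Python estimate (the per-entry sizes below are assumptions, not measured from StarRocks):

```python
# Assumed sizes: average event_id length plus per-slot overhead
# (pointer, length, cached hash, load-factor slack) in a flat hash set.
distinct_values = 300_000_000
avg_value_bytes = 32      # assumed average varchar length
per_slot_overhead = 32    # assumed slot + metadata overhead

total_gb = distinct_values * (avg_value_bytes + per_slot_overhead) / 1024**3
print(f"~{total_gb:.1f} GB for one in-memory distinct set")  # ~17.9 GB
```

Since the serialize/merge path also holds serialized payloads alongside the destination set, the real peak is a multiple of this, which is why a memory-based fuse is currently the only practical guard.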