StarRocks / starrocks

The world's fastest open query engine for sub-second analytics both on and off the data lakehouse. With the flexibility to support nearly any scenario, StarRocks provides best-in-class performance for multi-dimensional analytics, real-time analytics, and ad-hoc queries. A Linux Foundation project.
https://starrocks.io
Apache License 2.0
9.25k stars 1.84k forks source link

Slow performance with filter+group by+order operators #38540

Open sergeyshaykhullin opened 11 months ago

sergeyshaykhullin commented 11 months ago

With order by query profile contains extra data shuffling, but where + group by retain just single row

Below just an example with required columns and single day of aggregation, in production environment +- 40 colums and wide date ranges

Query time drops from 400ms to 1800+ms with the same result

Steps to reproduce the behavior (Required)

1.

CREATE TABLE `v1_product_histories` (
  `created_at` date NOT NULL COMMENT "",
  `product_external_id` int(11) NOT NULL COMMENT "",
  `orders_cost` int(11) NOT NULL COMMENT ""
) ENGINE=OLAP 
DUPLICATE KEY(`created_at`, `product_external_id`)
COMMENT "OLAP"
PARTITION BY date_trunc('day', created_at)
DISTRIBUTED BY HASH(`product_external_id`)
PROPERTIES (
  "replication_num" = "1",
  "datacache.enable" = "true",
  "storage_volume" = "builtin_storage_volume",
  "enable_async_write_back" = "false",
  "enable_persistent_index" = "false",
  "compression" = "LZ4"
);
  1. Populate with data

  2. 400ms

    select
    product_external_id,
    sum(orders_cost) as orders_cost
    from v1_product_histories
    where created_at >= '2024-01-04' and created_at <= '2024-01-04' and product_external_id = 123
    group by product_external_id
    limit 200
  3. 1800+ms

    select
    product_external_id,
    sum(orders_cost) as orders_cost
    from v1_product_histories
    where created_at >= '2024-01-04' and created_at <= '2024-01-04' and product_external_id = 123
    group by product_external_id
    -- HERE
    order by orders_cost
    limit 200

Expected behavior (Required)

Ordering happen with no cost on single row and without extra shuffilng

Summary
    QueryId: ddd1bb88-aba4-11ee-a1ef-42c1c06ca0c3
    Version: 3.2.2-269e832
    State: Finished
    TotalTime: 282ms
        ExecutionTime: 225.124ms [Scan: 6.463ms (2.87%), Network: 127.735ms (56.74%), ResultDeliverTime: 0ns (0.00%), ScheduleTime: 1.348ms (0.60%)]
        CollectProfileTime: 22ms
        FrontendProfileMergeTime: 8.972ms
...

Real behavior (Required)

Summary
    QueryId: 0b5be7f5-aba5-11ee-a1ef-42c1c06ca0c3
    Version: 3.2.2-269e832
    State: Finished
    TotalTime: 1s560ms
        ExecutionTime: 464.239ms [Scan: 8.641ms (1.86%), Network: 823.937ms (177.48%), ResultDeliverTime: 0ns (0.00%), ScheduleTime: 4.962ms (1.07%)]
        CollectProfileTime: 954ms
        FrontendProfileMergeTime: 86.840ms
...

StarRocks version (Required)

3.2.2-269e832

Non default variables

    NonDefaultVariables:
        character_set_results: utf8 -> NULL
        enable_adaptive_sink_dop: false -> true
        enable_async_profile: true -> false
        enable_connector_adaptive_io_tasks: false -> true
        enable_iceberg_column_statistics: false -> true
        enable_profile: false -> true
        enable_query_cache: false -> true
        enable_sort_aggregate: false -> true
        enable_spill: false -> true
        query_cache_force_populate: false -> true
        spill_operator_min_bytes: 52428800 -> 10485760
        sql_mode_v2: 32 -> 2097184
        sql_select_limit: 9223372036854775807 -> 20000
stdpain commented 5 months ago

For only group by product_external_id we only need build a hash map with 200 distinct values. but group by a, order by b limit we need collect all input from upstream.

stdpain commented 5 months ago

It's not a bug. closed.

sergeyshaykhullin commented 5 months ago

@stdpain But there is a predicate product_external_id = 123

That means that hash map has only 1 element

And second case should order by single row

stdpain commented 5 months ago

@sergeyshaykhullin can u provide the query profile for then?

sergeyshaykhullin commented 5 months ago

@stdpain Queries and schema are in issue, 3 and 4

stdpain commented 5 months ago

@sergeyshaykhullin can u run with set enable_parallel_merge=false;

stdpain commented 5 months ago

similar with https://github.com/StarRocks/starrocks/pull/35899