apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0
1.22k stars 437 forks

[CH] Performance regression with lazy expand enabled when there are high-cardinality grouping keys #7986

Closed lgbo-ustc closed 2 days ago

lgbo-ustc commented 2 days ago

Backend

CH (ClickHouse)

Bug description

Lazy expand (#7647) works well on low-cardinality grouping keys, but there is a performance regression when the grouping keys have high cardinality. For example, TPC-DS q67 runs slower, since i_product_name is a high-cardinality column. This is a problem we discussed in #7647; let's see what we can do to improve these cases.

image

Obviously, with lazy expand enabled, the first aggregate stage generates more rows. This increases the execution time of the shuffle and of the second aggregate stage. We use a simple algorithm to decide whether to aggregate the data coming from the expand operator, and in this case it results in no rows from the expand operator being aggregated.
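To make the row-count difference concrete, here is a toy model of a two-key ROLLUP. It is only a sketch of the general idea, not the Gluten/CH implementation: the function names, the `reaggregate` flag, and the sample data are all hypothetical, and it counts only the rows the first aggregate stage would emit.

```python
def rollup_grouping_sets(keys):
    """ROLLUP(k1..kn) -> [(k1..kn), (k1..kn-1), ..., ()]."""
    return [tuple(keys[:i]) for i in range(len(keys), -1, -1)]


def project(row, keys, grouping_set, set_id):
    """Null out keys that are not in the grouping set and tag the row with a grouping id."""
    return tuple(row[k] if k in grouping_set else None for k in keys) + (set_id,)


def eager_expand_rows(rows, keys):
    """Expand first, then aggregate: one output row per distinct expanded key."""
    sets = rollup_grouping_sets(keys)
    return len({project(r, keys, s, i) for r in rows for i, s in enumerate(sets)})


def lazy_expand_rows(rows, keys, reaggregate):
    """Aggregate on the full key set first, then expand the aggregated rows.

    `reaggregate` is a hypothetical switch: when False, the expanded rows are
    passed through without a second aggregation (the case described above).
    """
    sets = rollup_grouping_sets(keys)
    full_groups = {tuple(r[k] for k in keys) for r in rows}
    expanded = [
        project(dict(zip(keys, g)), keys, s, i)
        for g in full_groups
        for i, s in enumerate(sets)
    ]
    return len(set(expanded)) if reaggregate else len(expanded)


keys = ["category", "product"]
# High-cardinality case: every row has a distinct product.
high = [{"category": i % 2, "product": i} for i in range(1000)]

print(eager_expand_rows(high, keys))                    # 1003
print(lazy_expand_rows(high, keys, reaggregate=False))  # 3000
print(lazy_expand_rows(high, keys, reaggregate=True))   # 1003
```

In this model, when the expanded rows are not aggregated again, the first stage emits one row per full-key group per grouping set, which is roughly three times what the eager plan emits for the high-cardinality input.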

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

lgbo-ustc commented 2 days ago

The time spent in the sink increases significantly.

lgbo-ustc commented 2 days ago

q67 is a special case. The input and output row counts of the aggregation on the full set of grouping keys are almost the same. As a result, the execution time is almost the same with lazy expand enabled or disabled, because the hash table does the same number of lookups and insertions.

image

Enabling lazy expand can only improve performance when the aggregation outputs fewer rows than it takes in.
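A small sketch of why the reduction matters: the function below (a hypothetical helper, not code from Gluten or ClickHouse) counts hash table lookups and insertions for a plain hash aggregation. When almost every key is distinct, insertions are about equal to lookups and the output is about as large as the input, so aggregating before the expand saves nothing.

```python
def hash_aggregate_stats(keys):
    """Count lookups, insertions, and output rows of a simple hash aggregation."""
    table = {}
    lookups = 0
    insertions = 0
    for key in keys:
        lookups += 1          # every input row probes the table once
        if key not in table:
            table[key] = 0
            insertions += 1   # a miss creates a new group
        table[key] += 1       # update the aggregate state (a count here)
    return {
        "input_rows": lookups,
        "insertions": insertions,
        "output_rows": len(table),
    }


# High cardinality: every key is distinct, so nothing is reduced.
print(hash_aggregate_stats(range(1000)))
# Low cardinality: 1000 input rows collapse into 10 groups.
print(hash_aggregate_stats(i % 10 for i in range(1000)))
```

The first call reports 1000 output rows for 1000 input rows; the second reports 10, which is the kind of input where aggregating before expanding pays off.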

lgbo-ustc commented 2 days ago

As for the performance degradation in the sink, I guess it's related to the growth in the number of blocks. The total number of input rows is the same, but with lazy expand enabled they are stored in more blocks.