The sink time increases significantly.

q67 is a special case: the input and output row counts of the aggregation over the full grouping-key set are almost the same, so the execution time is nearly identical whether lazy expand is enabled or disabled, because the hash-table lookups and insertions are equal either way. Enabling lazy expand can only improve performance when the aggregation's output row count is smaller than its input row count.

As for the performance degradation in the sink, I guess it's related to the growth in the number of blocks. The total input row count is the same, but the rows are stored in more blocks when lazy expand is enabled.
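A minimal back-of-the-envelope sketch of the two points above (the functions, row counts, and the per-block cost constant are my own illustrative assumptions, not measurements from q67 or actual Gluten/CH code):

```python
# Illustrative sketch only: all constants are assumptions, not q67 measurements.

def agg_hash_ops(input_rows: int, distinct_groups: int) -> tuple[int, int]:
    """One hash-table lookup per input row; one insertion per new group."""
    return input_rows, distinct_groups

# q67's aggregation over the full grouping-key set: almost every row is its
# own group, so lookup/insertion counts are ~the same with lazy expand on
# or off, and so is the aggregation time.
print(agg_hash_ops(10_000_000, 9_900_000))   # (10000000, 9900000)

def sink_cost(total_rows: int, num_blocks: int,
              per_row: float = 1.0, per_block: float = 1_000.0) -> float:
    """Same total rows, but more blocks means more per-block fixed overhead."""
    return total_rows * per_row + num_blocks * per_block

print(sink_cost(10_000_000, 1_000))   # lazy expand off: fewer, larger blocks
print(sink_cost(10_000_000, 8_000))   # lazy expand on: more, smaller blocks
```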
Backend
CH (ClickHouse)
Bug description
Lazy expand (#7647) works well on low-cardinality grouping keys, but there is a performance regression when the grouping keys have high cardinality. For example, TPC-DS q67 runs slower, since `i_product_name` is a high-cardinality column. This is a problem we discussed in #7647; let's see what we can do to improve this case. Obviously, with lazy expand enabled, the first aggregate stage generates more rows, which increases the execution time of the shuffle and of the second aggregate stage. We use a simple algorithm to decide whether to aggregate data coming from the expand operator, and it currently results in no rows being aggregated from the expand operator.
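To make the row-count argument concrete, here is a rough model of how many rows each plan shuffles. This is my own sketch, not Gluten/CH code; the function names, grouping-set count, and reduction ratios are all assumptions for illustration:

```python
# Assumed model: "normal" plan is Expand -> partial Aggregate -> shuffle;
# "lazy expand" plan is partial Aggregate (full key set) -> Expand -> shuffle.

def shuffled_rows_normal(input_rows: int, grouping_sets: int,
                         partial_agg_reduction: float) -> int:
    # Expand copies every row once per grouping set, then the partial
    # aggregate collapses duplicates within each set before the shuffle.
    return int(input_rows * grouping_sets * partial_agg_reduction)

def shuffled_rows_lazy(input_rows: int, grouping_sets: int,
                       full_key_reduction: float) -> int:
    # The full-key aggregate runs first; per the heuristic described above,
    # rows coming out of Expand are not aggregated again before the shuffle.
    return int(input_rows * full_key_reduction) * grouping_sets

# Low-cardinality grouping keys: the full-key aggregate collapses most rows,
# so lazy expand shuffles far fewer rows.
print(shuffled_rows_normal(10_000_000, 8, 0.10))  # 8,000,000
print(shuffled_rows_lazy(10_000_000, 8, 0.01))    # 800,000

# High-cardinality keys (q67: i_product_name): the full-key aggregate barely
# collapses anything, so Expand multiplies nearly the whole input.
print(shuffled_rows_normal(10_000_000, 8, 0.30))  # 24,000,000
print(shuffled_rows_lazy(10_000_000, 8, 0.95))    # 76,000,000
```

Under this model, lazy expand only wins when the full-key pre-aggregation collapses most of the input, which matches the low- vs. high-cardinality behavior described above.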
Spark version
None
Spark configurations
No response
System information
No response
Relevant logs
No response