apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0

[CH] AQE seems not working on skewed partitions #7114

Open zhanglistar opened 2 weeks ago

zhanglistar commented 2 weeks ago

Description

SQL: d_1075_0.sql. The data itself is skewed on column type: about 65% of the values fall on one type. Vanilla Spark works fine, with a 45s max processing time. Gluten with the CH backend does not seem to handle the skew: the biggest task takes about 1.76 hours.

vanilla:

image

gluten:

image
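
To illustrate why a 65% hot value produces one oversized task, here is a minimal sketch in plain Python (not Spark). The row counts and the number of distinct types are hypothetical, and Python's hash() stands in for Spark's hash function. The point is only that hash partitioning sends all rows with the same key to the same partition, so adding partitions cannot split the hot key.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count rows per partition when rows are hash-partitioned on their key."""
    sizes = Counter()
    for key in keys:
        # Assumption: hash() stands in for the engine's hash function.
        # Any deterministic hash sends equal keys to the same partition.
        sizes[hash(key) % num_partitions] += 1
    return sizes


total_rows = 100_000
# Hypothetical data: 65% of rows share type 1, the rest are spread
# evenly over 10,000 other type values.
keys = [1] * 65_000 + [2 + i % 10_000 for i in range(35_000)]

sizes = partition_sizes(keys, num_partitions=1000)
largest = max(sizes.values())
print(f"largest partition holds {largest / total_rows:.1%} of all rows")
```

Even with 1000 partitions, the partition that receives the hot type holds at least 65% of the rows, which is why a skew-handling rule (or a different join strategy) is needed rather than a higher partition count.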
lgbo-ustc commented 2 weeks ago

There are two adaptive rules to handle partition skew: OptimizeSkewedJoin and OptimizeSkewInRebalancePartitions.

OptimizeSkewedJoin only works on SortMergeJoinExec and ShuffledHashJoinExec. (Why does broadcast hash join not work?) After disabling broadcast join, the partition skew disappears and the partitions are more balanced.

image

OptimizeSkewedJoin cannot be applied in this case, because the join algorithm is broadcast hash join. Could it be shuffled hash join in vanilla Spark? I don't think so: the final plan says it's broadcast hash join too.

OptimizeSkewInRebalancePartitions doesn't apply either, because shuffle.shuffleOrigin is ENSURE_REQUIREMENTS. OptimizeSkewInRebalancePartitions only supports REBALANCE_PARTITIONS_BY_NONE and REBALANCE_PARTITIONS_BY_COL.

image image

24/09/05 13:09:35.153 ERROR [main] DebugOptimizeSkewInRebalancePartitions: xxx ColumnarExchange hashpartitioning(type#51L, 1000), ENSURE_REQUIREMENTS, [plan_id=109], [shuffle_writer_type=hash], [OUTPUT] List(appid:LongType, type:LongType, info:MapType(StringType,StringType,true)), [OUTPUT] List(appid:LongType, type:LongType, info:MapType(StringType,StringType,true))
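
The origin check described above can be modeled with a small sketch in plain Python. The enum is a stand-in whose member names mirror Spark's ShuffleOrigin values, and the supported set reflects my understanding of the rule in recent Spark 3.x releases (REBALANCE_PARTITIONS_BY_NONE and REBALANCE_PARTITIONS_BY_COL). rule_applies is a hypothetical helper, not a Spark API.

```python
from enum import Enum, auto


class ShuffleOrigin(Enum):
    """Stand-in for Spark's ShuffleOrigin; names mirror the Spark values."""
    ENSURE_REQUIREMENTS = auto()
    REPARTITION_BY_COL = auto()
    REPARTITION_BY_NUM = auto()
    REBALANCE_PARTITIONS_BY_NONE = auto()
    REBALANCE_PARTITIONS_BY_COL = auto()


# Assumption: the origins OptimizeSkewInRebalancePartitions accepts.
SUPPORTED_ORIGINS = {
    ShuffleOrigin.REBALANCE_PARTITIONS_BY_NONE,
    ShuffleOrigin.REBALANCE_PARTITIONS_BY_COL,
}


def rule_applies(origin):
    """Hypothetical helper: would the rule consider a shuffle with this origin?"""
    return origin in SUPPORTED_ORIGINS


# The ColumnarExchange in the log above was planned with ENSURE_REQUIREMENTS,
# so the rule skips it.
print(rule_applies(ShuffleOrigin.ENSURE_REQUIREMENTS))
```

A shuffle gets a rebalance origin only when the query asks for one explicitly (for example through a REBALANCE hint), which is not the case for an exchange inserted to satisfy a join's distribution requirement.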
zhanglistar commented 2 weeks ago

It seems that AQE is not behaving correctly here: it shrinks the task count from 900+ to 20+.

wForget commented 2 weeks ago

> OptimizeSkewedJoin only works on SortMergeJoinExec and ShuffledHashJoinExec. (Why does broadcast hash join not work?)

The OptimizeShuffleWithLocalRead rule in AQE is used to optimize BroadcastHashJoin to avoid skew.

lgbo-ustc commented 2 weeks ago

Thanks @wForget, I think the skew is caused by this rule being disabled, because we use RSS, which doesn't support local read.

wForget commented 2 weeks ago

> Thanks @wForget, I think the skew is caused by this rule being disabled, because we use RSS, which doesn't support local read.

This problem does exist in the RSS scenario. I tried to fix it in Spark https://github.com/apache/spark/pull/41609, but unfortunately it was not merged upstream.

lgbo-ustc commented 2 weeks ago

> > Thanks @wForget, I think the skew is caused by this rule being disabled, because we use RSS, which doesn't support local read.
>
> This problem does exist in the RSS scenario. I tried to fix it in Spark apache/spark#41609, but unfortunately it was not merged upstream.

It's strange: when we enable spark.sql.adaptive.localShuffleReader.enabled, the query runs OK and fast, and the skew disappears.

lgbo-ustc commented 2 weeks ago

@wForget In what scenarios would RSS bring a performance drawback? I wonder whether we could use the local shuffle reader in some specific scenarios.

wForget commented 2 weeks ago

When using RSS, localShuffleReader should not be enabled in most cases, as this may cause random reads again.

zhanglistar commented 2 weeks ago

For now, we just disable broadcast join to avoid triggering this problem. This is not a Gluten problem.
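
For reference, a minimal sketch of how such a workaround could be passed on the command line. The thread does not say which setting was used. Setting spark.sql.autoBroadcastJoinThreshold to -1 is the standard Spark way to disable broadcast joins and is assumed here. spark_submit_conf_flags is a hypothetical helper that only renders --conf arguments.

```python
def spark_submit_conf_flags(conf):
    """Render a dict of Spark settings as spark-submit --conf arguments."""
    flags = []
    for key, value in conf.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


# Assumed workaround settings (not confirmed by the thread):
workaround = {
    # Keep AQE on so OptimizeSkewedJoin can act on the shuffled join.
    "spark.sql.adaptive.enabled": "true",
    # -1 disables broadcast joins, so the planner picks a shuffle-based join.
    "spark.sql.autoBroadcastJoinThreshold": "-1",
}

print(" ".join(spark_submit_conf_flags(workaround)))
```

With broadcast join disabled, the join falls back to a sort-merge or shuffled hash join, which are the two join types OptimizeSkewedJoin can split.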