apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0
1.16k stars, 420 forks

[VL] Infer columnarShuffleSortPartitionsThreshold based on offheap memory #7318

Open wForget opened 1 week ago

wForget commented 1 week ago

Description

The default value of spark.gluten.sql.columnar.shuffle.sort.partitions.threshold is so large that even when we increase the number of partitions, hash-based shuffle is still used, which may cause OOM.

https://github.com/apache/incubator-gluten/blob/55671cbf7c8bcd945b97869110872bf949589438/shims/common/src/main/scala/org/apache/gluten/GlutenConfig.scala#L1007-L1013
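
To illustrate the idea in the title, here is a minimal sketch of how such a threshold could be derived from off-heap memory. It is not Gluten's implementation. The function names, the per-partition buffer size, and the fraction of off-heap memory reserved for shuffle are all hypothetical. The sketch assumes that hash-based shuffle holds roughly one fixed-size buffer per output partition, and that sort-based shuffle is chosen once the partition count exceeds the threshold.

```python
def infer_sort_partitions_threshold(
    offheap_bytes: int,
    task_slots: int,
    per_partition_buffer_bytes: int,
    shuffle_fraction: float = 0.5,
) -> int:
    """Estimate the largest partition count hash-based shuffle can afford.

    All parameters are illustrative assumptions, not Gluten settings:
      offheap_bytes              off-heap memory available to one executor
      task_slots                 number of tasks running concurrently
      per_partition_buffer_bytes assumed buffer held per output partition
      shuffle_fraction           share of off-heap memory given to shuffle
    """
    if offheap_bytes <= 0 or task_slots <= 0 or per_partition_buffer_bytes <= 0:
        raise ValueError("sizes and task_slots must be positive")
    if not 0 < shuffle_fraction <= 1:
        raise ValueError("shuffle_fraction must be in (0, 1]")

    # Memory one task may spend on shuffle buffers.
    per_task_budget = offheap_bytes * shuffle_fraction / task_slots
    # Number of per-partition buffers that fit in that budget (at least 1).
    return max(1, int(per_task_budget // per_partition_buffer_bytes))


def choose_shuffle_writer(num_partitions: int, threshold: int) -> str:
    """Pick sort-based shuffle once the partition count exceeds the threshold."""
    return "sort" if num_partitions > threshold else "hash"


if __name__ == "__main__":
    threshold = infer_sort_partitions_threshold(
        offheap_bytes=4 * 1024**3,              # 4 GiB off-heap (example value)
        task_slots=4,                           # 4 concurrent tasks (example value)
        per_partition_buffer_bytes=256 * 1024,  # 256 KiB per partition (assumed)
    )
    print(threshold, choose_shuffle_writer(5000, threshold))
```

With a threshold derived this way, raising the partition count past what the off-heap budget can hold would switch the job to sort-based shuffle instead of letting the hash-based buffers grow beyond the container limit.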

Error:

Job aborted due to stage failure: Task 603 in stage 6.0 failed 4 times, most recent failure: Lost task 603.3 in stage 6.0 (TID 14913) (node71-129-11-bdxs.qiyi.hadoop executor 487): ExecutorLostFailure (executor 487 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding physical memory limits. 4.7 GB of 4 GB physical memory used. Consider boosting spark.executor.memoryOverhead.

FelixYBW commented 6 days ago

We actually set the value to 4000, or 8 columns.