apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0

chore: Reserve memory for native shuffle writer per partition #1022

Closed viirya closed 1 month ago

viirya commented 1 month ago

Which issue does this PR close?

Closes #1019.

Rationale for this change

This restores the patch merged in #988. That patch caused issue #1019; this PR includes a fix for it.

What changes are included in this PR?

How are these changes tested?

Manually ran the TPC-H benchmark locally.

andygrove commented 1 month ago

I am testing this PR out now with benchmarks.

andygrove commented 1 month ago

I am testing with TPC-H sf=100. I usually test with one executor and 8 cores, but with this PR I can only run with a single core. I tried with 2 cores with this config:

    --conf spark.executor.instances=1 \
    --conf spark.executor.memory=16G \
    --conf spark.executor.cores=2 \
    --conf spark.cores.max=2 \
    --conf spark.memory.offHeap.enabled=true \
    --conf spark.memory.offHeap.size=20g \

The job fails with:

org.apache.spark.SparkException: 
  Job aborted due to stage failure: Task 0 in stage 251.0 failed 4 times, most recent failure: 
  Lost task 0.3 in stage 251.0 (TID 2171) (10.0.0.118 executor 0): 
  org.apache.comet.CometNativeException: 
  External error: 
  Internal error: Partition is still not able to allocate enough memory for the array builders after spilling..
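
For readers unfamiliar with this failure mode, the following is a minimal Python sketch of a reserve, spill, and retry loop that ends in an error of this shape. It is an illustration under stated assumptions, not Comet's actual Rust implementation: `MemoryPool`, `ShuffleWriter`, `builder_bytes`, and `OutOfMemory` are hypothetical names, and the sizes are arbitrary. The idea is that each output partition reserves a fixed amount for its array builders; when the pool is exhausted the writer spills and retries once, and if memory held outside the writer still leaves too little room, the retry fails as well.

```python
from dataclasses import dataclass


class OutOfMemory(Exception):
    """Raised when a reservation fails even after spilling (hypothetical)."""


@dataclass
class MemoryPool:
    """Hypothetical fixed-size pool shared by everything running in one task."""
    capacity: int
    used: int = 0

    def try_grow(self, nbytes: int) -> bool:
        # Refuse the request if it would exceed the pool's capacity.
        if self.used + nbytes > self.capacity:
            return False
        self.used += nbytes
        return True

    def shrink(self, nbytes: int) -> None:
        self.used -= nbytes


class ShuffleWriter:
    """Sketch of a writer that reserves builder memory per output partition."""

    def __init__(self, pool: MemoryPool, num_partitions: int, builder_bytes: int):
        self.pool = pool
        self.builder_bytes = builder_bytes      # assumed fixed cost per partition
        self.reserved = [0] * num_partitions    # bytes reserved by each partition
        self.spill_count = 0

    def spill(self) -> None:
        # A real writer would flush buffered rows to disk here; the sketch
        # only releases the memory this writer had reserved.
        self.pool.shrink(sum(self.reserved))
        self.reserved = [0] * len(self.reserved)
        self.spill_count += 1

    def reserve(self, partition_id: int) -> None:
        if self.reserved[partition_id]:
            return  # builders for this partition are already reserved
        if not self.pool.try_grow(self.builder_bytes):
            self.spill()
            # Spilling only frees this writer's own reservations, so the
            # retry can still fail if other consumers hold the pool.
            if not self.pool.try_grow(self.builder_bytes):
                raise OutOfMemory(
                    "not able to allocate memory for builders after spilling"
                )
        self.reserved[partition_id] = self.builder_bytes
```

In this sketch, a pool of 100 bytes with 40-byte builders lets two partitions reserve memory, and the third reservation succeeds only after one spill. If another consumer already holds 70 of the 100 bytes, the first reservation fails even after spilling, because the spill has nothing of the writer's own to release.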

viirya commented 1 month ago

I will try it with sf=100.

codecov-commenter commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 34.43%. Comparing base (591f45a) to head (fd78a74). Report is 3 commits behind head on main.

Additional details and impacted files

```diff
@@             Coverage Diff              @@
##               main    #1022      +/-   ##
============================================
+ Coverage     34.30%   34.43%   +0.13%
- Complexity      887      898      +11
============================================
  Files           112      112
  Lines         43429    43538     +109
  Branches       9623     9660      +37
============================================
+ Hits          14897    14994      +97
- Misses        25473    25479       +6
- Partials       3059     3065       +6
```

viirya commented 1 month ago

Thanks @andygrove