apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0
823 stars 163 forks source link

feat: Add a `spark.comet.exec.memoryPool` configuration for experimenting with various datafusion memory pool setups. #1021

Open Kontinuation opened 1 month ago

Kontinuation commented 1 month ago

Which issue does this PR close?

This issue relates to https://github.com/apache/datafusion-comet/issues/996 and https://github.com/apache/datafusion-comet/pull/1004

Rationale for this change

This is for investigating various approaches to simplify memory-related configuration and reduce the memory required to run large queries. @andygrove

What changes are included in this PR?

This PR adds a spark.comet.exec.memoryPool configuration for easily running queries using various memory pool setups.

How are these changes tested?

TODO: add tests running in native memory management mode.

codecov-commenter commented 1 month ago

Codecov Report

Attention: Patch coverage is 77.27273% with 5 lines in your changes missing coverage. Please review.

Project coverage is 34.27%. Comparing base (591f45a) to head (bd5a0c7). Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
...ain/scala/org/apache/comet/CometExecIterator.scala 73.68% 2 Missing and 3 partials :warning:
Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #1021 +/- ## ============================================ - Coverage 34.30% 34.27% -0.03% - Complexity 887 889 +2 ============================================ Files 112 112 Lines 43429 43502 +73 Branches 9623 9615 -8 ============================================ + Hits 14897 14912 +15 - Misses 25473 25542 +69 + Partials 3059 3048 -11 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

andygrove commented 1 month ago

Thanks for filing this @Kontinuation. I will close my PRs.

Also, there is a suggestion in https://github.com/apache/datafusion-comet/issues/1017 for always using unified memory.

Kontinuation commented 1 month ago

Thanks for filing this @Kontinuation. I will close my PRs.

Also, there is a suggestion in #1017 for always using unified memory.

I agree that using the unified memory manager is a better approach. Vanilla Spark operators and comet operators are governed by the same memory manager and they are all using offheap memory. Vanilla Spark operators can free some memory when comet operators are under pressure. I'll also put more work into improving the unified memory management.

I think the native memory management approach may still be relevant when users don't want vanilla Spark to use off-heap memory. We can set the default value of spark.comet.exec.memoryPool as greedy_task_shared to make comet memory overhead configuration more intuitive. If users want memory-intensive operators to spill properly, they can try out fair_spill_task_shared.