apache / datafusion-comet

Apache DataFusion Comet Spark Accelerator
https://datafusion.apache.org/comet
Apache License 2.0
823 stars 163 forks source link

chore: Reserve memory for native shuffle writer per partition #988

Closed viirya closed 1 month ago

viirya commented 1 month ago

Which issue does this PR close?

Closes #887.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Kontinuation commented 1 month ago

I've copied the tests on my branch to this PR and the test hangs:

running 6 tests
test execution::datafusion::shuffle_writer::test::test_slot_size ... ok
test execution::datafusion::shuffle_writer::test::test_pmod ... ok
test execution::datafusion::shuffle_writer::test::test_insert_larger_batch ... ok
test execution::datafusion::shuffle_writer::test::test_insert_smaller_batch ... ok
test execution::datafusion::shuffle_writer::test::test_large_number_of_partitions has been running for over 60 seconds
test execution::datafusion::shuffle_writer::test::test_large_number_of_partitions_spilling has been running for over 60 seconds
^C

It is possibly caused by deadlocking on buffered_partitions.lock() when spilling is triggered.

viirya commented 1 month ago

Thanks. I knew the cause of the deadlocks. I'm going to revamp some codes.

codecov-commenter commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 33.97%. Comparing base (c3023c5) to head (e678cb0). Report is 25 commits behind head on main.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## main #988 +/- ## ============================================ - Coverage 34.03% 33.97% -0.07% + Complexity 875 857 -18 ============================================ Files 112 112 Lines 43289 43426 +137 Branches 9572 9622 +50 ============================================ + Hits 14734 14752 +18 - Misses 25521 25630 +109 - Partials 3034 3044 +10 ``` | [Flag](https://app.codecov.io/gh/apache/datafusion-comet/pull/988/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | Coverage Δ | | |---|---|---| | [](https://app.codecov.io/gh/apache/datafusion-comet/pull/988/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache) | `?` | | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=apache#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

viirya commented 1 month ago

Hmm, these tests for large partition number shuffle fail on MacOS runners only. And no stack trace...But I cannot reproduce it locally.

viirya commented 1 month ago

Okay, it is the error I expected before:

ret: Err(ArrowError(ExternalError(IoError(Custom { kind: Uncategorized, error: PathError { path: "/var/folders/t_/mmhnh941511_hp2lwh383bp00000gn/T/.tmpQv8o2b/.tmpioYozN", err: Os { code: 24, kind: Uncategorized, message: "Too many open files" } } })), None))

But I had increase it by ulimit. It doesn't help.

andygrove commented 1 month ago

I'm testing this PR out now, in conjunction with some other PRs because I currently have a reproducible deadlock caused by memory pool issues, as far as I can tell.

viirya commented 1 month ago

Thanks @andygrove @Kontinuation