Closed by viirya 1 month ago
I've copied the tests from my branch to this PR, and the tests hang:
running 6 tests
test execution::datafusion::shuffle_writer::test::test_slot_size ... ok
test execution::datafusion::shuffle_writer::test::test_pmod ... ok
test execution::datafusion::shuffle_writer::test::test_insert_larger_batch ... ok
test execution::datafusion::shuffle_writer::test::test_insert_smaller_batch ... ok
test execution::datafusion::shuffle_writer::test::test_large_number_of_partitions has been running for over 60 seconds
test execution::datafusion::shuffle_writer::test::test_large_number_of_partitions_spilling has been running for over 60 seconds
^C
It is possibly caused by a deadlock on buffered_partitions.lock() when spilling is triggered.
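For readers unfamiliar with this failure mode, here is a minimal sketch of how a spill triggered while a mutex guard is still alive can block forever, and one way to avoid it. It is not the actual shuffle writer code: the `Repartitioner` struct, its methods, and the byte-vector buffers are hypothetical stand-ins, and the real cause in this PR may differ. The sketch uses `try_lock()` in the spill path so that the problematic case returns `None` instead of hanging as a plain `lock()` call could.

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the shuffle writer state (illustrative names only).
struct Repartitioner {
    buffered_partitions: Mutex<Vec<Vec<u8>>>,
}

impl Repartitioner {
    fn new(num_partitions: usize) -> Self {
        Self {
            buffered_partitions: Mutex::new(vec![Vec::new(); num_partitions]),
        }
    }

    /// Spill path: acquires the lock itself and clears every partition buffer.
    /// Returns the number of bytes freed, or `None` if the lock is already held.
    /// With `lock()` instead of `try_lock()`, the `None` case could block forever
    /// when the holder is the calling thread.
    fn spill(&self) -> Option<usize> {
        let mut parts = self.buffered_partitions.try_lock().ok()?;
        let freed: usize = parts.iter().map(|p| p.len()).sum();
        parts.iter_mut().for_each(|p| p.clear());
        Some(freed)
    }

    /// Problematic shape: the spill is triggered while the insert guard is alive.
    fn insert_holding_lock(&self, partition: usize, data: &[u8]) -> Option<usize> {
        let mut parts = self.buffered_partitions.lock().unwrap();
        parts[partition].extend_from_slice(data);
        // `parts` is still held here, so `spill` cannot acquire the lock.
        self.spill()
    }

    /// Safer shape: the guard goes out of scope before the spill starts.
    fn insert_then_spill(&self, partition: usize, data: &[u8]) -> Option<usize> {
        {
            let mut parts = self.buffered_partitions.lock().unwrap();
            parts[partition].extend_from_slice(data);
        } // guard dropped here
        self.spill()
    }
}
```

In this sketch, `insert_holding_lock` never manages to spill, while `insert_then_spill` does, because the guard's scope ends before the spill path asks for the lock.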
Thanks. I know the cause of the deadlocks. I'm going to revamp some of the code.
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 33.97%. Comparing base (c3023c5) to head (e678cb0). Report is 25 commits behind head on main.
Hmm, the tests for shuffle with a large number of partitions fail only on the macOS runners, and there is no stack trace. I cannot reproduce the failure locally.
Okay, it is the error I expected before:
ret: Err(ArrowError(ExternalError(IoError(Custom { kind: Uncategorized, error: PathError { path: "/var/folders/t_/mmhnh941511_hp2lwh383bp00000gn/T/.tmpQv8o2b/.tmpioYozN", err: Os { code: 24, kind: Uncategorized, message: "Too many open files" } } })), None))
But I had already increased the open-file limit with ulimit, and it doesn't help.
I'm testing this PR out now, in conjunction with some other PRs because I currently have a reproducible deadlock caused by memory pool issues, as far as I can tell.
Thanks @andygrove @Kontinuation
Which issue does this PR close?
Closes #887.
Rationale for this change
What changes are included in this PR?
How are these changes tested?