apache / celeborn

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.
https://celeborn.apache.org/
Apache License 2.0

[CELEBORN-1558] Fix the incorrect decrement of pendingWrites in handlePushMergeData #2677

Closed RexXiong closed 1 month ago

RexXiong commented 1 month ago

What changes were proposed in this pull request?

  1. Fix the incorrect decrement of pendingWrites for FileWriter
  2. Improve the log messages for hardSplit and for exceptions

Why are the changes needed?

handlePushMergeData writes data through multiple file writers. If an earlier FileWriter has already been closed, the next decrementPendingWrites call is applied to the wrong FileWriter. The writer that should have been decremented keeps a non-zero pendingWrites counter, which causes a timeout during commitFiles:

java.io.IOException: Wait pending actions timeout, counter 1
    at org.apache.celeborn.service.deploy.worker.storage.PartitionDataWriter.waitOnNoPending(PartitionDataWriter.java)
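
To illustrate the failure mode described above, here is a minimal, self-contained sketch. It is not Celeborn's code: `SketchWriter`, `pushMerged`, and `twoWriters` are hypothetical names, and the "stale reference" bug is only one assumed way a decrement could land on the wrong writer. It shows how one misdirected decrement leaves another writer's counter stuck at 1, which matches the `counter 1` in the exception.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class PendingWritesSketch {

    /** Hypothetical stand-in for a partition writer; NOT Celeborn's PartitionDataWriter. */
    static final class SketchWriter {
        final AtomicInteger pendingWrites = new AtomicInteger();
        final boolean closed;

        SketchWriter(boolean closed) { this.closed = closed; }

        void incrementPendingWrites() { pendingWrites.incrementAndGet(); }
        void decrementPendingWrites() { pendingWrites.decrementAndGet(); }
    }

    /** Two writers: the first is already closed, the second is still open. */
    static List<SketchWriter> twoWriters() {
        return List.of(new SketchWriter(true), new SketchWriter(false));
    }

    /**
     * Simulates one merged push that touches every writer once.
     * With buggy=true, a stale reference to the first closed writer is reused
     * for all later decrements (an assumed bug shape, for illustration only).
     * Returns the final pendingWrites counter of each writer.
     */
    static int[] pushMerged(List<SketchWriter> writers, boolean buggy) {
        // Each writer registers one pending write before the data is handled.
        for (SketchWriter w : writers) {
            w.incrementPendingWrites();
        }
        SketchWriter firstClosed = null;
        for (SketchWriter w : writers) {
            if (w.closed && firstClosed == null) {
                firstClosed = w;
            }
            // Correct behaviour: always decrement the writer being processed.
            SketchWriter target = (buggy && firstClosed != null) ? firstClosed : w;
            target.decrementPendingWrites();
        }
        return writers.stream().mapToInt(w -> w.pendingWrites.get()).toArray();
    }

    public static void main(String[] args) {
        // prints "buggy: [-1, 1]": the open writer never drains, so a wait would time out
        System.out.println("buggy: " + Arrays.toString(pushMerged(twoWriters(), true)));
        // prints "fixed: [0, 0]": every writer is decremented exactly once
        System.out.println("fixed: " + Arrays.toString(pushMerged(twoWriters(), false)));
    }
}
```

In the buggy run the second writer ends with a counter of 1, so any code that waits for its pending writes to reach zero would eventually give up with a timeout.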

Does this PR introduce any user-facing change?

No

How was this patch tested?

Passes GA (GitHub Actions) and manual testing.

mridulm commented 1 month ago

+CC @akpatnam25