apache / celeborn

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.
https://celeborn.apache.org/
Apache License 2.0
886 stars, 359 forks

[CELEBORN-1686] Avoid return the same pushTaskQueue #2878

Open cxzl25 opened 1 day ago

cxzl25 commented 1 day ago

What changes were proposed in this pull request?

Why are the changes needed?

The close method called from SortBasedShuffleWriter#write calls sendBufferPool.returnPushTaskQueue(dataPusher.getIdleQueue()), but close may be interrupted.

After the interruption, SortBasedShuffleWriter#cleanupPusher is called, and it calls sendBufferPool.returnPushTaskQueue(dataPusher.getIdleQueue()) again.

Since SendBufferPool#pushTaskQueues is a LinkedList, adding the same idleQueue twice stores two references to it. Multiple tasks running in parallel may then be handed the same idleQueue and share it, which can produce incorrect data.
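The duplicate-return problem can be illustrated with a minimal sketch. This is not the Celeborn implementation or the actual patch: the class PushTaskQueuePoolSketch and its methods returnUnguarded, returnGuarded, acquire, and size are hypothetical names, and the only assumption taken from the description is that the pool is backed by a LinkedList. The guarded variant shows one possible way to reject a queue instance that is already pooled.

```java
import java.util.LinkedList;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical stand-in for a pool backed by a LinkedList, as described above.
// It is NOT the real SendBufferPool; names and behavior are assumptions.
final class PushTaskQueuePoolSketch {
    private final LinkedList<LinkedBlockingQueue<Object>> pushTaskQueues = new LinkedList<>();

    // Unguarded return: calling this twice with the same queue
    // stores two references to that one instance.
    public synchronized void returnUnguarded(LinkedBlockingQueue<Object> queue) {
        pushTaskQueues.addLast(queue);
    }

    // Guarded return: skip the add when this exact instance is already pooled.
    // Returns true only if the queue was actually added.
    public synchronized boolean returnGuarded(LinkedBlockingQueue<Object> queue) {
        for (LinkedBlockingQueue<Object> pooled : pushTaskQueues) {
            if (pooled == queue) {
                return false;
            }
        }
        pushTaskQueues.addLast(queue);
        return true;
    }

    // Hands out a pooled queue, or a fresh one when the pool is empty.
    public synchronized LinkedBlockingQueue<Object> acquire() {
        return pushTaskQueues.isEmpty() ? new LinkedBlockingQueue<>() : pushTaskQueues.removeFirst();
    }

    public synchronized int size() {
        return pushTaskQueues.size();
    }

    public static void main(String[] args) {
        LinkedBlockingQueue<Object> idleQueue = new LinkedBlockingQueue<>();

        PushTaskQueuePoolSketch unguarded = new PushTaskQueuePoolSketch();
        unguarded.returnUnguarded(idleQueue); // first return, e.g. from close
        unguarded.returnUnguarded(idleQueue); // second return, e.g. from cleanup
        // Two acquirers now receive the very same instance.
        System.out.println(unguarded.acquire() == unguarded.acquire());

        PushTaskQueuePoolSketch guarded = new PushTaskQueuePoolSketch();
        guarded.returnGuarded(idleQueue);
        guarded.returnGuarded(idleQueue); // ignored: already pooled
        System.out.println(guarded.size());
    }
}
```

The identity scan is linear in the pool size, which is likely acceptable for a small pool. Other approaches, such as tracking a "returned" flag on the writer side, would avoid the scan.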

Does this PR introduce any user-facing change?

How was this patch tested?

Verified in a production environment.

RexXiong commented 1 hour ago

If the close method were interrupted, the sendBufferPool would not call returnPushTaskQueue, so sendBufferPool.returnPushTaskQueue(dataPusher.getIdleQueue()) would not be called twice. Please correct me if I'm wrong.