apache / celeborn

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.
https://celeborn.apache.org/
Apache License 2.0

[CELEBORN-1580] ReadBufferDispatcher should notify exception to listener #2707

Closed codenohup closed 2 weeks ago

codenohup commented 2 weeks ago

What changes were proposed in this pull request?

When the ReadBufferDispatcher encounters an exception, it should notify the listener of that exception. The listener is responsible for informing the Celeborn client of the error and initiating fault-tolerance strategies.

Why are the changes needed?

If the ReadBufferDispatcher does not notify the listener of the exception, the listener (MapPartitionDataReader) may be stuck in a prolonged wait state, ultimately causing the job to hang.
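
To illustrate the failure mode described above, here is a minimal sketch of a dispatcher that always reports an outcome to its listener. All names in it (`DispatcherSketch`, `BufferListener`, `dispatch`, `requestBuffers`) are hypothetical stand-ins, not Celeborn's actual classes or signatures, and the allocation failure is simulated with a plain `OutOfMemoryError`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical, simplified stand-ins; not Celeborn's real API.
class DispatcherSketch {

  /** Callback a reader registers when it asks for buffers. */
  interface BufferListener {
    void onBuffers(List<byte[]> buffers);

    void onError(Throwable cause);
  }

  /** Tries to allocate {@code count} buffers and always reports the outcome. */
  static void dispatch(int count, Supplier<byte[]> allocator, BufferListener listener) {
    List<byte[]> buffers = new ArrayList<>();
    try {
      for (int i = 0; i < count; i++) {
        buffers.add(allocator.get());
      }
    } catch (Throwable t) {
      // Without this call the reader would keep waiting for buffers that never arrive.
      listener.onError(t);
      return;
    }
    listener.onBuffers(buffers);
  }

  /** Reader-side view: the future completes on success or on failure, never hangs. */
  static CompletableFuture<Integer> requestBuffers(int count, Supplier<byte[]> allocator) {
    CompletableFuture<Integer> result = new CompletableFuture<>();
    dispatch(count, allocator, new BufferListener() {
      @Override
      public void onBuffers(List<byte[]> buffers) {
        result.complete(buffers.size());
      }

      @Override
      public void onError(Throwable cause) {
        result.completeExceptionally(cause);
      }
    });
    return result;
  }

  public static void main(String[] args) throws Exception {
    CompletableFuture<Integer> ok = requestBuffers(2, () -> new byte[16]);
    System.out.println("allocated buffers: " + ok.get(1, TimeUnit.SECONDS));

    // Simulated allocation failure (assumption: stands in for a direct-memory error).
    CompletableFuture<Integer> failed =
        requestBuffers(2, () -> { throw new OutOfMemoryError("simulated"); });
    System.out.println("failure reported: " + failed.isCompletedExceptionally());
  }
}
```

The sketch catches `Throwable` rather than `Exception` because memory-exhaustion failures in Java are `Error` subclasses, which a `catch (Exception e)` block would miss.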

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test case.

RexXiong commented 2 weeks ago

The ReadBufferDispatcher may encounter an exception when a Netty OutOfDirectMemoryError occurs. In this case, we should allow the map partition reader to retry; otherwise, the Flink Task Manager could hang. cc @SteNicholas @mridulm
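
As a rough illustration of the retry idea, the following sketch shows a bounded retry loop that only becomes possible once the failure is surfaced to the caller. `RetrySketch` and `readWithRetry` are hypothetical names introduced for this example and are not part of Celeborn or Flink; the actual retry logic in the client may look quite different.

```java
import java.util.concurrent.Callable;

// Hypothetical retry helper for illustration only.
class RetrySketch {

  /** Runs {@code read} up to {@code maxAttempts} times, returning the first successful result. */
  static <T> T readWithRetry(Callable<T> read, int maxAttempts) throws Exception {
    if (maxAttempts < 1) {
      throw new IllegalArgumentException("maxAttempts must be >= 1");
    }
    Throwable last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return read.call();
      } catch (Exception | OutOfMemoryError e) {
        // The failure was reported instead of swallowed, so the caller can try again.
        last = e;
      }
    }
    throw new Exception("read failed after " + maxAttempts + " attempts", last);
  }
}
```

A caller would wrap its read in a `Callable` and choose `maxAttempts` according to how long it is willing to wait before failing the task.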

SteNicholas commented 2 weeks ago

@codenohup, thanks for the contribution. Merged to main (v0.6.0) and branch-0.5 (v0.5.2).