codenohup closed 2 weeks ago
The ReadBufferDispatcher may encounter an exception when a Netty OutOfDirectMemoryError occurs. In this case, we should allow the map partition reader to retry; otherwise, the Flink Task Manager could hang. cc @SteNicholas @mridulm
@codenohup, thanks for the contribution. Merged to main (v0.6.0) and branch-0.5 (v0.5.2).
What changes were proposed in this pull request?
When the ReadBufferDispatcher encounters an exception, it should notify the listener of that exception. The listener is then responsible for informing the Celeborn client of the error and triggering fault-tolerance handling, such as retrying the read.
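As a rough illustration of the notification pattern described here, the following is a minimal, self-contained sketch and not the actual Celeborn code. The names `DispatcherSketch`, `BufferListener`, `dispatch`, and `readOnce` are hypothetical, and a plain `OutOfMemoryError` stands in for Netty's `OutOfDirectMemoryError` (which is a subclass of it). The point is only that the dispatcher calls back on both the success path and the failure path, so the waiting side always gets an answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.IntFunction;

public class DispatcherSketch {

  /** Hypothetical listener; stands in for the reader that waits for buffers. */
  public interface BufferListener {
    void onBuffers(List<byte[]> buffers);

    void onError(Throwable cause);
  }

  /**
   * Hypothetical dispatcher step: allocate the requested buffers and always
   * invoke exactly one listener callback, even when allocation fails.
   */
  public static void dispatch(
      IntFunction<byte[]> allocator, int count, int size, BufferListener listener) {
    List<byte[]> buffers = new ArrayList<>();
    try {
      for (int i = 0; i < count; i++) {
        buffers.add(allocator.apply(size));
      }
    } catch (Throwable t) {
      // Out-of-memory failures are Errors, not Exceptions, so catching
      // Exception alone would skip the callback and leave the listener waiting.
      listener.onError(t);
      return;
    }
    listener.onBuffers(buffers);
  }

  /** Runs one dispatch and reports which callback the listener received. */
  public static String readOnce(IntFunction<byte[]> allocator) {
    AtomicReference<String> outcome = new AtomicReference<>("no callback");
    dispatch(allocator, 2, 16, new BufferListener() {
      @Override
      public void onBuffers(List<byte[]> buffers) {
        outcome.set("buffers:" + buffers.size());
      }

      @Override
      public void onError(Throwable cause) {
        // A real listener would surface this to the client so it can retry.
        outcome.set("error:" + cause.getMessage());
      }
    });
    return outcome.get();
  }
}
```

In this sketch, `readOnce(size -> new byte[size])` ends in the `onBuffers` callback, while an allocator that throws ends in `onError`; in neither case is the listener left without a callback.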
Why are the changes needed?
If the ReadBufferDispatcher does not notify the listener of an exception, the listener (MapPartitionDataReader) may wait indefinitely for buffers that never arrive, ultimately causing the job to hang.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a unit test case.