Tencent / Firestorm

Firestorm is a Remote Shuffle Service that lets Apache Spark and Apache Hadoop MapReduce applications store shuffle data on remote servers.

[Improvement] Reduce the recomputation caused by bad node #169

Closed: jerqi closed this 2 years ago

jerqi commented 2 years ago

What changes were proposed in this pull request?

When MRAppMaster finds that a node is bad and that node has already executed some map tasks, it recomputes those tasks. This is unnecessary with RSS, because RSS does not store any shuffle data on those nodes. This change therefore stops bad nodes from triggering recomputation.
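The decision described above can be illustrated with a small standalone model. This is only a sketch of the idea, not the patch itself or any Hadoop API: `MapAttempt`, `tasks_to_rerun`, and the `shuffle_on_remote_servers` flag are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapAttempt:
    """Hypothetical record of one map task attempt (not a Hadoop class)."""
    task_id: str
    node: str
    succeeded: bool


def tasks_to_rerun(attempts, bad_node, shuffle_on_remote_servers):
    """Return IDs of successful map tasks to recompute when bad_node is marked bad.

    Without a remote shuffle service, map output lives on the node that ran
    the task, so output on a bad node is treated as lost. With a remote
    shuffle service the output is stored elsewhere, so nothing is rerun.
    """
    if shuffle_on_remote_servers:
        return []
    return [a.task_id for a in attempts if a.succeeded and a.node == bad_node]


attempts = [
    MapAttempt("map_0", "node-a", True),
    MapAttempt("map_1", "node-b", True),
    MapAttempt("map_2", "node-a", False),
]

# Local shuffle: the successful task on the bad node is recomputed.
print(tasks_to_rerun(attempts, "node-a", shuffle_on_remote_servers=False))
# Remote shuffle: no recomputation is triggered.
print(tasks_to_rerun(attempts, "node-a", shuffle_on_remote_servers=True))
```

In this model the failed attempt `map_2` is excluded in both cases, since it produced no output to lose; how the real application master handles failed attempts is outside the scope of this sketch.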

Why are the changes needed?

To avoid unnecessary recomputation. Recomputation can also cause reduce tasks to fail because of lost events.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually tested in our cluster.