apache / celeborn

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.
https://celeborn.apache.org/
Apache License 2.0
862 stars 351 forks source link

[CELEBORN-1567] Support throw FetchFailedException when Data corruption detected #2691

Closed cxzl25 closed 3 weeks ago

cxzl25 commented 4 weeks ago

What changes were proposed in this pull request?

Why are the changes needed?

https://github.com/apache/celeborn/pull/2655#pullrequestreview-2213124224

Does this PR introduce any user-facing change?

No

How was this patch tested?

GA

cfmcgrady commented 3 weeks ago

I have a question: should we retry fetching another replication before throwing a FetchFailedException when the conf celeborn.client.push.replicate.enabled is set to true?

cxzl25 commented 3 weeks ago

I have a question: should we retry fetching another replication before throwing a FetchFailedException when the conf celeborn.client.push.replicate.enabled is set to true?

This is not necessarily safe, because the Task may have read part of the data, so it is safer to retry the Task. This is how Spark handles it.

cxzl25 commented 3 weeks ago

support for checksum/validation of data would a good feature

It looks like we've already done this.

https://github.com/apache/celeborn/blob/0ee3c3a4bdef3b97fca6be04f03ac36c2e12e1a6/client/src/main/java/org/apache/celeborn/client/compress/ZstdDecompressor.java#L69-L72

https://github.com/apache/celeborn/blob/0ee3c3a4bdef3b97fca6be04f03ac36c2e12e1a6/client/src/main/java/org/apache/celeborn/client/compress/Lz4Decompressor.java#L82-L85

cfmcgrady commented 3 weeks ago

Thank you, merging to main(v0.6.0)/branch-0.5(v0.5.2)/branch-0.4(v0.4.3).