apache / celeborn

Apache Celeborn is an elastic and high-performance service for shuffle and spilled data.
https://celeborn.apache.org/
Apache License 2.0

[CELEBORN-1568] Support worker retries in MiniCluster #2692

Closed cxzl25 closed 2 months ago

cxzl25 commented 2 months ago

What changes were proposed in this pull request?

Why are the changes needed?

https://github.com/apache/celeborn/actions/runs/10417785546/job/28852691241#step:4:5804

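With the current worker retry logic, the sleep time is computed as 2000 raised to the power of the retry count rather than as a delay that grows from a 2000 ms base. The first retry sleeps 2000 ms, the second computes 4000000 ms (about 67 minutes), and the third computes 8000000000 ms (about 92 days), so the retries are very unlikely to ever complete.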

Thread.sleep(math.pow(2000, workerStartRetry).toInt)
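The arithmetic behind that expression can be checked with a small, self-contained sketch. `buggyDelayMs` reproduces the expression quoted above. `backoffDelayMs`, its `baseMs` and `maxMs` parameters, and the object name `BackoffSketch` are hypothetical names introduced only for illustration; they show one conventional capped exponential backoff and are not necessarily the change made in this PR. Note that 8000000000 does not fit in an `Int`, so `.toInt` saturates at `Int.MaxValue` (2147483647 ms, roughly 24.8 days) on the third retry.

```scala
object BackoffSketch {
  // Delay produced by the quoted expression: 2000 raised to the retry count.
  // Double.toInt saturates, so values above Int.MaxValue become Int.MaxValue.
  def buggyDelayMs(retry: Int): Int = math.pow(2000, retry).toInt

  // Hypothetical alternative (an assumption, not the PR's code):
  // start at baseMs, double on every further retry, and cap at maxMs.
  def backoffDelayMs(retry: Int, baseMs: Long = 2000L, maxMs: Long = 30000L): Long = {
    // Clamp the exponent so the shift can neither go negative nor overflow.
    val exponent = math.min(math.max(retry - 1, 0), 20)
    math.min(baseMs * (1L << exponent), maxMs)
  }

  def main(args: Array[String]): Unit = {
    (1 to 3).foreach { retry =>
      println(s"retry=$retry buggy=${buggyDelayMs(retry)} ms backoff=${backoffDelayMs(retry)} ms")
    }
  }
}
```

With these assumed defaults the capped variant waits 2000, 4000, and 8000 ms for the first three retries, which keeps a test run inside a CI job's time budget.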

Does this PR introduce any user-facing change?

No

How was this patch tested?

GitHub Actions (GA)