Open turboFei opened 1 week ago
It seems difficult to add a UT. What do you think? @FMX
It is too difficult to add the UT.
Gentle ping @mridulm
Hi, I see this PR. IMO, you can add a test config to trigger task hang and fetch failure in certain map tasks. Maybe it won't be too difficult to add UTs.
Thanks, added UT and tested locally.
The UT is invalid, checking
For a local master, the speculationScheduler is not started,
and a speculative task is also not allowed to launch on the same host.
@FMX
I have to give up the UT for speculation ...
And only add UT for SparkUtils.
What changes were proposed in this pull request?
Prevent stage re-run if another task attempt is running.
If a shuffle read task cannot read the shuffle data and another attempt of the same task is running or has already succeeded, just throw the CelebornIOException instead of FetchFailureException.
The app will not fail before the task reaches maxFailures.
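The decision described above can be sketched roughly as follows. This is only an illustration: `AttemptInfo` and `FetchFailureDecision` are hypothetical names, not the actual Spark or Celeborn types, and the real check in SparkUtils may look quite different.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified view of one attempt of a task (not a real Spark/Celeborn class).
final class AttemptInfo {
    final long taskId;
    final boolean running;
    final boolean successful;

    AttemptInfo(long taskId, boolean running, boolean successful) {
        this.taskId = taskId;
        this.running = running;
        this.successful = successful;
    }
}

final class FetchFailureDecision {
    // True when an attempt other than the failing one is still running or has succeeded.
    static boolean anotherAttemptRunningOrSuccessful(long failedTaskId, List<AttemptInfo> attempts) {
        for (AttemptInfo a : attempts) {
            if (a.taskId != failedTaskId && (a.running || a.successful)) {
                return true;
            }
        }
        return false;
    }

    // Name of the exception the shuffle reader would throw under this sketch.
    static String exceptionToThrow(long failedTaskId, List<AttemptInfo> attempts) {
        // A plain IO exception only fails this attempt; a fetch failure triggers a stage re-run.
        return anotherAttemptRunningOrSuccessful(failedTaskId, attempts)
                ? "CelebornIOException"
                : "FetchFailureException";
    }

    public static void main(String[] args) {
        List<AttemptInfo> attempts = Arrays.asList(
                new AttemptInfo(1L, false, false),  // the attempt that failed to fetch
                new AttemptInfo(2L, true, false));  // a speculative attempt still running
        System.out.println(exceptionToThrow(1L, attempts));
    }
}
```

Because only the failing attempt is counted as a failure, the stage is re-run only when no other attempt can still finish the task.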
Why are the changes needed?
I ran into the issue below because I set the wrong parameter. I should have set
spark.celeborn.data.io.connectTime=30s
but set spark.celeborn.data.io.connectionTime=30s
instead, and the disk IO utilization was high at that time. Because a stage re-run is heavy, I wonder whether we should ignore the shuffle fetch failure if another task attempt is running.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
UT for the SparkUtils method only, because it is impossible to add a UT for speculation.
https://github.com/apache/spark/blob/d5da49d56d7dec5f8a96c5252384d865f7efd4d9/core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala#L236-L244
For a local master, the speculationScheduler is not started.
https://github.com/apache/spark/blob/d5da49d56d7dec5f8a96c5252384d865f7efd4d9/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L322-L346
Also, a speculative task is not allowed to launch on the same host.
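To illustrate why these two constraints block a speculation UT, here is a minimal sketch of them. `SpeculationConstraints` and its methods are hypothetical and only approximate the linked scheduler code; the master-string check is an assumption made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the two constraints; not the real TaskSchedulerImpl/TaskSetManager logic.
final class SpeculationConstraints {
    // Assumption: a local master looks like "local", "local[4]" or "local[*]".
    static boolean isLocalMaster(String master) {
        return master.equals("local") || master.startsWith("local[");
    }

    // A speculative attempt is possible only off a local master and on a host
    // that is not already running an attempt of the same task.
    static boolean canLaunchSpeculativeAttempt(
            String master, String offeredHost, Set<String> hostsWithAttempt) {
        if (isLocalMaster(master)) {
            return false; // speculation scheduler is never started for a local master
        }
        return !hostsWithAttempt.contains(offeredHost);
    }

    public static void main(String[] args) {
        Set<String> hosts = new HashSet<>(Arrays.asList("host-a"));
        // A single-machine test has one host, so both calls below are rejected.
        System.out.println(canLaunchSpeculativeAttempt("local[2]", "host-a", hosts));
        System.out.println(canLaunchSpeculativeAttempt("spark://m:7077", "host-a", hosts));
    }
}
```

Under this sketch, a unit test running on one machine never satisfies both conditions, which matches the reason given for testing only the SparkUtils method.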