ShifuML / guagua

An iterative computing framework for both Hadoop MapReduce and Hadoop YARN.
https://github.com/ShifuML/guagua/wiki
Apache License 2.0

Straggler Mitigation Improvement #48

Open zhangpengshan opened 10 years ago

zhangpengshan commented 10 years ago

The current policy checks whether a worker's running time exceeds a threshold three times. If it does, Guagua kills the worker and reruns it on another machine.
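For readers who want the current policy spelled out, here is a minimal Python sketch. It is not Guagua's actual code: the class name `ThresholdStrikePolicy`, its parameters, and the choice to count strikes cumulatively rather than consecutively are all assumptions made for illustration.

```python
from collections import defaultdict


class ThresholdStrikePolicy:
    """Hypothetical model of the fixed-threshold policy described above."""

    def __init__(self, threshold_seconds, max_strikes=3):
        self.threshold_seconds = threshold_seconds
        self.max_strikes = max_strikes
        # worker id -> number of iterations in which it exceeded the threshold
        # (assumption: strikes accumulate and are never reset)
        self.strikes = defaultdict(int)

    def record(self, worker_id, seconds):
        """Record one iteration time; return True if the worker should be restarted."""
        if seconds > self.threshold_seconds:
            self.strikes[worker_id] += 1
        return self.strikes[worker_id] >= self.max_strikes
```

Under this model, a worker that is consistently slow but stays under `threshold_seconds` never collects a strike, no matter how many iterations it runs.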

This does not always work well on a busy Hadoop cluster. Sometimes a worker is very slow but never exceeds the threshold, which hurts overall performance.

Consider this policy instead: in each iteration the master receives the running times of all workers, and a worker whose running time exceeds a standard-deviation-based bound is treated as a straggler. This should work better than the original policy.
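One possible reading of the proposal, as a hedged Python sketch: flag any worker whose time in the current iteration is more than `k` standard deviations above the mean. The function name `find_stragglers`, the default `k = 3.0`, and the use of the population standard deviation are assumptions, not something this issue specifies.

```python
from statistics import mean, pstdev


def find_stragglers(times_by_worker, k=3.0):
    """Return ids of workers whose time exceeds mean + k * std for one iteration."""
    times = list(times_by_worker.values())
    if len(times) < 2:
        return []  # nothing to compare against
    mu = mean(times)
    sigma = pstdev(times)  # population standard deviation (assumed choice)
    if sigma == 0:
        return []  # all workers took exactly the same time
    cutoff = mu + k * sigma
    return [w for w, t in times_by_worker.items() if t > cutoff]
```

Because the cutoff is recomputed from each iteration's own times, it follows the cluster's current load instead of relying on a hand-tuned constant. One caveat: with very few workers, no single value can sit more than `k = 3` population standard deviations above the mean, so `k` would need tuning for small jobs.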

zhangpengshan commented 9 years ago

Found one case: 442 workers each take about 10s of computation time and 1 worker takes about 30s. The threshold is set to 40s, so the straggler is never detected.
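To show how a std-based rule would treat this case, here is the arithmetic. It assumes a cutoff of mean plus three population standard deviations; the factor 3 is an illustrative choice, not a value from this issue.

$$
\mu = \frac{442 \cdot 10 + 30}{443} \approx 10.05\,\text{s}, \qquad
\sigma^2 = \frac{442\,(10 - \mu)^2 + (30 - \mu)^2}{443} \approx 0.90, \qquad
\sigma \approx 0.95\,\text{s}
$$

$$
\mu + 3\sigma \approx 12.89\,\text{s} < 30\,\text{s} < 40\,\text{s}
$$

The 30s worker sits about 21 standard deviations above the mean, so it would be flagged under this assumed rule, while it stays below the fixed 40s threshold indefinitely.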