The current policy is to detect whether a worker goes over the threshold three times. Guagua then kills that worker and reruns it on another machine.
In some cases this does not work well on a busy Hadoop cluster. Sometimes a worker is very slow but never goes over the threshold, which causes bad performance.
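Consider this policy instead:
In each iteration, the master receives the running times of all workers. If a worker's running time is too far above the others, as measured by the standard deviation (std) of that iteration's running times, treat it as a straggler. This should work better than the original policy.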
Found one case: 442 workers each take about 10s of computation time and 1 worker takes about 30s. The threshold is set to 40s, so the straggler is never detected.
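
A minimal sketch of how the std-based check could look, using the numbers from the case above. This is not Guagua code: the function name `find_stragglers` and the multiplier `k = 3.0` are assumptions made for illustration (the proposal only says "over std"), and the real check would live in the master, not in a standalone script.

```python
from statistics import mean, pstdev

FIXED_THRESHOLD = 40.0  # seconds, the threshold from the reported case


def find_stragglers(times, k=3.0):
    """Return indices of workers whose time exceeds mean + k * std.

    `k` is a hypothetical tuning knob, not an existing Guagua setting.
    """
    if len(times) < 2:
        return []
    mu = mean(times)
    sigma = pstdev(times)  # population std over this iteration's workers
    if sigma == 0:
        return []  # all workers took exactly the same time
    cutoff = mu + k * sigma
    return [i for i, t in enumerate(times) if t > cutoff]


# The reported case: 442 workers at about 10s, 1 worker at about 30s.
times = [10.0] * 442 + [30.0]

# Fixed-threshold policy: nobody is over 40s, so nothing is flagged.
over_fixed = [i for i, t in enumerate(times) if t > FIXED_THRESHOLD]

# Std-based policy: the 30s worker is far above the rest and is flagged.
stragglers = find_stragglers(times)

print(over_fixed)   # []
print(stragglers)   # [442]
```

Here the mean is about 10.05s and the std is about 0.95s, so the cutoff is about 12.9s and only the 30s worker is over it. The existing "three times" rule could still be applied on top of this check so that a single noisy iteration does not trigger a kill.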