ShifuML / shifu

An end-to-end machine learning and data mining framework on Hadoop
https://github.com/ShifuML/shifu/wiki
Apache License 2.0

TensorFlow: Chief Worker Fault Tolerance #608

Open zhangpengshan opened 5 years ago

zhangpengshan commented 5 years ago

Currently the chief worker is always worker task 0. If the chief worker fails, the whole job has to be retriggered, so the chief is a single point of failure.

This can be improved by using ZooKeeper: if the chief fails, another worker is elected as the new chief worker. See this video: https://www.youtube.com/watch?v=la_M6bCV91M
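
To make the idea concrete, here is a rough sketch of the usual ZooKeeper-style election rule: each worker registers an ephemeral sequential node, and the live worker with the lowest sequence number acts as chief. This is only an in-memory simulation, not Shifu code and not a real ZooKeeper client. `InMemoryElection`, `join`, `leave`, and `chief` are hypothetical names made up for illustration; a real setup would rely on a ZooKeeper client library and on session expiry to remove a dead worker's node.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class InMemoryElection:
    """Toy stand-in for a ZooKeeper election path (hypothetical, no real ZooKeeper involved)."""

    _next_seq: int = 0
    _members: Dict[int, str] = field(default_factory=dict)  # sequence number -> worker id

    def join(self, worker_id: str) -> int:
        """Register a worker, mimicking the creation of an ephemeral sequential node."""
        seq = self._next_seq
        self._next_seq += 1
        self._members[seq] = worker_id
        return seq

    def leave(self, seq: int) -> None:
        """Drop a worker, mimicking node removal when the worker's session dies."""
        self._members.pop(seq, None)

    def chief(self) -> Optional[str]:
        """Return the live worker with the lowest sequence number, or None if empty."""
        if not self._members:
            return None
        return self._members[min(self._members)]


election = InMemoryElection()
seqs = {f"worker-{i}": election.join(f"worker-{i}") for i in range(3)}

first_chief = election.chief()      # worker-0 registered first, so it starts as chief
election.leave(seqs["worker-0"])    # simulate the chief worker failing
second_chief = election.chief()     # the next-lowest sequence number takes over

print(first_chief, second_chief)    # prints "worker-0 worker-1"
```

With this rule the job would not need to be retriggered when task 0 dies, as long as the surviving workers notice the change and the new chief can take over the chief's duties (for example, resuming from the latest checkpoint). That takeover logic is not shown here.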