cylondata / twister2

A composable framework for fast and scalable data analytics
https://twister2.org
Apache License 2.0

Ahmet/fault tolerance #907

Closed ahmet-uyar closed 4 years ago

ahmet-uyar commented 4 years ago

Fault tolerance works with checkpointing, both with and without ZooKeeper. Without ZooKeeper, however, a job master failure is fatal. Faulty workers are restarted. Faults are broadcast to all healthy workers; when a failure occurs in the job, they stop execution and return to WorkerManager as soon as possible. Workers try to restart and re-execute 5 times by default (this can be changed through config parameters) and fail after that. Worker restart determination is implemented using ConfigMaps in K8s when ZooKeeper is not used. Fault notifications are implemented for jobs without ZooKeeper.
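To illustrate the bounded re-execution behavior described above, here is a minimal, self-contained sketch. It is not Twister2 code: `RetrySketch`, `runWithRetries`, and `DEFAULT_MAX_ATTEMPTS` are hypothetical names, and the sketch assumes that "5 times" means five attempts in total, which the description does not spell out.

```java
import java.util.function.IntPredicate;

// Hypothetical illustration only; none of these names come from Twister2.
class RetrySketch {

    // Assumed default, mirroring the "5 times by default" in the description.
    static final int DEFAULT_MAX_ATTEMPTS = 5;

    /**
     * Runs the given attempt up to maxAttempts times.
     * The attempt receives the 1-based attempt number and returns true on success.
     * Returns the attempt number that succeeded, or -1 if every attempt failed.
     */
    static int runWithRetries(int maxAttempts, IntPredicate attempt) {
        for (int n = 1; n <= maxAttempts; n++) {
            if (attempt.test(n)) {
                return n; // execution completed, stop retrying
            }
            // A real worker would restore state from the last checkpoint here
            // before re-executing; this sketch just moves on to the next attempt.
        }
        return -1; // retries exhausted, the worker gives up
    }

    public static void main(String[] args) {
        // Simulated job that fails twice and then succeeds.
        int result = runWithRetries(DEFAULT_MAX_ATTEMPTS, n -> n == 3);
        System.out.println("recovered job result: " + result);

        // Simulated job that never succeeds.
        int failed = runWithRetries(DEFAULT_MAX_ATTEMPTS, n -> false);
        System.out.println("always-failing job result: " + failed);
    }
}
```

Passing the limit as a parameter mirrors the note that the default can be overridden through configuration; how Twister2 actually reads that value is not shown here.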

things to do:

supunkamburugamuve commented 4 years ago

@ahmet-uyar there is a conflict in the code.