DOCS - what is the mechanism behind mist workers fault tolerance?

sayak-ghosh1990 commented 6 years ago

The driver could be recovered from failure if we could submit the job as deploy-mode cluster using checkpoint mechanism.But using MIST, is it possible to recover driver process even if we create the job in client mode to the MIST with worker-runner mode local and in this kind of scenario how does MIST server react?

dos65 commented 6 years ago

Sorry for the late reply. I'll try to cover this question here and then update documentation.

Actually, checkpoints usage don't provide the fault-tolerance mechanism. In case of job failure they help to re-run job and reuse previous computations to speed up its execution.

Mist treats every function execution as a job. Before starting job execution mist master queues them and waits when there will be a free worker for a context where the job should be executed. There are two worker-modes for contexts: exclusive and shared. For exclusive worker-mode mist starts a new worker per every job and shuts it down after its completion. For shared mist doesn't shutdown workers right after job completion, it reuses them to execute next jobs that should be executed in the same context. The fault-tolerance mechanism here that if a worker was crushed mists tries to restart it to invoke the next job. If workers were crushed more than maxConnFailures (context option) in sequence this context marks as broken and fails all jobs until it will be updated.

sayak-ghosh1990 commented 6 years ago

Thanks for ur detailed response.

Hydrospheredata / mist

DOCS - what is the mechanism behind mist workers fault tolerance? #509