TimelyDataflow / timely-dataflow

A modular implementation of timely dataflow in Rust

[Docs] Clarify failure modes of multiple workers and clusters #319

Open · tarqd opened this issue 4 years ago

tarqd commented 4 years ago

After reading through the docs I'm a little unclear on what happens if a worker fails (a process/thread dies, a network partition occurs, etc.).

Main concerns:

If this is documented and I've simply missed it, I apologize in advance :)

frankmcsherry commented 4 years ago

The intended behavior is that everything panics. This has not always been the case, but there is code in place to intentionally propagate panics across worker and communication threads, and across the network. If one worker crashing leaves the system in a "hung" state, that's a bug.
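To see this behavior firsthand, here is a minimal sketch (assuming a recent timely release and its standard `execute_from_args` entry point) in which one worker panics deliberately; run with `-w2`, the intent is that the sibling worker thread also panics rather than hanging:

```rust
// Minimal sketch: one worker panics, and the panic should propagate
// to the other worker threads instead of leaving them hung.
use timely::dataflow::operators::{Inspect, ToStream};

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let index = worker.index();
        worker.dataflow::<u64, _, _>(|scope| {
            (0..10)
                .to_stream(scope)
                .inspect(move |x| println!("worker {} saw {:?}", index, x));
        });
        // Simulate a failure in one worker; with `-w2`, the other
        // worker should panic too rather than wait forever.
        if index == 0 {
            panic!("simulated worker failure");
        }
    })
    .unwrap();
}
```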

There is no story for recovering partial data from live workers. It's a legit thing to think about.

The general "philosophy" has been that fault-tolerance and failure recovery are for a higher-level framework to support (one that has more opinions on what constitutes "correct behavior"). The analogy here is that your OS is some degree less "fault tolerant" than the DB that lives on top of it, because the former is less clear on what it should do when something goes wrong. Same thing here (differential dataflow would be the saner place to talk about building in fault tolerance and recovery of state).

I'll leave the issue open because I'm guessing the docs don't reflect these expectations.

tarqd commented 4 years ago

Thank you for clarifying! I completely agree that fault tolerance is outside the scope of this library; arguably it'd be pretty hard to build even into differential-dataflow (sane behavior is very dependent on your dataflows, I'd imagine).

Can you override the "panic!" behavior from a higher-level library? With that, combined with a custom exchange operator, it seems like you could implement anything you want.
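For what it's worth, one place a higher-level harness could already observe failures today is the guards returned by `timely::execute_from_args`: a panicked worker thread surfaces as an `Err` when the guards are joined, so the wrapping process can decide how to react instead of dying. A rough sketch (the recovery policy itself is hypothetical):

```rust
// Rough sketch: catching worker panics at the harness level via the
// WorkerGuards that timely::execute_from_args returns. What to do on
// failure (restart, rebalance, give up) is the higher-level library's
// policy and is left out here.
fn main() {
    let guards = timely::execute_from_args(std::env::args(), |worker| {
        if worker.index() == 1 {
            panic!("simulated failure");
        }
        worker.index()
    })
    .expect("failed to start computation");

    // join() yields one Result per worker thread; a panicked worker
    // shows up as an Err instead of aborting this process.
    for outcome in guards.join() {
        match outcome {
            Ok(index) => println!("worker {} exited cleanly", index),
            Err(message) => eprintln!("worker failed: {}", message),
        }
    }
}
```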

I can imagine, for certain use cases, a world where workers could implement a full/incremental state transfer similar to what Galera does when a node rejoins the cluster.

For example, imagine dataflows implementing a distributed policy/decision engine, where workers could advertise the ability to evaluate a subset of policies.

With some (read: a lot of) effort, you could create a system that lets you dynamically add workers to the cluster and replicate the state required to evaluate policies based on load, perhaps dynamically scheduling/replicating policy data to nodes as you go.
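Purely to illustrate the routing half of that idea, here is a hedged sketch that keys work to workers with timely's stock `Exchange` operator; the `policy_id` field and the route-by-id rule are stand-ins for a real capability/advertisement table, which nothing here actually provides:

```rust
// Illustrative sketch only: route policy-evaluation requests to the
// worker responsible for each policy. The (policy_id, payload) pairs
// and the routing-by-id rule are hypothetical stand-ins for a real
// placement table built from worker advertisements.
use timely::dataflow::operators::{Exchange, Inspect, ToStream};

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let index = worker.index();
        worker.dataflow::<u64, _, _>(|scope| {
            (0u64..8)
                .map(|i| (i, format!("request-{}", i)))
                .to_stream(scope)
                // Send each request to the worker owning its policy;
                // timely routes by this u64 modulo the worker count.
                .exchange(|(policy_id, _)| *policy_id)
                .inspect(move |(policy_id, request)| {
                    println!("worker {} evaluates policy {} ({})", index, policy_id, request)
                });
        });
    })
    .unwrap();
}
```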