TimelyDataflow / timely-dataflow

A modular implementation of timely dataflow in Rust
MIT License
3.29k stars 272 forks source link

How do you minimize/optimize data movement? #160

Open JosephWagner opened 6 years ago

JosephWagner commented 6 years ago

I'm not sure best how to phrase this question, so let me start with a concrete example:

Let's say I have a timely dataflow computation spread over 2 machines. I have records uniquely identified by (foo, bar, baz) tuples. In the dataflow graph, I first use exchange to group by foo and bar together, and then later I need to group by just bar. Ideally I'd like to minimize data movement across machines. So if I'm grouping by foo and bar, ideally every unique bar ends up on one of the 2 machines so that grouping by bar later requires no movement across machines.

Does that make sense? Is that type of optimization supported?

frankmcsherry commented 6 years ago

You are able to use exchange to place the data however you like, which sounds like it will probably be by bar. With the data distributed by bar, you now have the guarantee that (foo, bar) keys are also at the same workers, and so operations that need to group by (foo, bar) will be correct if executed without further data exchange.

Now, there's nothing in timely that protects you from first exchanging by bar, then calling some method that re-exchanges things by (foo, bar); perhaps you actually want to re-shuffle the data to balance the load; we don't know and aren't going to prevent it. So to get this to do the thing you want, you'll need to be careful to avoid re-shuffling the data, and that might be tricky if you are hoping to rely on certain pre-fab timely operators; they generally do not have variants that say "trust me; just run w/o any data exchange".