TimelyDataflow / timely-dataflow

A modular implementation of timely dataflow in Rust

How to merge data of workers after execution? #275

Open zy1994-lab opened 5 years ago

zy1994-lab commented 5 years ago

I am learning Timely now. I have run it in a distributed environment. After each worker receives and processes its data, I want to merge the data across workers on each machine at the end of the dataflow execution. Does anyone know how to do this?

comnik commented 5 years ago

You can use the exchange operator (e.g. as described here) to send all outputs to a specific worker.

zy1994-lab commented 5 years ago

> You can use the exchange operator (e.g. as described here) to send all outputs to a specific worker.

Thanks for your answer @comnik . I am considering using the exchange operator. But as I am processing the dataflow in batches, I want to cache the results in a vec and exchange them to the target worker after the last round of the dataflow. I didn't find a way to identify the last round of the dataflow. Is it possible to do it in Timely?

frankmcsherry commented 5 years ago

What is arguably more idiomatic is to exchange the data as they arrive (it is a streaming system!) and then to use the probe operator to be informed when the stream is complete. The receiving worker can then consolidate the results it receives into a vector, or whichever other data structure would help.

Another possibly helpful example of this is how differential dataflow uses the Arranged structure and its trace (https://youtu.be/83rG471bmw8?t=1730), which, if you are using differential dataflow, allows you to exfiltrate the data from the dataflow and then interactively navigate the results at any point (including at the end of the computation).

I'm guessing it might be helpful for me to write (as a demo, possibly as a library method) a sink operator that perhaps takes a FnOnce<Vec<Data>> which gets called at the end of the computation at each worker, with whichever data ended up there. This probably leaves a bunch of performance on the floor (e.g. processing data as it arrives) but at least would give an example of how to write this.

zy1994-lab commented 5 years ago

Hi Frank, a sink operator will definitely help me (and probably many other users) a lot. I totally agree with your point that Timely is a streaming system. But because of its high performance, I am using it for analytics tasks that used to be done with frameworks like Spark. I have encountered multiple cases in analytics tasks where you need to perform some operation at the end of the computation at each worker. Currently I put such operations in the Drop impl of some pre-defined structures, which are moved into the dataflow, so that they execute at the end. However, this doesn't always fit my requirements and is very hard to write/read. So adding a sink operator would be very good in my opinion!