TimelyDataflow / timely-dataflow

A modular implementation of timely dataflow in Rust
MIT License
3.26k stars 271 forks source link

Distributed merge sort example #41

Open rw opened 7 years ago

rw commented 7 years ago

I'm having fun learning timely-dataflow. To do so, I've mainly been reading @frankmcsherry's blog posts and essays.

Repeatedly, I've found myself wanting to know how a distributed merge sort would work in timely-dataflow. In particular, I want to know how to say, "keep reading until RAM is almost full, then sort the batch and emit the tuples". Are there any examples of this?

byronyi commented 7 years ago

Check out external sorting.

rw commented 7 years ago

@byronyi I know what external sorting is, that's how I knew to make this issue...

I'm asking for an example in timely-dataflow.

frankmcsherry commented 7 years ago

Hello, sorry for the delayed response (I'm on vacation at the moment).

Timely itself doesn't expose information about how much memory is still available, though you may be able to hook parts of whichever OS you are using. You should be able to write something that has the general structure "read input (from e.g. disk) until condition met, sort + send data, repeat", though you'll also need to write the "receive data" logic too, as all of the parts of the dataflow run concurrently. Unlike other systems, there isn't a built in unlimited-sized buffer backed by disk (or rather, it is called "virtual memory" here); this is good for low-latency streaming, and bad for "batch abstraction where I don't want to think about buffering".

I can try and get you an example, but for the next few weeks I'm not consistently online (as you have seen). I apologize for the unavailability at a time when it would be most helpful. Let me try to write something up over the next few days and post it back!