discoproject / disco

a Map/Reduce framework for distributed computing
http://discoproject.org
BSD 3-Clause "New" or "Revised" License

Add support for streaming #579

Open · pooya opened this issue 10 years ago

pooya commented 10 years ago

This issue will track the progress of Disco Streaming.

Add support for streaming data into Disco jobs. Currently, when a Disco job starts, it knows all of its inputs and schedules tasks based on them. An alternative would be to accept a stream of inputs and create new tasks as the inputs become ready.
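As a rough sketch of the contrast (the first call follows the classic Disco tutorial API; the `add_input`/`close_input` names are purely hypothetical and only illustrate the proposed behaviour):

```python
from disco.core import Job

def wc_map(line, params):
    # ordinary Disco map function: emit (word, 1) pairs
    for word in line.split():
        yield word, 1

# Today: the full list of inputs must be known when the job is submitted.
job = Job().run(input=["tag://data:batch1", "tag://data:batch2"], map=wc_map)

# Proposed: submit the job open-ended and attach inputs as they arrive,
# scheduling new tasks for each one (hypothetical API, not in Disco today):
#
#   job = Job().run(input=[], map=wc_map)
#   job.add_input("tag://data:batch3")   # hypothetical
#   job.close_input()                    # hypothetical
```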

Spark has tackled this idea before with Spark Streaming: http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf

There are a lot of projects (Kafka, Flume, Storm) that excel at streaming data from one location to another. The best outcome of this project would be to integrate one such streaming tool. However, a simpler approach might be to use the Disco Distributed File System (DDFS) as intermediate storage: input is pushed into DDFS, the tag is monitored, and a new task is created as soon as "enough" input is available.
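A minimal sketch of that simpler approach, assuming the DDFS Python client's `blobs(tag)` iterator and the classic Job API; the tag name, polling interval, and "enough input" threshold are made-up placeholders:

```python
from time import sleep
from disco.core import Job
from disco.ddfs import DDFS

TAG = "data:incoming"   # tag that producers push blobs to (placeholder)
BATCH = 10              # "enough" input: launch a job per 10 new blobs

def wc_map(line, params):
    for word in line.split():
        yield word, 1

ddfs = DDFS()
seen = 0
while True:
    # blobs(tag) is assumed to iterate over the replica URL sets currently
    # stored under the tag; anything past `seen` is new, unprocessed input.
    blobs = list(ddfs.blobs(TAG))
    if len(blobs) - seen >= BATCH:
        new_inputs = [replicas[0] for replicas in blobs[seen:]]
        Job().run(input=new_inputs, map=wc_map)
        seen = len(blobs)
    sleep(30)
```

Note that this external polling loop launches a new job per batch; the proposal above goes further by adding new tasks to a single running job.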

reevapp commented 9 years ago

That sounds a bit like Inferno: https://github.com/chango/inferno. A rule can check for files by tag and run the MapReduce job when enough data has accumulated.
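For context, an Inferno rule looks roughly like the keyset example in its README; the field names below, in particular the blob-count threshold, are best-effort recollections and should be checked against the Inferno docs:

```python
from inferno.lib.rule import chunk_json_keyset_stream, InfernoRule

RULES = [
    InfernoRule(
        name='count_votes',
        source_tags=['incoming:votes'],        # DDFS tags to watch (placeholder)
        map_input_stream=chunk_json_keyset_stream,
        key_parts=['candidate'],
        value_parts=['votes'],
        # In daemon/automatic mode the rule is meant to fire once "enough"
        # blobs have accumulated under the source tags; this parameter name
        # is an assumption, not verified against the Inferno source.
        min_blobs=64,
    ),
]
```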

pooya commented 9 years ago

Yes, but Inferno has nothing to offer for ETL, and it also does not use the modern pipelines available in Disco 0.5+.