Open pooya opened 10 years ago
That sounds like Inferno: https://github.com/chango/inferno. A rule can check for files by tag and run the MapReduce job when enough have accumulated.
Yes, but Inferno offers nothing for ETL, and it does not use the modern pipelines available in Disco 0.5+.
This issue will track the progress of Disco Streaming.
Add support for streaming data into Disco jobs. Currently, when a Disco job starts, it knows all of its inputs and schedules tasks based on them. An alternative would be to accept a stream of inputs and create new tasks as the inputs become ready.
The idea has been tackled by Spark before: http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf
There are many projects (Kafka, Flume, Storm) that excel at streaming data from one location to another. The best outcome of this project would be to integrate one such streaming tool. However, a simpler approach might be to use the Disco Distributed File System (DDFS) as intermediate storage: input is pushed into DDFS, the tag is monitored, and a new task is created as soon as "enough" input is available.
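The tag-monitoring logic could be sketched roughly like this. This is only an illustration, not Disco code: the `snapshots` iterable and `threshold` parameter are hypothetical stand-ins, where each snapshot would in practice come from polling the blobs under a DDFS tag, and each yielded batch would become the input set of a newly created task.

```python
def batches(snapshots, threshold):
    """Given successive listings of a DDFS tag's blobs, yield each batch
    of previously unseen blobs once at least `threshold` of them have
    accumulated. Blobs already handed out are remembered in `seen` so a
    blob is never dispatched to more than one task."""
    seen = set()
    for listing in snapshots:
        pending = [blob for blob in listing if blob not in seen]
        if len(pending) >= threshold:
            seen.update(pending)
            yield pending
```

In a real monitor, `snapshots` would be an infinite generator that sleeps between polls of the tag, and each yielded batch would trigger task creation in the running job.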