bigframeteam / BigFrame

BigFrame is not a single benchmark; it is a benchmark generator!

Implement data refreshing driver #7

Open andyhehk opened 11 years ago

andyhehk commented 11 years ago

We need a data refreshing driver for the continuous workflow. Tentatively, we can use Kafka as the data stream producer.

Each system needs its own implementation to consume the streaming data.
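To make that concrete, here is a minimal sketch of the consumer side in Scala, assuming today's Kafka Java client (org.apache.kafka:kafka-clients, which postdates this thread); the topic name, group id, and loop body are placeholders, not BigFrame's actual code:

```scala
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._

object RefreshConsumerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("group.id", "spark-engine") // one consumer group per engine
    props.put("key.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer",
      "org.apache.kafka.common.serialization.StringDeserializer")

    val consumer = new KafkaConsumer[String, String](props)
    consumer.subscribe(Collections.singletonList("bigframe-refresh"))

    // Each engine (Hadoop, Spark, Vertica, HANA, ...) would replace this
    // loop body with its own ingestion logic.
    while (true) {
      for (rec <- consumer.poll(Duration.ofMillis(500)).asScala)
        println(s"${rec.topic()} -> ${rec.value()}")
    }
  }
}
```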

mayureshkunjir commented 11 years ago

Kafka would be a good choice. But we could also write new appends to a new HDFS file if that is easier. Spark, for example, can periodically 'watch' an HDFS directory for new additions, which lets it run the registered continuous queries on input from newly added files. Of course, it can consume data from Kafka as well.
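For reference, the directory-watching feature mentioned here is Spark Streaming's textFileStream. A minimal Scala sketch, with an illustrative path and batch interval:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object HdfsWatchSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("bigframe-hdfs-watch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Each batch interval, Spark picks up files newly created under this
    // directory, so a registered continuous query sees the fresh appends.
    val fresh = ssc.textFileStream("hdfs:///bigframe/refresh")
    fresh.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```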

andyhehk commented 11 years ago

I chose Kafka for simplicity: (1) it makes distributed message streaming easier; (2) we can easily customize the behavior of different consumers (Hadoop, Spark, Vertica, HANA, etc.), since not all engines use HDFS to store data.

Of course, there may be options better than Kafka. We can also make this part extensible.
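One way to keep it extensible, sketched in Scala: put the transport behind a small trait so Kafka is just the default implementation. The interface and names here are hypothetical, not BigFrame's actual API:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Hypothetical transport-neutral interface for the refreshing driver; an
// HDFS-append or other implementation could be swapped in later.
trait StreamConnector {
  def send(topic: String, record: String): Unit
  def close(): Unit
}

// Default Kafka-backed implementation.
class KafkaConnector(brokers: String) extends StreamConnector {
  private val props = new Properties()
  props.put("bootstrap.servers", brokers)
  props.put("key.serializer",
    "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer",
    "org.apache.kafka.common.serialization.StringSerializer")
  private val producer = new KafkaProducer[String, String](props)

  def send(topic: String, record: String): Unit =
    producer.send(new ProducerRecord[String, String](topic, record))

  def close(): Unit = producer.close()
}
```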

shivnath commented 11 years ago

We can't go wrong with Kafka. People do use HDFS or, more commonly, NoSQL systems like HBase and Cassandra to absorb high volumes of small writes today, but HDFS is arguably not good for the small appends we will get from data streams. The use of HDFS in Cyclops was mostly an implementation convenience.


andyhehk commented 11 years ago

I just implemented a preliminary data refreshing driver using Kafka.

In our continuous workflow, only the relational data and the tweet data will be updated. The graph data will stay static while the workflow is running.

Actually, I have thought about two ways to implement the data driver, but I have given up on method (1) for now:

(1) Like the Linear Road benchmark, we can pre-generate a data set that simulates a stream of events over a long period, then replay those events at run time.

For example, in our workflow we could simulate a period, say 3 hours, in which users buy products and post tweets.

Linear Road uses an L-factor (L is the number of highways) to control the scale. We could propose a similar C-factor (C is the number of cities; each city has the same number of citizens, products bought per second, and tweets sent per second).

The disadvantages of this method are:

(a) For each C-factor, we need to pre-generate its corresponding data set: a Twitter graph containing all the people in C cities, plus the offline events of shopping and tweeting. This is not very flexible, especially when we want to test the largest C-factor a system can handle.

(b) The implementation is also a problem: I haven't found a good way to stream out the data in parallel, mainly because of the difficulty of replaying the events in the same order they were generated. With a single process it is easy to control the order in which events are sent, but the performance is not acceptable: each event is fairly large (a single tweet is about 1.4 KB, so even 10,000 tweets per second would mean roughly 14 MB/s from one sender), and the number of events one process can push per second is rather small.
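To make the ordering half of (b) concrete, here is a Scala sketch (with a hypothetical Event type) of the single-process replay: events pre-sorted by timestamp are sent one at a time, which preserves the global order but caps throughput at what one sender can push:

```scala
object ReplaySketch {
  // Hypothetical pre-generated event, assumed sorted by timestamp.
  case class Event(timestampMs: Long, payload: String)

  def replay(events: Seq[Event], send: String => Unit): Unit =
    events.headOption.foreach { first =>
      val wallStart = System.currentTimeMillis()
      for (e <- events) {
        // Sleep until this event's scheduled time relative to replay start.
        val due  = wallStart + (e.timestampMs - first.timestampMs)
        val wait = due - System.currentTimeMillis()
        if (wait > 0) Thread.sleep(wait)
        send(e.payload) // single sender => ordered, but limited throughput
      }
    }
}
```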

Therefore, I have adopted a much simpler approach for now:

(2) The user only needs to do the data preparation once, which generates a graph with about 10,000,000 nodes and a set of TPC-DS update records.

At run time, the user can launch some number A of data producers on different machines. Each data producer generates about 100 sales per second and 10,000 tweets per second. These producers work independently; the user can add more, or kill some of them, at any time.
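A rough sketch of one such producer in Scala, pacing itself once per second toward those rates; the topic names and record contents are placeholders:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object DataProducerSketch {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // Each independent producer targets ~100 sales/s and ~10,000 tweets/s.
    while (true) {
      val start = System.currentTimeMillis()
      for (i <- 1 to 100)
        producer.send(new ProducerRecord[String, String]("sales", s"sale-$i"))
      for (i <- 1 to 10000)
        producer.send(new ProducerRecord[String, String]("tweets", s"tweet-$i"))
      // Sleep off the remainder of the second to hold the target rate.
      val elapsed = System.currentTimeMillis() - start
      if (elapsed < 1000) Thread.sleep(1000 - elapsed)
    }
  }
}
```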

We measure the performance of a system by how many data producers it can handle, and we use an A-factor to represent the result. A stands for the active citizen ratio: the more active the citizens, the more sales and tweets are generated per second.

Still, I think (1) makes more sense. I can implement it later if I find a good solution.

Any comments?


shivnath commented 11 years ago

I like the basic model of (2); we can always fine-tune the abstraction and the details. That is the TPC-C/YCSB model. Arguably, stream processing is similar to OLTP.

Andy: Why do you prefer (1) over (2)?

