dibbhatt / kafka-spark-consumer

High Performance Kafka Connector for Spark Streaming. Supports Multi Topic Fetch, Kafka Security. Reliable offset management in Zookeeper. No data loss. No dependency on HDFS and WAL. In-built PID rate controller. Supports Message Handler. Offset Lag checker.
Apache License 2.0

Want to create more than one application but use a single Kafka stream #20

Closed sorabh89 closed 9 years ago

sorabh89 commented 9 years ago

Hi,

I have to use a Kafka stream for different purposes, but I don't want to use different Kafka consumers for it. Is there a way to achieve this?

Will a Spark cluster help me with this?

Please help. Thanks,

dibbhatt commented 9 years ago

I did not understand the question completely. Do you mean that you want to consume the Kafka stream but not process it using Spark? In that case, you can use the Kafka API directly for your purpose.

sorabh89 commented 9 years ago

No, I want to use Spark to consume the Kafka stream, but I have to use the same Kafka stream for multiple projects.

jedisct1 commented 9 years ago

That shouldn't be a problem, as long as your other projects are using a different consumer group.
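
For example, a minimal sketch using the standard Kafka 0.8 high-level consumer settings (the host and group names here are illustrative):

```java
import java.util.Properties;

// Two independent applications, each with its own consumer group:
// both read the full topic, and each tracks its own offsets.
Properties projectA = new Properties();
projectA.put("group.id", "project-A");            // consumer group for project A
projectA.put("zookeeper.connect", "zkhost:2181"); // illustrative ZooKeeper address

Properties projectB = new Properties();
projectB.put("group.id", "project-B");            // consumer group for project B
projectB.put("zookeeper.connect", "zkhost:2181");
```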

sorabh89 commented 9 years ago

That is the thing: I don't want to use different consumer groups. I want a single consumer that consumes the stream and then duplicates it (or something similar), and then I want each of my applications to use one copy of that stream.

dibbhatt commented 9 years ago

There are a few disadvantages to doing this. If your different applications consume from the same stream, it will be difficult for you to handle failures, replay, re-processing, etc. Let's say applications A, B, and C process from the same channel. Now say you have modified your business logic in App B and want to replay the whole stream: A and C will then process the same messages again.

Say some failure happened in C and you want to start from offset X again. A and B would then reprocess the same messages as well.

What is the constraint against using individual streams? Resources? This receiver can now even run on a single core for a given topic, and you can control the parallelism for each stream if you configure the settings that way. Say application A needs more parallelism: you give it more receivers, while B does not need more receivers (see the sketch below).
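
A rough sketch of what that could look like, assuming the `ReceiverLauncher.launch` API described in this project's README; the streaming contexts (`jscA`, `jscB`) and the `Properties` objects are placeholders that each driver application would build itself:

```java
import java.util.Properties;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import consumer.kafka.MessageAndMetadata;
import consumer.kafka.ReceiverLauncher;

// Application A needs more parallelism, so it launches more receivers.
int receiversForA = 3;
JavaDStream<MessageAndMetadata> streamA =
        ReceiverLauncher.launch(jscA, propsA, receiversForA, StorageLevel.MEMORY_ONLY());

// Application B is lighter; one receiver (a single core) is enough.
int receiversForB = 1;
JavaDStream<MessageAndMetadata> streamB =
        ReceiverLauncher.launch(jscB, propsB, receiversForB, StorageLevel.MEMORY_ONLY());
```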

sorabh89 commented 9 years ago

Thanks Dibbhatt,

Actually, my requirement is to use only the last half hour's data from the stream, and, most importantly, the data for all the applications should be exactly the same. So even if I find an error in application A and I change the offset, I want the same change to take place for B and C. That is the reason I want to use a single stream of data and then create three streams from it, with the same data in all three, for applications A, B, and C.

I'm new to Spark, so I'm not sure if I can achieve this with it. Please let me know if Spark can help me here.

Thanks & Regards,

dibbhatt commented 9 years ago

Well, within the same stream, for each RDD generated, you apply the different logic for A, B, and C on the same RDD.
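
A minimal sketch of that pattern, assuming a stream obtained from `ReceiverLauncher` as above and Spark's Java 8 lambda support; the `processForX` helpers are hypothetical placeholders for each application's logic:

```java
// One stream; the logic for A, B, and C is applied to every batch.
JavaDStream<MessageAndMetadata> stream =
        ReceiverLauncher.launch(jsc, props, numberOfReceivers, StorageLevel.MEMORY_ONLY());

stream.foreachRDD(rdd -> {
    rdd.cache();        // cache once so the three passes don't recompute the batch
    processForA(rdd);   // hypothetical: application A's logic
    processForB(rdd);   // hypothetical: application B's logic
    processForC(rdd);   // hypothetical: application C's logic
    rdd.unpersist();
});
```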

sorabh89 commented 9 years ago

In that case, the only problem is that if I make changes to, say, application A, I will have to redeploy it, which will also interrupt the execution of applications B and C. I am looking for a solution in which I replicate the stream in, say, application A, and the copies of the same stream can then be used in applications B and C.

dibbhatt commented 9 years ago

You want to launch a different StreamingContext from each driver program and still consume from the same stream? I am not sure that is possible. I do not see what the issue is with having three different DStreams via three ReceiverLaunchers in three driver applications. As all three will consume from the same Kafka topic, you will process the SAME data, and you get your multi-tenancy kind of feature, where one stream won't impact another. Something like the sketch below.
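
A hedged sketch of that three-driver setup, assuming the `ReceiverLauncher` API and the ZooKeeper-based property names from this project's README (all values are illustrative; each driver application runs this independently, changing only its app name and consumer id):

```java
import java.util.Properties;
import org.apache.spark.SparkConf;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import consumer.kafka.MessageAndMetadata;
import consumer.kafka.ReceiverLauncher;

public class AppADriver {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("app-A"); // app-A / app-B / app-C
        JavaStreamingContext jsc = new JavaStreamingContext(conf, new Duration(30000));

        Properties props = new Properties();
        props.put("zookeeper.hosts", "zkhost");  // illustrative values; see the README
        props.put("zookeeper.port", "2181");
        props.put("kafka.topic", "shared-topic");
        props.put("kafka.consumer.id", "app-A"); // distinct per application, so offsets are independent

        JavaDStream<MessageAndMetadata> stream =
                ReceiverLauncher.launch(jsc, props, 1, StorageLevel.MEMORY_ONLY());
        // ... apply this application's own logic to `stream` ...

        jsc.start();
        jsc.awaitTermination();
    }
}
```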

sorabh89 commented 9 years ago

Thanks Dibbhatt,

I also tried to find other options for the same, but it seems the most convenient way to do this is to use different streams.

Thanks for your explanation.