rahulpalamuttam closed this 8 years ago
@rahulpalamuttam This is an awesome addition.
I was able to test and get this working. This is amazing, we can now start the crawl somewhere on the cluster and stream out the crawl data to our applications.
Just one small modification is necessary before the merge:
Currently, the Kafka topic used is `sparkler`. Could you update the code to use `sparkler/<jobid>` as the topic name for publishing the output?
Suggestion: you may put `sparkler/%s` as a config value in the config file, and then at run time just format `%s` with the jobid (that way users can change it to anything they want).
FYI, to get the job id, just call the getter method on the `SparklerJob` object.
@thammegowda Kafka doesn't allow forward slashes in topic names. Take a look at line 29 of the Kafka source here: https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/common/Topic.scala#L29
How about `sparkler_%s`? We can't use just `%s` as a YAML property, as YAML doesn't like that.
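For reference, a minimal sketch of that constraint and the `sparkler_%s` convention; the class and method names here are illustrative, not from the PR or from Kafka itself:

```java
import java.util.regex.Pattern;

// Sketch of the rule in Kafka's Topic.scala (linked above): topic names
// may only contain alphanumerics, '.', '_' and '-', which is why the
// '/' in "sparkler/<jobid>" is rejected.
public class TopicNames {
    private static final Pattern LEGAL = Pattern.compile("[a-zA-Z0-9._-]+");

    static boolean isLegal(String topic) {
        return LEGAL.matcher(topic).matches();
    }

    // The "sparkler_%s" convention suggested above, formatted with the job id.
    static String forJob(String jobId) {
        return String.format("sparkler_%s", jobId);
    }

    public static void main(String[] args) {
        System.out.println(isLegal("sparkler/job-1")); // false: '/' is illegal
        System.out.println(isLegal(forJob("job-1")));  // true: underscore is fine
    }
}
```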
Awesome PR. Thanks @rahulpalamuttam
Addresses issue #6
Details about the PR:
To get the PR working, you need to set up Kafka and have it listen on the default port 9092 on localhost. You can follow the instructions here (starting from step 2 if you already have Java installed): http://www.tutorialspoint.com/apache_kafka/apache_kafka_installation_steps.htm
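To make the setup above concrete, here is a minimal sketch of the client-side configuration a producer would use against that broker, assuming the stock string serializers shipped with kafka-clients (the class name `ProducerProps` is illustrative):

```java
import java.util.Properties;

// Minimal producer configuration matching the setup above: a Kafka
// broker listening on the default port 9092 on localhost.
public class ProducerProps {
    static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        return props;
    }

    public static void main(String[] args) {
        // new KafkaProducer<>(build()) would then publish crawl output;
        // omitted here because it needs a running Kafka instance.
        System.out.println(build().getProperty("bootstrap.servers"));
    }
}
```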
To verify the dumps, open up a listener on the topic `sparkler` in a terminal.
When running the crawl, the Kafka producer prints its configuration properties, which is quite a large string. Do we want to turn that off? @thammegowda @karanjeets
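If we do want to silence it: that dump comes from the Kafka client logging its config at INFO level, so a log4j override along these lines should suppress it (the logger name is an assumption about the kafka-clients version in use; a sketch, not verified against this PR's setup):

```properties
# Raise Kafka client classes to WARN so ProducerConfig's INFO-level
# dump of its properties is not printed. "org.apache.kafka" is the
# package prefix of the Kafka client classes.
log4j.logger.org.apache.kafka=WARN
```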