USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

[SPARKLER 6] Kafka Connector Data Sink #28

Closed rahulpalamuttam closed 8 years ago

rahulpalamuttam commented 8 years ago

Addresses issue #6

Details about the PR:

To test this PR: you need Kafka set up and listening on its default port, 9092, on localhost. If you already have Java installed, you can follow the instructions here starting from step 2: http://www.tutorialspoint.com/apache_kafka/apache_kafka_installation_steps.htm

To verify the dumps, open a console consumer on the "sparkler" topic in a terminal.
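The steps above can be sketched as follows (a sketch assuming a Kafka distribution unpacked at `$KAFKA_HOME`, following the tutorial linked above; adjust paths and the ZooKeeper address to your install):

```shell
# 1. Start ZooKeeper, then the Kafka broker (each in its own terminal).
#    The broker listens on localhost:9092 by default.
$KAFKA_HOME/bin/zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties
$KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties

# 2. Attach a console consumer to the "sparkler" topic to watch the crawl dumps.
$KAFKA_HOME/bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic sparkler
```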

When running the crawl, the Kafka producer logs its configuration properties, which is quite a large string. Do we want to turn that off? @thammegowda @karanjeets
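If the noise is the producer dumping its "ProducerConfig values: ..." block at startup, that is logged at INFO level, so one option (a sketch, assuming log4j is the logging backend in use) is to raise the threshold for just that logger in `log4j.properties`:

```properties
# Suppress the large "ProducerConfig values: ..." dump at producer startup
log4j.logger.org.apache.kafka.clients.producer.ProducerConfig=WARN
```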

thammegowda commented 8 years ago

@rahulpalamuttam This is an awesome addition.

I was able to test and get this working. This is amazing, we can now start the crawl somewhere on the cluster and stream out the crawl data to our applications.

Just one small modification is necessary before the merge: currently the Kafka topic used is `sparkler`. Could you update the code to use sparkler/&lt;jobid&gt; as the topic name for publishing the output? Suggestion: you could put sparkler/%s as a config value in the config file, and then at run time format %s with the job id (that way users can change it to anything they want).

FYI, to get the job id, just call the getter method on the SparklerJob object.
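The suggested approach could look roughly like this (a sketch only: the config key name `kafka.topic` and the shape of the config map are assumptions for illustration, not the actual Sparkler API):

```java
import java.util.HashMap;
import java.util.Map;

public class TopicNameDemo {

    /** Resolve the output topic by formatting the configured
     *  template (e.g. "sparkler/%s") with the job id at run time. */
    public static String resolveTopic(Map<String, String> conf, String jobId) {
        // "kafka.topic" is a hypothetical key name for this sketch
        String template = conf.getOrDefault("kafka.topic", "sparkler/%s");
        return String.format(template, jobId);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("kafka.topic", "sparkler/%s");
        // The job id would come from the SparklerJob getter at run time
        System.out.println(resolveTopic(conf, "my-job-id"));
    }
}
```

Keeping the template in the config file means users can rename the topic without touching code.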

rahulpalamuttam commented 8 years ago

@thammegowda Kafka doesn't allow forward slashes in topic names. Take a look at line 29 of the Kafka source here: https://github.com/apache/kafka/blob/trunk/core/src/main/scala/kafka/common/Topic.scala#L29

How about sparkler_%s? We can't use a bare %s as a YAML value, since YAML doesn't like that.
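A minimal check mirroring the rule in the `Topic.scala` linked above (topic names may contain only ASCII letters, digits, '.', '_' and '-', so '/' is rejected):

```java
public class TopicNameCheck {

    // Character class matching Kafka's legal topic-name pattern
    private static final String LEGAL = "[a-zA-Z0-9\\._\\-]+";

    public static boolean isLegalTopic(String name) {
        return name.matches(LEGAL);
    }

    public static void main(String[] args) {
        System.out.println(isLegalTopic("sparkler/my-job-id")); // false: '/' not allowed
        System.out.println(isLegalTopic("sparkler_my-job-id")); // true
    }
}
```

So sparkler_%s as the config template keeps the formatted name legal for any alphanumeric job id.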

thammegowda commented 8 years ago

Awesome PR. Thanks @rahulpalamuttam