USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0
411 stars 143 forks

Add XML-based Sparkler Configuration #13

Closed: karanjeets closed this issue 8 years ago

karanjeets commented 8 years ago

Add Sparkler Configuration through sparkler-default.xml and sparkler-site.xml

karanjeets commented 8 years ago

Completed as per commit 6f2410b1b9cfc1e7ff5f2418f4eb3171915a6b26

thammegowda commented 8 years ago

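@karanjeets Reopening the issue because the Configuration has to be Java-serializable. Hadoop's Configuration does not implement java.io.Serializable, so we have trouble using it directly in Spark, which has to serialize any object referenced in a closure before sending it to the executors.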

We may need to write our own implementation. See the related FIXME at https://github.com/USCDataScience/sparkler/blob/365fc934ad0aad7f2dab1a2334f05bbdf0fd368d/sparkler-app/src/main/scala/edu/usc/irds/sparkler/model/SparklerJob.scala#L47

Let's discuss this further.
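
As a starting point for that discussion, here is a minimal sketch of what "our own implementation" could look like. It is not Sparkler's actual code: the class name `SerializableConf`, its methods, and the property keys in `main` are all hypothetical. It assumes the XML files follow the Hadoop-style `<property><name>…</name><value>…</value></property>` layout and keeps the values in a plain `HashMap`, which is Java-serializable. The `roundTrip` helper only approximates what happens when Spark ships an object to an executor with Java serialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical class, not part of Sparkler: a plain key/value configuration
// that can be serialized because it implements java.io.Serializable.
class SerializableConf implements Serializable {
    private static final long serialVersionUID = 1L;
    private final HashMap<String, String> props = new HashMap<>();

    // Reads <property><name>..</name><value>..</value></property> entries.
    // Later resources override earlier ones, so load defaults first, then site.
    void addResource(InputStream xml) throws Exception {
        NodeList nodes = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(xml).getElementsByTagName("property");
        for (int i = 0; i < nodes.getLength(); i++) {
            Element p = (Element) nodes.item(i);
            String name = p.getElementsByTagName("name").item(0).getTextContent().trim();
            String value = p.getElementsByTagName("value").item(0).getTextContent().trim();
            props.put(name, value);
        }
    }

    String get(String key, String defaultValue) {
        return props.getOrDefault(key, defaultValue);
    }

    // Serializes and deserializes the object in memory, roughly what happens
    // when an object captured in a Spark closure is sent to an executor.
    static SerializableConf roundTrip(SerializableConf conf)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(conf);
        }
        try (ObjectInputStream in = new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray()))) {
            return (SerializableConf) in.readObject();
        }
    }

    // Test helper: wraps an XML string as a stream in place of a real file.
    static InputStream xml(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        SerializableConf conf = new SerializableConf();
        // Stand-ins for sparkler-default.xml and sparkler-site.xml; the keys are made up.
        conf.addResource(xml("<configuration><property><name>demo.threads</name>"
                + "<value>4</value></property></configuration>"));
        conf.addResource(xml("<configuration><property><name>demo.threads</name>"
                + "<value>16</value></property></configuration>"));
        SerializableConf copy = roundTrip(conf);
        System.out.println(copy.get("demo.threads", "1"));
    }
}
```

One trade-off of this approach is that it drops Hadoop Configuration features such as variable substitution and final properties; whether that matters depends on how much of that behaviour Sparkler relies on.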