USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Guide for sparkler and hdfs #25

Closed MuhammadTalhaAfzal closed 7 years ago

MuhammadTalhaAfzal commented 8 years ago

The guide says nothing about how to connect HDFS with Sparkler. In the notes for Apache Nutch users and developers it says:

Note 2 (Crawled content): Sparkler can produce the segments on HDFS, trying to keep them compatible with the Nutch content format.

Please share the steps: how is this done?

karanjeets commented 8 years ago

@MuhammadTalhaAfzal:

The capability already exists in Sparkler. If you look at Crawler.scala, line 174, you will see that we save the crawled content on HDFS in Nutch format.

Hope it answers your question.

MuhammadTalhaAfzal commented 8 years ago

@karanjeets:

Right now the content files (Nutch segments) are created in the sparkler-master/sparkler-app/target/ directory, but I want them to be in HDFS.

I already have Hadoop 2.5.2 installed, so in order to store content in HDFS I am using this command when crawling: `java -jar sparkler-app-0.1.jar crawl -id sparkler-job-xxxxxxxx -o hdfs://localhost:9000/sparklercontent`

but I am getting this error: `No FileSystem for scheme: hdfs`

I have also tried adding Hadoop's core-site.xml to the jar file, but I still get the same error when I run the command below: `java -jar sparkler-app-0.1.jar crawl -id sparkler-job-xxxxxxxx`

I've also added this dependency to sparkler-master/sparkler-app/pom.xml for Maven:

```
<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-hdfs</artifactId>
  <version>2.2.0</version>
</dependency>
```

Now I am out of ideas. Could somebody please post the steps for running Sparkler with HDFS?

karanjeets commented 8 years ago

@MuhammadTalhaAfzal: Sparkler currently picks up Hadoop from the Spark library, which uses Hadoop 2.2.0. If Hadoop 2.5.2 is backward compatible with Hadoop 2.2.0, it should work fine.

Quick tip: I see that in a lot of places you are using `java`. Instead, you should run this as a Spark job (using spark-submit) and make sure it is aware of HDFS. I am quite sure this will solve your problem.
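A rough sketch of what that submission could look like (editor's addition, not from the thread): the main class name, output path, and HDFS URL below are placeholder assumptions and may differ in your build.

```shell
# Submit the Sparkler jar through spark-submit instead of plain `java -jar`,
# so that Spark's Hadoop/HDFS configuration is picked up.
# --class value is a guess at Sparkler's entry point; verify it in your jar.
spark-submit \
  --master yarn \
  --class edu.usc.irds.sparkler.Main \
  sparkler-app-0.1.jar \
  crawl -id sparkler-job-xxxxxxxx -o hdfs://localhost:9000/sparklercontent
```

With spark-submit, the driver inherits the cluster's Hadoop configuration (via HADOOP_CONF_DIR), which is what registers the `hdfs://` file system implementation.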

I am preparing a step-by-step guide for running Sparkler on HDFS. I am traveling this week, so it might be a little late.

@thammegowda: In case you have time this week, could you see what is going wrong here?

thammegowda commented 8 years ago

@MuhammadTalhaAfzal It looks like you are running in local mode. What is your Spark master URL? Is it `local[*]`?

By default, Spark saves its output to the file system it is aware of. If you are running in local mode, it uses the local file system; if you are running Spark over YARN with HDFS, it will use HDFS. In this case, either submit the job using spark-submit or set the master URL properly.
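The two cases above could be sketched as follows (editor's addition; paths, jar name, and arguments are illustrative placeholders, not verified against this setup):

```shell
# Local master: a relative or unqualified output path resolves on the local disk.
spark-submit --master 'local[*]' sparkler-app-0.1.jar \
  crawl -id sparkler-job-xxxxxxxx

# YARN master with HDFS configured: the same unqualified path resolves on HDFS,
# because the cluster's fs.defaultFS (from core-site.xml) points at hdfs://.
HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop \
spark-submit --master yarn sparkler-app-0.1.jar \
  crawl -id sparkler-job-xxxxxxxx
```

The key point is that the master URL and the Hadoop configuration visible to the driver, not the jar itself, determine which file system an output path resolves to.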

If you are not sure which file system it is using, try a relative path for the output and see where the results get stored.

thammegowda commented 7 years ago

Refer to the (work-in-progress) wiki page on deploying Sparkler with Juju: https://github.com/USCDataScience/sparkler/wiki/Deploying-Sparkler-with-Juju