USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Plugin for fetching pages using a headless browser #37

Closed: smadha closed 8 years ago

smadha commented 8 years ago
karanjeets commented 8 years ago

@smadha Thanks a lot!! Appreciate your hard work :) There was a dependency conflict when running the fetcher on top of an Apache Spark cluster. It is now fixed.

@thammegowda Everything looks good from my end. Tested it locally and on a Spark cluster. Do you want to give it a try before it's merged?

thammegowda commented 8 years ago

Build failed

[INFO] ------------------------------------------------------------------------
[INFO] Building fetcher-jbrowser 0.1-SNAPSHOT
[INFO] ------------------------------------------------------------------------
...... ((Message trimmed))...
[INFO]
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ fetcher-jbrowser ---
[INFO]
[INFO] --- maven-bundle-plugin:2.5.0:manifest (default-cli) @ fetcher-jbrowser ---
[WARNING] Manifest edu.usc.irds.sparkler.plugin:fetcher-jbrowser:bundle:0.1-SNAPSHOT : Unused Private-Package instructions, no such package(s) on the class path: [!*]
[ERROR] Manifest edu.usc.irds.sparkler.plugin:fetcher-jbrowser:bundle:0.1-SNAPSHOT : Bundle-Activator not found on the bundle class path nor in imports: edu.usc.irds.sparkler.plugin.FetcherJBrowserActivator
[ERROR] Error(s) found in manifest configuration
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] sparkler-parent .................................... SUCCESS [  0.377 s]
[INFO] sparkler-api ....................................... SUCCESS [  4.851 s]
[INFO] sparkler ........................................... SUCCESS [ 45.358 s]
[INFO] sparkler-plugins ................................... SUCCESS [  0.069 s]
[INFO] urlfilter-regex .................................... SUCCESS [  2.329 s]
[INFO] fetcher-jbrowser ................................... FAILURE [  3.998 s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:00 min
[INFO] Finished at: 2016-10-19T14:46:13-07:00
[INFO] Final Memory: 64M/1499M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.felix:maven-bundle-plugin:2.5.0:manifest (default-cli) on project fetcher-jbrowser: Error(s) found in manifest configuration -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException
[ERROR]
[ERROR] After correcting the problems, you can resume the build with the command
[ERROR]   mvn <goals> -rf :fetcher-jbrowser

karanjeets commented 8 years ago

@thammegowda Can you please try building Sparkler with the following:

mvn clean install

thammegowda commented 8 years ago

@karanjeets If I do mvn clean install

and then try to run

14:59 $ bin/sparkler.sh inject -su http://www.isjavascriptenabled.com/
>>jobId = sjob-1476914418319
15:00 $ bin/sparkler.sh crawl -id sjob-1476914418319

I get

16/10/19 15:01:37 INFO FetchFunction$: FETCHING http://www.isjavascriptenabled.com/
16/10/19 15:01:37 INFO PluginService$: Felix Configuration loaded successfully
Bundle Found: org.apache.felix.framework
16/10/19 15:01:37 WARN FetchFunction$: FETCH-ERROR http://www.isjavascriptenabled.com/
java.util.NoSuchElementException: None.get
    at scala.None$.get(Option.scala:347)
    at scala.None$.get(Option.scala:345)
    at edu.usc.irds.sparkler.pipeline.FetchFunction$.apply(FetchFunction.scala:46)
    at edu.usc.irds.sparkler.pipeline.FetchFunction$.apply(FetchFunction.scala:34)
    at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
    at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:29)
    at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
16/10/19 15:01:37 INFO ParseFunction$: PARSING  http://www.isjavascriptenabled.com/

Plugin not available/enabled?
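
For context on the trace above: `java.util.NoSuchElementException: None.get` is what Scala raises when `.get` is called on an empty `Option`, and Java's `Optional.get()` fails the same way on an empty value. The sketch below is not Sparkler's code. `PluginRegistry`, `findFetcher`, and the fallback fetcher are hypothetical names, used only to show how a lookup for an optional plugin can fall back to a default instead of throwing when nothing is registered.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical registry (NOT Sparkler's PluginService): maps an extension
// point name to a fetcher that turns a URL into fetched content.
final class PluginRegistry {
    private final Map<String, Function<String, String>> fetchers = new HashMap<>();

    void register(String name, Function<String, String> fetcher) {
        fetchers.put(name, fetcher);
    }

    // Returns an empty Optional when no plugin is registered under `name`.
    Optional<Function<String, String>> findFetcher(String name) {
        return Optional.ofNullable(fetchers.get(name));
    }

    // Uses the registered fetcher if present, otherwise a default one,
    // so an empty lookup never reaches an unchecked get().
    String fetch(String name, String url) {
        Function<String, String> fallback = u -> "default-fetch:" + u;
        return findFetcher(name).orElse(fallback).apply(url);
    }

    public static void main(String[] args) {
        PluginRegistry registry = new PluginRegistry();
        String url = "http://www.isjavascriptenabled.com/";
        System.out.println(registry.fetch("fetcher", url)); // no plugin registered yet
        registry.register("fetcher", u -> "browser-fetch:" + u);
        System.out.println(registry.fetch("fetcher", url)); // plugin registered
    }
}
```

Whether a silent fallback or a clearer error message is the better behaviour when a plugin is missing is a design choice; the sketch only illustrates the first option.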

I see the bundles are here:

15:05 $ ls -al sparkler-app/bundles/
total 66736
drwxr-xr-x  5 thammegr  703763885       170 Oct 19 14:54 .
drwxr-xr-x  6 thammegr  703763885       204 Oct 19 14:58 ..
-rw-r--r--  1 thammegr  703763885         0 Oct 19 13:40 .donotdelete
-rw-r--r--  1 thammegr  703763885  34153743 Oct 19 14:59 fetcher-jbrowser-0.1-SNAPSHOT.jar
-rw-r--r--  1 thammegr  703763885     10543 Oct 19 14:59 urlfilter-regex-0.1-SNAPSHOT.jar

My environment:

15:06 $ java -version
java version "1.8.0_101"
Java(TM) SE Runtime Environment (build 1.8.0_101-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.101-b13, mixed mode)
✔ ~/work/irds/sparkler [js-plugin L|✔]
15:06 $ mvn --version
Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-10T08:41:47-08:00)
Maven home: /usr/local/Cellar/maven/3.3.9/libexec
Java version: 1.8.0_101, vendor: Oracle Corporation
Java home: /Library/Java/JavaVirtualMachines/jdk1.8.0_101.jdk/Contents/Home/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "mac os x", version: "10.11.6", arch: "x86_64", family: "mac"

karanjeets commented 8 years ago

@thammegowda This is strange. It worked for me just fine. Let me try with bin/sparkler.sh and see where things are wrong.

karanjeets commented 8 years ago

@thammegowda The issue is with bin/sparkler.sh

It is using the conf directory from the build path. Under the new design of the plugin system, the path to the bundles directory is generated at compile time using the Maven resources plugin. See here

Come to think of it, did you add the conf directory to the build path so that post-build changes get picked up? If so, I can do something similar with the compiled conf directory.
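
To illustrate the point about compile-time generation: when Maven resource filtering is used, a placeholder such as `${project.basedir}` in a source config file is replaced with a real path in the copy under the build output, so reading the unfiltered source copy yields the raw placeholder. The sketch below is an assumption-laden illustration, not Sparkler's loader. The properties format, the key `bundles.dir`, and the class `FilteredConfigCheck` are all hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;
import java.util.regex.Pattern;

// Hypothetical helper: rejects config values that still contain an
// unresolved Maven-style placeholder such as ${project.basedir}.
final class FilteredConfigCheck {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{[^}]+\\}");

    // True when the value is present and contains no ${...} placeholder.
    static boolean isFiltered(String value) {
        return value != null && !PLACEHOLDER.matcher(value).find();
    }

    // Parses properties text and returns the value for `key`,
    // failing loudly if it is missing or was never filtered.
    static String bundleDir(String configText, String key) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(configText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        String value = props.getProperty(key);
        if (!isFiltered(value)) {
            throw new IllegalStateException("Missing or unfiltered value for " + key + ": " + value);
        }
        return value;
    }

    public static void main(String[] args) {
        // Filtered copy, as it might look in the build output.
        System.out.println(bundleDir("bundles.dir=/opt/app/bundles", "bundles.dir"));
        // Unfiltered source copy: the placeholder is still there.
        System.out.println(isFiltered("${project.basedir}/bundles"));
    }
}
```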

thammegowda commented 8 years ago

@karanjeets I am prepending the 'conf' dir to the classpath to give it higher priority. Is there a way to restore that functionality?
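
The reason prepending works is that classpath entries are searched in the order listed, so the first entry containing a resource wins. A minimal sketch of that behaviour using `URLClassLoader`, with two temporary directories standing in for the 'conf' dir and the build output; the resource name `app.properties` and the class `ClasspathOrderDemo` are made up for the example and are not part of Sparkler.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

final class ClasspathOrderDemo {

    // Creates a temp directory holding one resource file with the given content.
    static Path dirWith(String resource, String content) throws IOException {
        Path dir = Files.createTempDirectory("cp-demo");
        Files.write(dir.resolve(resource), content.getBytes(StandardCharsets.UTF_8));
        return dir;
    }

    // Looks up `resource` in the given directories, searched in the order
    // listed, and returns the first line of whichever copy is found first.
    static String firstMatch(String resource, Path... dirs) throws IOException {
        URL[] urls = new URL[dirs.length];
        for (int i = 0; i < dirs.length; i++) {
            urls[i] = dirs[i].toUri().toURL(); // existing directories get a trailing '/'
        }
        // Parent is null so that only the listed directories can supply the resource.
        try (URLClassLoader loader = new URLClassLoader(urls, null);
             InputStream in = loader.getResourceAsStream(resource)) {
            if (in == null) {
                throw new IOException("Resource not found: " + resource);
            }
            return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8)).readLine();
        }
    }

    // Returns the content that wins when the conf directory is listed first or last.
    static String winner(boolean confFirst) {
        try {
            Path conf = dirWith("app.properties", "source=conf");
            Path build = dirWith("app.properties", "source=build");
            return confFirst
                    ? firstMatch("app.properties", conf, build)
                    : firstMatch("app.properties", build, conf);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(winner(true));  // conf listed first
        System.out.println(winner(false)); // build output listed first
    }
}
```

With the conf directory listed first, its copy of the resource is returned; with the order reversed, the build copy is returned instead.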

karanjeets commented 8 years ago

@thammegowda Changes have been made. Please review and merge.

thammegowda commented 8 years ago

@karanjeets @smadha Merged 💯. This is a fantastic PR 👍