USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0
410 stars 143 forks source link

Java null pointer error in fetch() #57

Closed arelaxend closed 7 years ago

arelaxend commented 7 years ago

Hi!

I encounter some errors. The program is crashing for 10 crawls and I have the following errors (i put bold chars). Can you help me to figure out why ?

Best,

1st

2016-12-26 16:40:24 ERROR Executor:95 [Executor task launch worker-1] - Exception in task 3.0 in stage 1.0 (TID 8) org.openqa.selenium.WebDriverException: Build info: version: 'unknown', revision: 'unknown', time: 'unknown'** System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91' Driver info: driver.version: JBrowserDriver at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140) at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646) at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81) at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77) at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60) at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43) at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52) at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28) at scala.collection.Iterator$$anon$12.next(Iterator.scala:444) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:268) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.EOFException at java.io.DataInputStream.readByte(DataInputStream.java:267) at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215) at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162) at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227) at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179) at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source) at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643) ... 20 more

2nd

2016-12-26 16:40:24 ERROR TaskSetManager:74 [task-result-getter-3] - Task 3 in stage 1.0 failed 1 times; aborting job Exception in thread "main" java.lang.reflect.InvocationTargetException at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at edu.usc.irds.sparkler.Main$.main(Main.scala:47) at edu.usc.irds.sparkler.Main.main(Main.scala) Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 (TID 8, localhost): org.openqa.selenium.WebDriverException: Build info: version: 'unknown', revision: 'unknown', time: 'unknown'** System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91' Driver info: driver.version: JBrowserDriver at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140) at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646) at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81) at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77) at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60) at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43) at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52) at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28) at scala.collection.Iterator$$anon$12.next(Iterator.scala:444) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:268) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.EOFException at java.io.DataInputStream.readByte(DataInputStream.java:267) at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215) at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162) at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227) at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179) at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source) at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643) ... 20 more

Driver stacktrace: at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419) at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418) at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48) at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799) at scala.Option.foreach(Option.scala:257) at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588) at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48) at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922) at edu.usc.irds.sparkler.pipeline.Crawler$$anonfun$run$1.apply$mcVI$sp(Crawler.scala:139) at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160) at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:121) at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34) at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:40) at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:211) at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala) ... 6 more Caused by: org.openqa.selenium.WebDriverException: Build info: version: 'unknown', revision: 'unknown', time: 'unknown' System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91' Driver info: driver.version: JBrowserDriver at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140) at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646) at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81) at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77) at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60) at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43) at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52) at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28) at scala.collection.Iterator$$anon$12.next(Iterator.scala:444) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:268) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.io.EOFException at java.io.DataInputStream.readByte(DataInputStream.java:267) at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215) at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162) at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227) at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179) at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source) at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643) ... 20 more

3rd

ERROR Utils:95 [Executor task launch worker-2] - Uncaught exception in thread Executor task launch worker-2 java.lang.NullPointerException at org.apache.spark.scheduler.Task$$anonfun$run$1.apply$mcV$sp(Task.scala:95) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1229) at org.apache.spark.scheduler.Task.run(Task.scala:93) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) 2016-12-26 16:40:57 ERROR Executor:95 [Executor task launch worker-2] - Exception in task 1.0 in stage 1.0 (TID 6) java.util.NoSuchElementException: key not found: 6** at scala.collection.MapLike$class.default(MapLike.scala:228) at scala.collection.AbstractMap.default(Map.scala:59) at scala.collection.mutable.HashMap.apply(HashMap.scala:65) at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:322) at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171) at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78) at org.apache.spark.rdd.RDD.iterator(RDD.scala:268) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306) at org.apache.spark.rdd.RDD.iterator(RDD.scala:270) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) at org.apache.spark.scheduler.Task.run(Task.scala:89) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Exception in thread "Executor task launch worker-2" java.lang.IllegalStateException: RpcEnv already stopped. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131) at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516) at org.apache.spark.scheduler.local.LocalBackend.statusUpdate(LocalBackend.scala:151) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Exception in thread "Executor task launch worker-4" java.lang.IllegalStateException: RpcEnv already stopped. at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159) at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131) at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192) at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516) at org.apache.spark.scheduler.local.LocalBackend.statusUpdate(LocalBackend.scala:151) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745)

...

2016-12-26 16:42:35 DEBUG FetcherJBrowser:153 [FelixStartLevel] - Exception Connection refused Build info: version: 'unknown', revision: 'unknown', time: 'unknown' System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91' Driver info: driver.version: JBrowserDriver raised. The driver is either already closed or this is an unknown exception

Process finished with exit code 1

arelaxend commented 7 years ago

To solve the issue, just add a throw in public FetchedData fetch(Resource resource) throws Exception, and also add a try/catch in public FetchedData next() around the fetch() function. Cheers

karanjeets commented 7 years ago

@arelaxend Thanks for reporting this issue and also for a quick solution. I will investigate more before putting it around try and catch. Just a suggestion - If you know what URLs are causing this issue and you don't want to crawl them as well, place URL regex filter(s). It will save you some time. :)

thammegowda commented 7 years ago

Thanks @arelaxend If you could submit a pull request with those fixes, it will be awesome :+1: :1st_place_medal:

thammegowda commented 7 years ago

@karanjeets
Instead of

Just a suggestion - If you know what URLs are causing this issue and you don't want to crawl them as well, place URL regex filter(s). It will save you some time. :)

I suggest we do this:

To solve the issue, just add a throw in public FetchedData fetch(Resource resource) throws Exception, and also add a try/catch in public FetchedData next() around the fetch() function.

arelaxend commented 7 years ago

Hi, I got: fatal: Could not read from remote repository. Please make sure you have the correct access rights and the repository exists.

Besides, you just have to update the function fetch() in the plugin fetcher-jbbrowser. I choose to do that to be coherent with what you have done in the fetch() function on the app folder. 👍 And so, it means to change the implementation of the fetch() function by the following code:

public FetchedData fetch(Resource resource) {
    LOG.info("JBrowser FETCHER {}", resource.getUrl());
    FetchedData fetchedData;
        /*
        * In this plugin we will work on only HTML data
        * If data is of any other data type like image, pdf etc plugin will return client error
        * so it can be fetched using default Fetcher
        */
    try {
      if (!isWebPage(resource.getUrl())) {
        LOG.debug("{} not a html. Falling back to default fetcher.",
            resource.getUrl());
        //This should be true for all URLS ending with 4 character file extension
        //return new FetchedData("".getBytes(), "application/html", ERROR_CODE) ;
        return super.fetch(resource);
      }
      long start = System.currentTimeMillis();

      LOG.debug("Time taken to create driver- {}",
          (System.currentTimeMillis() - start));

      // This will block for the page load and any
      // associated AJAX requests
      driver.get(resource.getUrl());

      int status = driver.getStatusCode();
      //content-type

      // Returns the page source in its current state, including
      // any DOM updates that occurred after page load
      String html = driver.getPageSource();

      //quitBrowserInstance(driver);

      LOG.debug("Time taken to load {} - {} ", resource.getUrl(),
          (System.currentTimeMillis() - start));

      if (!(status >= 200 && status < 300)) {
        // If not fetched through plugin successfully
        // Falling back to default fetcher
        LOG.info(
            "{} Failed to fetch the page. Falling back to default fetcher.",
            resource.getUrl());
        return super.fetch(resource);
      }
      fetchedData = new FetchedData(html.getBytes(), "application/html",
          status);
      resource.setStatus(ResourceStatus.FETCHED.toString());
      fetchedData.setResource(resource);
      return fetchedData;
    } catch (Exception e) {
      LOG.info(
          "{} Failed to fetch the page. Falling back to default fetcher.",
          resource.getUrl());
      return super.fetch(resource);
    }
  }
arelaxend commented 7 years ago

Plus, for the selenium error, it happens when you build the app, but the jar file for the plugins are not generated! 👍

thammegowda commented 7 years ago

Thanks for the comment

Hi, I got: fatal: Could not read from remote repository. Please make sure you have the correct access rights and the repository exists.

What command did you execute to get this error message? I hope you are aware of the way to raise pull request without having write permissions. If not, refer to http://stackoverflow.com/a/14681796/1506477. Basically, you (1) fork this repo (2) push your changes to your fork (3) raise a pull request from your fork to this repo.

Plus, for the selenium error, it happens when you build the app, but the jar file for the plugins are not generated!

Ah! That explains the NullPointerException - the 3rd stack trace.

thammegowda commented 7 years ago

Fixed in https://github.com/USCDataScience/sparkler/pull/61 Waiting for one more person to review before I merge

arelaxend commented 7 years ago

This is a good fix. I understood that you want to remove the pain of handling such errors on modules. 👍