To solve the issue, just declare `throws Exception` on `public FetchedData fetch(Resource resource)`, and add a try/catch around the `fetch()` call in `public FetchedData next()`. Cheers
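For illustration, a minimal sketch of what I mean (the iterator internals are assumed for the example, not copied from FetcherDefault; `ERROR_CODE` is whatever client-error constant the fetcher already defines, and `ResourceStatus.ERROR` is an assumed status constant):

```java
// Hypothetical sketch of FetcherDefault.FetchIterator.next() with the guard added.
// Assumes the iterator pulls the next Resource from an underlying iterator
// ("resources" here) and delegates to fetch(), which now declares "throws Exception".
public FetchedData next() {
    Resource resource = resources.next();
    try {
        return fetch(resource);
    } catch (Exception e) {
        // One bad URL should not kill the whole crawl: record the failure and move on
        LOG.warn("FETCH-ERROR {}", resource.getUrl());
        FetchedData fetchedData = new FetchedData("".getBytes(), "application/html", ERROR_CODE);
        resource.setStatus(ResourceStatus.ERROR.toString()); // assumed constant
        fetchedData.setResource(resource);
        return fetchedData;
    }
}
```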
@arelaxend Thanks for reporting this issue, and for the quick solution as well. I will investigate more before wrapping it in a try/catch. Just a suggestion: if you know which URLs are causing this issue and you don't want to crawl them anyway, add URL regex filter(s). It will save you some time. :)
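For instance, the deny rules could look something like this (an illustrative snippet, assuming a Nutch-style regex filter file where a leading `-` rejects and `+` accepts; `problematic.example.com` is a placeholder for the hosts causing trouble):

```
# reject the hosts that crash the fetcher (placeholder domain)
-^https?://problematic\.example\.com/
# reject obvious non-HTML resources up front
-\.(gif|jpg|jpeg|png|ico|css|js|pdf)$
# accept everything else
+.
```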
Thanks @arelaxend! If you could submit a pull request with those fixes, it would be awesome :+1: :1st_place_medal:
@karanjeets
Instead of

> Just a suggestion: if you know which URLs are causing this issue and you don't want to crawl them anyway, add URL regex filter(s). It will save you some time. :)

I suggest we do this:

> To solve the issue, just declare `throws Exception` on `public FetchedData fetch(Resource resource)`, and add a try/catch around the `fetch()` call in `public FetchedData next()`.
Hi, I got:

```
fatal: Could not read from remote repository.
Please make sure you have the correct access rights and the repository exists.
```
Besides, you just have to update the `fetch()` function in the fetcher-jbrowser plugin. I chose to do it this way to stay consistent with what you have done in the `fetch()` function in the app folder. 👍 So, it means changing the implementation of `fetch()` to the following code:
```java
public FetchedData fetch(Resource resource) {
    LOG.info("JBrowser FETCHER {}", resource.getUrl());
    FetchedData fetchedData;
    /*
     * In this plugin we will work only on HTML data.
     * If the data is of any other type, like image or pdf, the plugin will return a client error
     * so it can be fetched using the default Fetcher.
     */
    try {
        if (!isWebPage(resource.getUrl())) {
            LOG.debug("{} not a html. Falling back to default fetcher.",
                    resource.getUrl());
            // This should be true for all URLs ending with a 4-character file extension
            //return new FetchedData("".getBytes(), "application/html", ERROR_CODE);
            return super.fetch(resource);
        }
        long start = System.currentTimeMillis();
        LOG.debug("Time taken to create driver- {}",
                (System.currentTimeMillis() - start));

        // This will block for the page load and any
        // associated AJAX requests
        driver.get(resource.getUrl());
        int status = driver.getStatusCode();
        //content-type

        // Returns the page source in its current state, including
        // any DOM updates that occurred after page load
        String html = driver.getPageSource();
        //quitBrowserInstance(driver);

        LOG.debug("Time taken to load {} - {} ", resource.getUrl(),
                (System.currentTimeMillis() - start));
        if (!(status >= 200 && status < 300)) {
            // Not fetched successfully through this plugin;
            // falling back to the default fetcher
            LOG.info("{} Failed to fetch the page. Falling back to default fetcher.",
                    resource.getUrl());
            return super.fetch(resource);
        }
        fetchedData = new FetchedData(html.getBytes(), "application/html", status);
        resource.setStatus(ResourceStatus.FETCHED.toString());
        fetchedData.setResource(resource);
        return fetchedData;
    } catch (Exception e) {
        LOG.info("{} Failed to fetch the page. Falling back to default fetcher.",
                resource.getUrl());
        return super.fetch(resource);
    }
}
```
Plus, regarding the Selenium error: it happens when you build the app but the jar files for the plugins are not generated! 👍
Thanks for the comment.
> Hi, I got: fatal: Could not read from remote repository. Please make sure you have the correct access rights and the repository exists.
What command did you execute to get this error message? I hope you are aware of the way to raise a pull request without having write permissions. If not, refer to http://stackoverflow.com/a/14681796/1506477. Basically, you (1) fork this repo, (2) push your changes to your fork, and (3) raise a pull request from your fork to this repo.
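For reference, the steps look roughly like this on the command line (`<username>` is your GitHub handle; the branch name is just an example):

```
# (1) fork this repo on GitHub, then clone your fork
git clone https://github.com/<username>/sparkler.git
cd sparkler

# (2) commit your fix on a branch and push it to your fork
git checkout -b fix-fetch-exception
git commit -am "Handle exceptions raised by fetch()"
git push origin fix-fetch-exception

# (3) on GitHub, open a pull request from your fork's branch to this repo
```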
> Plus, regarding the Selenium error: it happens when you build the app but the jar files for the plugins are not generated!
Ah! That explains the NullPointerException - the 3rd stack trace.
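In that case, rebuilding from the repository root should regenerate the plugin jars as well. A sketch, assuming the standard multi-module Maven layout of this repo (the exact target path below is illustrative):

```
# build sparkler-app and all plugins from the repo root
mvn clean install -DskipTests

# sanity-check that the plugin jar was actually generated
ls sparkler-plugins/fetcher-jbrowser/target/*.jar
```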
Fixed in https://github.com/USCDataScience/sparkler/pull/61. Waiting for one more person to review before I merge.
This is a good fix. I understand that you want to remove the pain of handling such errors in modules. 👍
Hi!
I encountered some errors. The program crashes after 10 crawls and I get the three errors below. Can you help me figure out why?
Best,
1st:

```
2016-12-26 16:40:24 ERROR Executor:95 [Executor task launch worker-1] - Exception in task 3.0 in stage 1.0 (TID 8)
org.openqa.selenium.WebDriverException:
Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
Driver info: driver.version: JBrowserDriver
    at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
    at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
    at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
    at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
    at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
    at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
    at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
    at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readByte(DataInputStream.java:267)
    at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
    at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
    at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
    at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
    at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
    at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
    ... 20 more
```

2nd:
```
2016-12-26 16:40:24 ERROR TaskSetManager:74 [task-result-getter-3] - Task 3 in stage 1.0 failed 1 times; aborting job
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at edu.usc.irds.sparkler.Main$.main(Main.scala:47)
    at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 1 times, most recent failure: Lost task 3.0 in stage 1.0 (TID 8, localhost): org.openqa.selenium.WebDriverException:
Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
Driver info: driver.version: JBrowserDriver
    at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
    at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
    at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
    at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
    at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
    at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
    at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
    at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readByte(DataInputStream.java:267)
    at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
    at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
    at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
    at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
    at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
    at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
    ... 20 more
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
    at edu.usc.irds.sparkler.pipeline.Crawler$$anonfun$run$1.apply$mcVI$sp(Crawler.scala:139)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
    at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:121)
    at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
    at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:40)
    at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:211)
    at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
    ... 6 more
Caused by: org.openqa.selenium.WebDriverException:
Build info: version: 'unknown', revision: 'unknown', time: 'unknown'
System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91'
Driver info: driver.version: JBrowserDriver
    at com.machinepublishers.jbrowserdriver.Util.handleException(Util.java:140)
    at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:646)
    at edu.usc.irds.sparkler.plugin.FetcherJBrowser.fetch(FetcherJBrowser.java:81)
    at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:77)
    at edu.usc.irds.sparkler.util.FetcherDefault$FetchIterator.next(FetcherDefault.java:60)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
    at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:52)
    at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
    at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readByte(DataInputStream.java:267)
    at sun.rmi.transport.StreamRemoteCall.executeCall(StreamRemoteCall.java:215)
    at sun.rmi.server.UnicastRef.invoke(UnicastRef.java:162)
    at java.rmi.server.RemoteObjectInvocationHandler.invokeRemoteMethod(RemoteObjectInvocationHandler.java:227)
    at java.rmi.server.RemoteObjectInvocationHandler.invoke(RemoteObjectInvocationHandler.java:179)
    at com.machinepublishers.jbrowserdriver.$Proxy18.get(Unknown Source)
    at com.machinepublishers.jbrowserdriver.JBrowserDriver.get(JBrowserDriver.java:643)
    ... 20 more
```
3rd:

```
ERROR Utils:95 [Executor task launch worker-2] - Uncaught exception in thread Executor task launch worker-2
java.lang.NullPointerException
    at org.apache.spark.scheduler.Task$$anonfun$run$1.apply$mcV$sp(Task.scala:95)
    at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1229)
    at org.apache.spark.scheduler.Task.run(Task.scala:93)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
2016-12-26 16:40:57 ERROR Executor:95 [Executor task launch worker-2] - Exception in task 1.0 in stage 1.0 (TID 6)
java.util.NoSuchElementException: key not found: 6
    at scala.collection.MapLike$class.default(MapLike.scala:228)
    at scala.collection.AbstractMap.default(Map.scala:59)
    at scala.collection.mutable.HashMap.apply(HashMap.scala:65)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:322)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Exception in thread "Executor task launch worker-2" java.lang.IllegalStateException: RpcEnv already stopped.
    at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159)
    at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
    at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
    at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
    at org.apache.spark.scheduler.local.LocalBackend.statusUpdate(LocalBackend.scala:151)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Exception in thread "Executor task launch worker-4" java.lang.IllegalStateException: RpcEnv already stopped.
    at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:159)
    at org.apache.spark.rpc.netty.Dispatcher.postOneWayMessage(Dispatcher.scala:131)
    at org.apache.spark.rpc.netty.NettyRpcEnv.send(NettyRpcEnv.scala:192)
    at org.apache.spark.rpc.netty.NettyRpcEndpointRef.send(NettyRpcEnv.scala:516)
    at org.apache.spark.scheduler.local.LocalBackend.statusUpdate(LocalBackend.scala:151)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:317)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
...
```
```
2016-12-26 16:42:35 DEBUG FetcherJBrowser:153 [FelixStartLevel] - Exception Connection refused Build info: version: 'unknown', revision: 'unknown', time: 'unknown' System info: host: '', ip: '', os.name: 'Mac OS X', os.arch: 'x86_64', os.version: '10.12.2', java.version: '1.8.0_91' Driver info: driver.version: JBrowserDriver raised. The driver is either already closed or this is an unknown exception

Process finished with exit code 1
```