USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Catch missing plugins #113

Closed: buggtb closed this issue 6 years ago

buggtb commented 7 years ago

Issue Description

If Sparkler is configured to load plugins that are not actually present, the crawl fails with an opaque Spark error (a NullPointerException from inside the plugin loader) instead of a clear message about the missing plugins.
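Judging from the stack trace below (`ArrayOps$ofRef.filter` throwing from `PluginService$BundleLoader$.load`), the likely trigger is `java.io.File.listFiles`, which returns `null` rather than an empty array when the directory does not exist. A minimal, self-contained sketch of that failure pattern outside Sparkler (the `plugins` path here is hypothetical, not the project's actual config value):

```scala
import java.io.File

object MissingPluginsRepro {
  def main(args: Array[String]): Unit = {
    // Hypothetical plugins directory that was never deployed
    val pluginDir = new File("plugins")

    // File.listFiles returns null (not an empty array) for a missing
    // directory, so the result is a null Array[File]
    val bundles = pluginDir.listFiles()

    // Filtering a null array NPEs inside ArrayOps, matching the
    // "Caused by: java.lang.NullPointerException ... ArrayOps$ofRef.filter"
    // frames in the trace below
    val jars = bundles.filter(_.getName.endsWith(".jar"))
    println(jars.length)
  }
}
```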

How to reproduce it

Deploy the app jar without any plugins and try to run a crawl

Environment and Version Information

Java 8, Linux

Stack trace

```
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
    at edu.usc.irds.sparkler.pipeline.Crawler$$anonfun$run$1.apply$mcVI$sp(Crawler.scala:155)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
    at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:135)
    at edu.usc.irds.sparkler.base.CliTool$class.run(CliTool.scala:34)
    at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:41)
    at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:218)
    at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
    ... 6 more
Caused by: java.lang.NullPointerException
    at scala.collection.mutable.ArrayOps$ofRef$.newBuilder$extension(ArrayOps.scala:190)
    at scala.collection.mutable.ArrayOps$ofRef.newBuilder(ArrayOps.scala:186)
    at scala.collection.TraversableLike$class.filterImpl(TraversableLike.scala:246)
    at scala.collection.TraversableLike$class.filter(TraversableLike.scala:259)
    at scala.collection.mutable.ArrayOps$ofRef.filter(ArrayOps.scala:186)
    at edu.usc.irds.sparkler.service.PluginService$BundleLoader$.load(PluginService.scala:178)
    at edu.usc.irds.sparkler.service.PluginService.load(PluginService.scala:160)
    at edu.usc.irds.sparkler.service.PluginService$.getExtension(PluginService.scala:283)
    at edu.usc.irds.sparkler.pipeline.FetchFunction$.apply(FetchFunction.scala:49)
    at edu.usc.irds.sparkler.pipeline.FetchFunction$.apply(FetchFunction.scala:32)
    at edu.usc.irds.sparkler.pipeline.FairFetcher.<init>(FairFetcher.scala:38)
    at edu.usc.irds.sparkler.pipeline.Crawler$$anonfun$run$1$$anonfun$2.apply(Crawler.scala:144)
    at edu.usc.irds.sparkler.pipeline.Crawler$$anonfun$run$1$$anonfun$2.apply(Crawler.scala:144)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:284)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

2017-05-05 15:02:10 INFO PluginService$:119 [Felix-sjob-1493995629882] - Going to stop Services...
```
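A fix in the spirit of this issue's title would validate the plugin directory up front and raise a descriptive error, rather than letting a null array surface as an NPE inside a Spark task. A hedged sketch, not the actual Sparkler code: the object, method name, and error message below are illustrative, assuming a loader like `PluginService.BundleLoader` that lists a directory of plugin jars:

```scala
import java.io.File

object PluginGuard {
  // Illustrative fail-fast guard: report the missing plugins directory
  // with an actionable message instead of an opaque NullPointerException.
  def listBundles(pluginDir: File): Array[File] = {
    if (!pluginDir.isDirectory)
      throw new IllegalStateException(
        s"Plugins directory not found: ${pluginDir.getAbsolutePath}. " +
          "Deploy the plugin bundles there or disable plugins in the config.")
    // listFiles can still return null on I/O errors, so guard that too
    Option(pluginDir.listFiles()).getOrElse(Array.empty[File])
      .filter(_.getName.endsWith(".jar"))
  }
}
```

Running such a check once in the driver, before the crawl RDD is materialized, would also keep the error out of executor logs entirely, which is where it is hardest to diagnose.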

Contributing

If you'd like to help us fix the issue by contributing some code, but would like guidance or help in doing so, please mention it!

thammegowda commented 6 years ago

Resolved: we moved away from Apache Felix. Please reopen if the new plugin backend has issues.