USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Exclude net.jpountz.lz4 lz4 from kafka-clients dependency in sparkler-app/pom.xml #236

Closed: lewismc closed this 3 years ago

lewismc commented 3 years ago

I followed the quick-start-running-your-first-crawl-job-in-minutes guide. Once the Docker container is running and I have exec'd into it, I inject a seed URL:

bash-4.2$ /data/sparkler/bin/sparkler.sh inject -id 1 -su 'https://www.jpl.nasa.gov'
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.logging.log4j.log4j-slf4j-impl-2.11.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.slf4j.slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2021-06-30 04:40:13 INFO  PluginService$:53 - Loading plugins...
2021-06-30 04:40:13 INFO  PluginService$:62 - 2 plugin(s) Active: [urlfilter-regex, urlfilter-samehost]
2021-06-30 04:40:14 WARN  PluginService$:65 - 4 extra plugin(s) available but not activated: Set(fetcher-chrome, scorer-dd-svn, fetcher-jbrowser, fetcher-htmlunit)
2021-06-30 04:40:14 DEBUG PluginService$:68 - Loading urlfilter-regex
2021-06-30 04:40:14 INFO  PluginService$:73 - Extensions found: []
2021-06-30 04:40:14 DEBUG PluginService$:68 - Loading urlfilter-samehost
2021-06-30 04:40:14 INFO  PluginService$:73 - Extensions found: []
2021-06-30 04:40:14 INFO  PluginService$:82 - Recognised Plugins: Map()
2021-06-30 04:40:14 INFO  Injector$:108 - Injecting 1 seeds
>>jobId = 1
2021-06-30 04:40:14 WARN  PluginService$:49 - Stopping all plugins... Runtime is about to exit.

I then attempt to crawl:

bash-4.2$ /data/sparkler/bin/sparkler.sh crawl -id 1 -tn 100 -i 2
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.logging.log4j.log4j-slf4j-impl-2.11.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.slf4j.slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/data/sparkler/sparkler-app-0.3.1-SNAPSHOT/lib/org.apache.spark.spark-unsafe_2.12-3.0.1.jar) to constructor java.nio.DirectByteBuffer(long,int)
WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
2021-06-30 04:40:45 WARN  NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2021-06-30 04:40:47 INFO  Crawler$:160 - Setting local job: {User-Agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Sparkler/${project.version}, Accept=text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8, Accept-Language=en-US,en}
2021-06-30 04:40:47 INFO  Crawler$:174 - Committing crawldb..
2021-06-30 04:40:47 INFO  Crawler$:219 - Starting the job:1, task:1eccd733-fb37-423c-9862-629d9c045707
2021-06-30 04:40:47 INFO  MemexCrawlDbRDD$:54 - selecting 1 out of 1
2021-06-30 04:40:48 DEBUG SolrResultIterator$:63 - Query status:UNFETCHED, Start = 0
2021-06-30 04:40:48 DEBUG SolrResultIterator$:77 - Reached the end of result set
2021-06-30 04:40:48 DEBUG SolrResultIterator$:79 - closing solr client.
2021-06-30 04:40:48 WARN  BlockManager:69 - Block rdd_3_0 could not be removed as it was not found on disk or in memory
2021-06-30 04:40:48 ERROR Executor:94 - Exception in task 0.0 in stage 1.0 (TID 1)
java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
    at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486) ~[org.scala-lang.scala-library-2.12.12.jar:?]
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492) ~[org.scala-lang.scala-library-2.12.12.jar:?]
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460) ~[org.scala-lang.scala-library-2.12.12.jar:?]
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:311) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.scheduler.Task.run(Task.scala:127) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) ~[org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449) [org.apache.spark.spark-core_2.12-3.0.1.jar:3.0.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
    at java.lang.Thread.run(Thread.java:830) [?:?]
2021-06-30 04:40:48 WARN  TaskSetManager:69 - Lost task 0.0 in stage 1.0 (TID 1, 5d3eb0d88dbf, executor driver): java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
    at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154)
    at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165)
    at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
    at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
    at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362)
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:127)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:830)

2021-06-30 04:40:48 ERROR TaskSetManager:73 - Task 0 in stage 1.0 failed 1 times; aborting job
Exception in thread "main" java.lang.reflect.InvocationTargetException
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:567)
    at edu.usc.irds.sparkler.Main$.main(Main.scala:50)
    at edu.usc.irds.sparkler.Main.main(Main.scala)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, 5d3eb0d88dbf, executor driver): java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
    at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154)
    at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165)
    at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
    at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
    at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362)
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:127)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:830)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2059)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2008)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2007)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2007)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:973)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:973)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:973)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2239)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2188)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2177)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:775)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2099)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2120)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2152)
    at edu.usc.irds.sparkler.pipeline.Crawler.score(Crawler.scala:254)
    at edu.usc.irds.sparkler.pipeline.Crawler.$anonfun$run$1(Crawler.scala:231)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
    at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:179)
    at edu.usc.irds.sparkler.base.CliTool.run(CliTool.scala:34)
    at edu.usc.irds.sparkler.base.CliTool.run$(CliTool.scala:32)
    at edu.usc.irds.sparkler.pipeline.Crawler.run(Crawler.scala:50)
    at edu.usc.irds.sparkler.pipeline.Crawler$.main(Crawler.scala:338)
    at edu.usc.irds.sparkler.pipeline.Crawler.main(Crawler.scala)
    ... 6 more
Caused by: java.lang.NoSuchMethodError: 'void net.jpountz.lz4.LZ4BlockInputStream.<init>(java.io.InputStream, net.jpountz.lz4.LZ4FastDecompressor, java.util.zip.Checksum, boolean)'
    at org.apache.spark.io.LZ4CompressionCodec.compressedInputStream(CompressionCodec.scala:154)
    at org.apache.spark.serializer.SerializerManager.wrapForCompression(SerializerManager.scala:165)
    at org.apache.spark.serializer.SerializerManager.wrapStream(SerializerManager.scala:126)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.$anonfun$read$1(BlockStoreShuffleReader.scala:74)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:630)
    at org.apache.spark.storage.ShuffleBlockFetcherIterator.next(ShuffleBlockFetcherIterator.scala:70)
    at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:29)
    at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:486)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:492)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:31)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:155)
    at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:41)
    at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:116)
    at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:106)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.$anonfun$getOrCompute$1(RDD.scala:362)
    at org.apache.spark.storage.BlockManager.$anonfun$doPutIterator$1(BlockManager.scala:1371)
    at org.apache.spark.storage.BlockManager.org$apache$spark$storage$BlockManager$$doPut(BlockManager.scala:1298)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1362)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:1186)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:360)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:311)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:127)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:446)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:449)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:830)

This, of course, comes down to a method-signature mismatch involving net.jpountz.lz4.lz4-1.3.0.jar. I found a Stack Overflow answer confirming that it is an issue with the kafka-clients*.jar dependency; in the case of Sparkler, that means org.apache.kafka.kafka-clients-0.10.0.0.jar. The old `net.jpountz.lz4:lz4` 1.3.0 jar, pulled in transitively by kafka-clients, provides classes in the same `net.jpountz.lz4` package as the newer `org.lz4:lz4-java` that Spark 3.0.1 ships, but its `LZ4BlockInputStream` lacks the four-argument constructor Spark calls, hence the `NoSuchMethodError`. Specifically, the `net.jpountz.lz4:lz4` artifact needs to be excluded from the kafka-clients dependency in sparkler-app/pom.xml.
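Roughly, the exclusion would look like the following. This is a minimal sketch, assuming kafka-clients is declared as a direct dependency in sparkler-app/pom.xml; the version shown matches the jar named above, so check it against the actual declaration in the POM:

```xml
<!-- Sketch: exclude the legacy lz4 artifact so it cannot shadow the
     net.jpountz.lz4 classes that Spark 3.0.1 loads from org.lz4:lz4-java -->
<dependency>
  <groupId>org.apache.kafka</groupId>
  <artifactId>kafka-clients</artifactId>
  <version>0.10.0.0</version>
  <exclusions>
    <exclusion>
      <groupId>net.jpountz.lz4</groupId>
      <artifactId>lz4</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

After rebuilding, `mvn dependency:tree -Dincludes=net.jpountz.lz4` run against sparkler-app should come back empty, leaving only Spark's own `org.lz4:lz4-java` on the classpath.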

Ontopic commented 3 years ago

Really hoping this PR can be accepted. I have no experience with Java and was happy to get to the point Lewis was at with a (seemingly) perfect fix. I would normally pull this in myself, but I have no clue where to even look for a pom file 🙃

lewismc commented 3 years ago

Thanks for sharing your experience @Ontopic. I just merged the fix into the master branch.