SANSA-Stack / SANSA-Template-Maven-Spark

Maven Template Project for SANSA using Spark

java.lang.NoClassDefFoundError: Could not initialize class org.apache.jena.riot.system.RiotLib #6

Open JNKHunter opened 4 years ago

JNKHunter commented 4 years ago

Hello,

When running the example on a Spark cluster using 'spark-submit', the following error is encountered. Any ideas what might be causing this?

Caused by: java.lang.NoClassDefFoundError: Could not initialize class org.apache.jena.riot.system.RiotLib
    at net.sansa_stack.rdf.spark.io.NTripleReader$$anonfun$1.apply(NTripleReader.scala:135)
    at net.sansa_stack.rdf.spark.io.NTripleReader$$anonfun$1.apply(NTripleReader.scala:118)
    at net.sansa_stack.rdf.spark.io.NonSerializableObjectWrapper.instance$lzycompute(NTripleReader.scala:207)
    at net.sansa_stack.rdf.spark.io.NonSerializableObjectWrapper.instance(NTripleReader.scala:207)
    at net.sansa_stack.rdf.spark.io.NonSerializableObjectWrapper.get(NTripleReader.scala:209)
    at net.sansa_stack.rdf.spark.io.NTripleReader$$anonfun$load$1.apply(NTripleReader.scala:148)
    at net.sansa_stack.rdf.spark.io.NTripleReader$$anonfun$load$1.apply(NTripleReader.scala:140)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
LorenzBuehmann commented 4 years ago

Hi, did you use the latest 0.7.1 template? Or can you just paste your POM file here? The idea of this Maven template was just to show how one can add the SANSA artifacts - it's basically more of a small guide for inexperienced Maven users. But maybe you or we forgot something.

Also, can you describe how you created the Maven artifact? I guess mvn package, which triggers the Maven Shade plugin?

JNKHunter commented 4 years ago

Hi Lorenz, thanks. I figured this was just a test dir for beginners.

I'm using the exact POM file from the develop branch, which uses version 0.7.2: https://github.com/SANSA-Stack/SANSA-Template-Maven-Spark/blob/48adae0cb02407fc727d704b928417ed0003c940/pom.xml

And you're correct, I'm using mvn package to create the jar.

Do you recommend switching to the 0.7.1 version?

LorenzBuehmann commented 4 years ago

Well, the latest version should work ... so no need to go back, I think.

Let me check what's going wrong here. I've seen this issue before, but I thought it had been resolved already - at least it shouldn't happen with the ResourceTransformer in the Maven Shade plugin enabled, which is the case here.
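
For background: Jena 3.x initializes itself via Java ServiceLoader files under META-INF/services, and when several jars are shaded into one fat jar those files must be merged, otherwise classes like RiotLib fail to initialize. The transformer meant above is presumably the Shade plugin's ServicesResourceTransformer; a minimal sketch of the relevant POM fragment (plugin binding as an assumption, check the template's actual pom.xml):

    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>shade</goal>
          </goals>
          <configuration>
            <transformers>
              <!-- Merge META-INF/services files from all shaded jars so that
                   Jena's ServiceLoader-based initialization still finds them -->
              <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
            </transformers>
          </configuration>
        </execution>
      </executions>
    </plugin>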

By the way, I'll also reply to your mailing list question once I've found a good answer.

kohpai commented 3 years ago

I'm also having the same issue on Spark 2.2.1, Scala 2.11.8, JDK 1.8

LorenzBuehmann commented 3 years ago

Hi.

Do you really want to use such an old Spark version? Also, SANSA has in the meantime been migrated into a single repository: https://github.com/SANSA-Stack/SANSA-Stack There should be documentation on how to add it to your POM file, i.e. which Maven artifacts and which repositories to use.
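
For the 0.7.x line it amounts to something like the fragment below - a sketch only; the Scala-suffixed artifact id, the version, and the AKSW repository URL are assumptions from memory, so check the SANSA documentation for the current coordinates:

    <repositories>
      <repository>
        <id>maven.aksw.internal</id>
        <url>https://maven.aksw.org/archiva/repository/internal</url>
      </repository>
    </repositories>

    <dependencies>
      <!-- RDF layer only; enough if you just want to load triples into Spark -->
      <dependency>
        <groupId>net.sansa-stack</groupId>
        <artifactId>sansa-rdf-spark_2.11</artifactId>
        <version>0.7.1</version>
      </dependency>
    </dependencies>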

kohpai commented 3 years ago

I have just switched to Spark 2.4.8 and also tried the example in https://github.com/SANSA-Stack/SANSA-Stack, but the problem still persists. I have now downgraded to sansa-rdf-spark-core v0.3.0, which works - but then I can only read NT files.

LorenzBuehmann commented 3 years ago

Wait a second - what exactly do you want to do (loading which files), and how exactly are you using SANSA? The Maven template is nothing more than a stub of the dependencies; you won't even need all of them if, for example, you just want to load the RDF data. And which file format do you want to load? The most efficient one is certainly N-Triples, as this format is splittable.
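
Loading N-Triples then boils down to something like this - a minimal sketch against the 0.7.x API, where the rdf(...) reader on SparkSession comes from the sansa-rdf-spark io package and the input path is a placeholder:

    import net.sansa_stack.rdf.spark.io._ // adds the rdf(...) reader to SparkSession
    import org.apache.jena.riot.Lang
    import org.apache.spark.sql.SparkSession

    object TripleLoader {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("SANSA triple loading")
          .getOrCreate()

        // N-Triples is one triple per line, so Spark can split the input
        // file across partitions and parse each partition independently.
        val triples = spark.rdf(Lang.NTRIPLES)("hdfs:///data/dataset.nt")

        println(s"count: ${triples.count()}")
        spark.stop()
      }
    }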

kohpai commented 3 years ago

We want to use SANSA for loading RDF into Spark, as you guessed. I am aware that we only need sansa-rdf-spark for that task. Ah, so N-Triples is more suitable? We wanted to use TTL solely because the file size is smaller.

kohpai commented 3 years ago

Just to update: I have tried many things and couldn't fix it, but I found an obvious workaround that I hadn't thought of before - the --jars option of spark-submit. Basically, just download the necessary Jena jars from http://archive.apache.org/dist/jena/binaries/ and pass them all when submitting the application. Now I can use SANSA 0.7.2 with Scala 2.12.10 and Spark 3.1.2.
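
The submission then looks roughly like this - a sketch in which the main class, the application jar, and the Jena jar names are placeholders; the Jena version has to match the one SANSA was built against:

    spark-submit \
      --class com.example.MyApp \
      --jars jena-core-3.17.0.jar,jena-arq-3.17.0.jar,jena-base-3.17.0.jar \
      target/my-app-with-dependencies.jar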