Closed: exandi closed this issue 3 years ago
This is almost surely due to conflicting Scala versions. What version of Scala are you using with Spark, and did you match it with the library? You need the _2.11 build of the library for Scala 2.11 and the _2.12 build for Scala 2.12.
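If you're not sure which Scala build your Spark runs on, a quick check from PySpark looks roughly like this (a diagnostic sketch; `_jvm` is PySpark's internal py4j gateway, not a public API):

```python
# Sketch: print the Scala version of the driver JVM so you can pick the matching
# spark-xml artifact (_2.11 vs _2.12). Uses PySpark's internal _jvm gateway.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
scala_version = spark.sparkContext._jvm.scala.util.Properties.versionNumberString()
print(scala_version)
# "2.11.x" -> com.databricks:spark-xml_2.11:<version>
# "2.12.x" -> com.databricks:spark-xml_2.12:<version>
```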
I will check on this tomorrow, but from what I remember, I tried the other version (2.11) and it crashed. After I changed to 2.12 it worked, at least in local mode. Will provide more information tomorrow.
I'm almost certain that's your problem, yes. These errors are, in any event, not library-related.
Hey, the 2.11 version works locally as well, but crashes with the following error on the YARN cluster:
```
Py4JJavaError: An error occurred while calling o155.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage 0.0 (TID 39, HOSTNAME, executor 7): java.io.InvalidClassException: com.databricks.spark.xml.XmlOptions; local class incompatible: stream classdesc serialVersionUID = -1143996978792956522, local class serialVersionUID = 7337562432444654330
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2002)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1849)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2159)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:502)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:460)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2039)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2136)
    at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1098)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.fold(RDD.scala:1092)
    at com.databricks.spark.xml.util.InferSchema$.infer(InferSchema.scala:112)
    at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:42)
    at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:42)
    at scala.Option.getOrElse(Option.scala:121)
    at com.databricks.spark.xml.XmlRelation.<init>(XmlRelation.scala:41)
    at com.databricks.spark.xml.XmlRelation$.apply(XmlRelation.scala:29)
    at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:74)
    at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:52)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.InvalidClassException: com.databricks.spark.xml.XmlOptions; local class incompatible: stream classdesc serialVersionUID = -1143996978792956522, local class serialVersionUID = 7337562432444654330
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2002)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1849)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2159)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:502)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:460)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
```
That looks like you have mixed different versions of the library on your cluster.
You mean different versions of the spark-xml lib? I create the Spark context in a Jupyter notebook and provide the lib before creating the context. Should I provide more detailed information?
I'm not sure exactly how it's happening, but yes, that is what it sounds like. This isn't related to the library directly.
I just ran some tests; there are two versions of the lib that work, so it isn't just one. Will try to dig deeper:

- 0.5.0 crashes
- 0.6.0 runs
- 0.7.0 runs
- 0.8.0 crashes
- 0.9.0 crashes
- 0.10.0 crashes
- 0.11.0 crashes
- 0.12.0 crashes
Are you sure it isn't related to the library, given that two versions of the lib do run? Is it possible that I need to provide some extra dependencies?
I am sure. This can only happen if you are mixing differently-compiled versions.
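One way to rule out mixed jars is to let Spark resolve a single Maven coordinate for both the driver and the executors instead of shipping jar files around by hand, roughly like this (a sketch; the coordinate and version shown are examples and must match your cluster's Scala build):

```python
# Sketch: pin exactly one spark-xml artifact for the whole application via
# spark.jars.packages, so the driver and the YARN executors load the same build.
# The _2.11 / 0.12.0 coordinate below is an example; match it to your Scala version.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")
    .config("spark.jars.packages", "com.databricks:spark-xml_2.11:0.12.0")
    .getOrCreate()
)

df = spark.read.format("xml").options(rowTag="books").load("/path/to/books/xml/")
```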
Hey there, I want to migrate from version 0.6.0 to version 0.12.0 of the spark-xml lib. I use the PySpark interface to interact with the lib. Now I have the following problem:
On version 0.6.0 I can read the XML files like this:
```python
spark.read.format("xml").options(rowTag="books").load("/path/to/books/xml/")
```
Everything works fine. If I use the 0.12.0 version of the lib, I get the following error:

If I run this code on `master("local")` it works fine. It only crashes if I use the YARN cluster of a Hadoop environment. I publish the jar files to the executors via
```python
os.environ['PYSPARK_SUBMIT_ARGS'] = f'--jars {PYSPARK_SUBMIT_ARGS_JARS} pyspark-shell'
```
before I import the pyspark libs and create the Spark context. I hope you can help me fix this issue.
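For completeness, the notebook setup described above looks roughly like this (a sketch; the jar path is a placeholder, and the key points are that the environment variable is set before pyspark is imported and that only one spark-xml jar, matching the cluster's Scala version, is shipped):

```python
# Sketch of the setup described in this issue. The jar path is a placeholder.
import os

PYSPARK_SUBMIT_ARGS_JARS = "/path/to/spark-xml_2.11-0.12.0.jar"  # placeholder
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--jars {PYSPARK_SUBMIT_ARGS_JARS} pyspark-shell"

# Import pyspark only after PYSPARK_SUBMIT_ARGS is set, otherwise the --jars
# argument is not picked up when the driver JVM is launched.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("yarn").getOrCreate()
df = spark.read.format("xml").options(rowTag="books").load("/path/to/books/xml/")
```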