databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

0.6.0 to 0.12.0 | Spark local works, Spark on YARN crashes #553

Closed exandi closed 3 years ago

exandi commented 3 years ago

Hey there, I want to migrate from version 0.6.0 to version 0.12.0 of the spark-xml library. I use the PySpark interface to interact with it. Now I have the following problem:

On version 0.6.0 I can read the XML files like spark.read.format("xml").options(rowTag="books").load("/path/to/books/xml/"), and everything works fine. If I use the 0.12.0 version of the library, I get the following error:

```
21/08/19 14:33:18 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 14, HOSTNAME, executor 2): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
    at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)
    at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2410)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:502)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:460)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

21/08/19 14:33:18 ERROR TaskSetManager: Task 3 in stage 1.0 failed 4 times; aborting job
```

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/readwriter.py", line 166, in load
    return self._df(self._jreader.load(path))
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/hdp/current/spark2-client/python/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/usr/hdp/current/spark2-client/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o143.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 1.0 failed 4 times, most recent failure: Lost task 3.3 in stage 1.0 (TID 24, HOSTNAME, executor 2): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
    at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)
    at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2410)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:502)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:460)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```

```
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2039)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2136)
    at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1098)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.fold(RDD.scala:1092)
    at com.databricks.spark.xml.util.InferSchema$.infer(InferSchema.scala:112)
    at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:42)
    at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:42)
    at scala.Option.getOrElse(Option.scala:121)
    at com.databricks.spark.xml.XmlRelation.<init>(XmlRelation.scala:41)
    at com.databricks.spark.xml.XmlRelation$.apply(XmlRelation.scala:29)
    at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:74)
    at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:52)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
    at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2301)
    at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1431)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2410)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:502)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:460)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
```

If I run this code with master("local"), it works fine. It only crashes when I use the YARN cluster of a Hadoop environment.

I publish the jar files to the executors via os.environ['PYSPARK_SUBMIT_ARGS'] = f'--jars {PYSPARK_SUBMIT_ARGS_JARS} pyspark-shell' before I import the PySpark libraries and create the Spark context.
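For completeness, here is roughly how that setup looks as a runnable sketch; the jar path is a placeholder for illustration, not my real location:

```python
import os

# Hypothetical path; in the real setup this points at the spark-xml jar
# that should be shipped to the executors.
PYSPARK_SUBMIT_ARGS_JARS = "/path/to/spark-xml_2.12-0.12.0.jar"

# Must be set before any pyspark import, otherwise the JVM has already
# been launched and the --jars argument is ignored.
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--jars {PYSPARK_SUBMIT_ARGS_JARS} pyspark-shell"

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("yarn").getOrCreate()
```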

I hope you can help me fix this issue.

srowen commented 3 years ago

This is almost surely due to conflicting Scala versions. What version of Scala are you using with Spark, and did you match the library to it? You need the _2.11 artifact for Scala 2.11 and the _2.12 artifact for Scala 2.12.
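If you are unsure which Scala version your Spark build uses, one quick check from PySpark is the sketch below; it reaches through py4j internals (`_jvm`), so treat it as a diagnostic assumption rather than a stable API:

```python
# Prints something like "version 2.11.12"; pick the matching artifact:
# Scala 2.11.x -> spark-xml_2.11, Scala 2.12.x -> spark-xml_2.12.
print(spark.sparkContext._jvm.scala.util.Properties.versionString())
```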

exandi commented 3 years ago

I will check on this tomorrow, but from memory, I tried the other version (2.11) and it crashed. After I changed to 2.12 it worked, at least in local mode. I will provide more information tomorrow.

srowen commented 3 years ago

I'm almost certain that's your problem, yes. In any event, these errors are not related to the library.

exandi commented 3 years ago

Hey, the 2.11 version works locally as well, but crashes with the following error on the YARN cluster:

```
Py4JJavaError: An error occurred while calling o155.load.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 0.0 failed 4 times, most recent failure: Lost task 6.3 in stage 0.0 (TID 39, HOSTNAME, executor 7): java.io.InvalidClassException: com.databricks.spark.xml.XmlOptions; local class incompatible: stream classdesc serialVersionUID = -1143996978792956522, local class serialVersionUID = 7337562432444654330
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2002)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1849)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2159)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:502)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:460)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```

```
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2039)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2136)
    at org.apache.spark.rdd.RDD$$anonfun$fold$1.apply(RDD.scala:1098)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
    at org.apache.spark.rdd.RDD.fold(RDD.scala:1092)
    at com.databricks.spark.xml.util.InferSchema$.infer(InferSchema.scala:112)
    at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:42)
    at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:42)
    at scala.Option.getOrElse(Option.scala:121)
    at com.databricks.spark.xml.XmlRelation.<init>(XmlRelation.scala:41)
    at com.databricks.spark.xml.XmlRelation$.apply(XmlRelation.scala:29)
    at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:74)
    at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:52)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:341)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:174)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.InvalidClassException: com.databricks.spark.xml.XmlOptions; local class incompatible: stream classdesc serialVersionUID = -1143996978792956522, local class serialVersionUID = 7337562432444654330
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2002)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1849)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2159)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2404)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2328)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1666)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:502)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:460)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
```

srowen commented 3 years ago

That looks like you have mixed different versions of the library on your cluster.
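One way to see which jars the driver actually registered is the sketch below; it goes through py4j internals (`_jsc`), so treat it as a diagnostic assumption rather than a stable API:

```python
# Lists every jar added via --jars / spark.jars; look for duplicate or
# mismatched spark-xml entries.
print(spark.sparkContext._jsc.sc().listJars().mkString("\n"))
```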

exandi commented 3 years ago

You mean different versions of the spark-xml lib? I create the Spark context in a Jupyter notebook and provide the lib before creating the context. Should I provide more detailed information?

srowen commented 3 years ago

I'm not sure exactly how it's happening, but yeah, that is what it sounds like. This isn't directly related to the library.
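One way to rule out locally shipped jars entirely (a sketch, assuming the cluster can resolve artifacts from Maven Central) is to let Spark pull a single consistent artifact for both driver and executors:

```python
from pyspark.sql import SparkSession

# spark.jars.packages resolves the coordinate once and distributes the
# same build everywhere; the _2.11 / _2.12 suffix must match your Scala version.
spark = (SparkSession.builder
         .master("yarn")
         .config("spark.jars.packages", "com.databricks:spark-xml_2.11:0.12.0")
         .getOrCreate())
```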

exandi commented 3 years ago

I just ran two tests; two versions of the lib work, so it isn't just one. I will try to dig deeper:

- 0.5.0 crashes
- 0.6.0 runs
- 0.7.0 runs
- 0.8.0 crashes
- 0.9.0 crashes
- 0.10.0 crashes
- 0.11.0 crashes
- 0.12.0 crashes

exandi commented 3 years ago

Are you sure it isn't related to the library? Two versions of the lib do run. Is it possible that I need to provide some extra dependencies?

srowen commented 3 years ago

I am sure. This can only happen if you are mixing differently-compiled versions.
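Since the InvalidClassException compares the serialVersionUID of com.databricks.spark.xml.XmlOptions on the driver against the one on the executor, one way to verify the driver side is the sketch below; it uses py4j internals and plain Java reflection, so treat the approach as an assumption, not a stable API:

```python
# Ask the driver JVM where it loaded XmlOptions from, then compare that
# jar with the one shipped to the executors.
jvm = spark.sparkContext._jvm
cls = jvm.java.lang.Class.forName("com.databricks.spark.xml.XmlOptions")
print(cls.getProtectionDomain().getCodeSource().getLocation())
```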