databricks / spark-xml

XML data source for Spark SQL and DataFrames

Issue with scala: java.lang.NoClassDefFoundError: scala/$less$colon$less #632

Closed coperator closed 1 year ago

coperator commented 1 year ago

In an OpenJDK Docker image I installed Python and then PySpark with pip, and used

spark = SparkSession.builder.appName("MyApp") \
            .config("spark.jars.packages", "com.databricks:spark-xml_2.13:0.16.0") \
            .getOrCreate()

to load spark-xml. When calling

dataFrame = spark.read\
    .format("xml")\
    .option("rowTag", "Obs")\
    .load("/data/test.xml")

I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "/usr/local/lib/python3.7/site-packages/pyspark/sql/readwriter.py", line 158, in load
    return self._df(self._jreader.load(path))
  File "/usr/local/lib/python3.7/site-packages/py4j/java_gateway.py", line 1310, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/local/lib/python3.7/site-packages/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/usr/local/lib/python3.7/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o35.load.
: java.lang.NoClassDefFoundError: scala/$less$colon$less
    at com.databricks.spark.xml.XmlOptions$.apply(XmlOptions.scala:82)
    at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:66)
    at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:52)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:274)
    at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:245)
    at scala.Option.getOrElse(Option.scala:189)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:245)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:188)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.lang.ClassNotFoundException: scala.$less$colon$less
    at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    ... 21 more

What am I missing?

coperator commented 1 year ago

Is it a version mismatch issue? I used PySpark 3.2.0, OpenJDK 8, Python 3.9.8, and com.databricks:spark-xml_2.13:0.16.0.
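
A quick way to confirm which Scala version a PySpark build targets is to ask the driver JVM through py4j. This is a minimal diagnostic sketch; versionString comes from the Scala standard library, reached via the SparkContext's _jvm gateway:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ScalaVersionCheck").getOrCreate()

# scala.util.Properties.versionString() reports the Scala standard library
# on the driver classpath, e.g. "version 2.12.15" for a stock PySpark 3.2.0.
print(spark.sparkContext._jvm.scala.util.Properties.versionString())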

srowen commented 1 year ago

It is a Scala version mismatch: you are probably using Scala 2.12 but have added the 2.13 artifact. Use _2.12.
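
For background, scala/$less$colon$less is the JVM-mangled name of Scala's <:< class, which exists as a top-level class only in the 2.13 standard library (in 2.12 it lives inside Predef), so an artifact compiled against 2.13 cannot load on a 2.12 runtime. A minimal sketch of the working configuration, assuming a stock PySpark 3.2.0 build (Scala 2.12):

from pyspark.sql import SparkSession

# The artifact suffix (_2.12) must match the Scala version Spark itself
# was built with; stock PySpark 3.2.0 ships with Scala 2.12.
spark = SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "com.databricks:spark-xml_2.12:0.16.0") \
    .getOrCreate()

dataFrame = spark.read \
    .format("xml") \
    .option("rowTag", "Obs") \
    .load("/data/test.xml")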

coperator commented 1 year ago

Ah, yes! Thank you for the quick reply. Easy to miss. It might be good to highlight in the spark-xml documentation that the Scala versions must match. The default for Spark and PySpark is currently still Scala 2.12.

srowen commented 1 year ago

That's just true of any Scala dependency, though. I think many people would know that, and you'd run into the same thing with other libraries.