databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Py4JJavaError: An error occurred while calling o352.load. #572

Closed vnteleah closed 2 years ago

vnteleah commented 2 years ago

I'm having issues when I try to parse books.xml

My tools: Python 3.6.3, Spark 2.4.3

Setting up spark:

    spark = SparkSession\
        .builder\
        .config('com.databricks:spark-xml_2.12:0.14.0')\
        .getOrCreate()

I even tried using:

    spark = SparkSession\
        .builder\
        .config('spark.jars.packages')\
        .getOrCreate()

Parsing:

    df = spark.read.format('xml')\
        .options(rowTag = 'book')\
        .load('books.xml')

Attached is the error message I receive when I try to parse the books.xml file you provided.

What might be causing this issue?

[attached image: issue1]

srowen commented 2 years ago

It looks like you have not added the spark-xml library. Calling .config('com.databricks:spark-xml_2.12:0.14.0') is not how you add dependencies.

vnteleah commented 2 years ago

How do I add the spark-xml library to my spark session then?

srowen commented 2 years ago

I don't think you can add them after startup. You would need to put the library on the path before startup. I'm not sure whether the %configure magic works in Jupyter; I think that might be specific to this Spark distribution: https://docs.microsoft.com/en-us/azure/hdinsight/spark/apache-spark-jupyter-notebook-use-external-packages
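For reference, here is a minimal sketch of two common ways to get the package resolved before Spark starts. It assumes no SparkSession or SparkContext is already running in the Python process. The property `spark.jars.packages` and the environment variable `PYSPARK_SUBMIT_ARGS` are standard Spark/PySpark settings. The helper names (`packages_conf`, `submit_args`, `build_session`, `load_books`) are hypothetical and exist only for illustration. The `_2.12` suffix in the coordinate must match the Scala version your Spark build uses, which may not be the case for a given Spark 2.4.3 install, so treat the coordinate as an assumption to verify.

```python
import os

# Coordinate taken from the thread. Assumption: the Scala suffix (_2.12)
# matches the Scala version of the Spark build in use.
SPARK_XML = "com.databricks:spark-xml_2.12:0.14.0"


def packages_conf(*coordinates):
    """Return the (key, value) pair for spark.jars.packages.

    The value is a comma-separated list of Maven coordinates.
    """
    return "spark.jars.packages", ",".join(coordinates)


def submit_args(*coordinates):
    """Return a value for the PYSPARK_SUBMIT_ARGS environment variable.

    It has to be set before the first SparkSession/SparkContext is created.
    """
    return "--packages {} pyspark-shell".format(",".join(coordinates))


def build_session(*coordinates):
    """Create a session with the packages passed as a key/value config."""
    # Imported lazily so the helpers above work without PySpark installed.
    from pyspark.sql import SparkSession

    key, value = packages_conf(*coordinates)
    # .config() takes a key AND a value; passing only the coordinate
    # (as in the original snippet) does not add the dependency.
    return SparkSession.builder.config(key, value).getOrCreate()


def load_books(path="books.xml"):
    """Parse the sample file with the 'xml' data source from spark-xml."""
    spark = build_session(SPARK_XML)
    return spark.read.format("xml").options(rowTag="book").load(path)


# Alternative for notebooks, run before any Spark object exists:
# os.environ["PYSPARK_SUBMIT_ARGS"] = submit_args(SPARK_XML)
```

Calling `load_books()` in a fresh interpreter should then find the `xml` format, provided the machine can reach a Maven repository to download the package. If a session already exists, `getOrCreate()` returns it and the package setting has no effect, so restart the kernel first.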