databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Using spark-xml to parse nested xml structure in jupyter notebook #646

Closed Xabitsuki closed 1 year ago

Xabitsuki commented 1 year ago

Hi, I am trying to use spark-xml to parse a column in a DataFrame that contains an XML string. Following the README, I tried to use

from pyspark.sql.column import Column, _to_java_column
from pyspark.sql.functions import col
from pyspark.sql.types import _parse_datatype_json_string

def ext_from_xml(xml_column, schema, options={}):
    # Wrap com.databricks.spark.xml.functions.from_xml for use from PySpark
    java_column = _to_java_column(xml_column.cast('string'))
    java_schema = spark._jsparkSession.parseDataType(schema.json())
    scala_map = spark._jvm.org.apache.spark.api.python.PythonUtils.toScalaMap(options)
    jc = spark._jvm.com.databricks.spark.xml.functions.from_xml(
        java_column, java_schema, scala_map)
    return Column(jc)

def ext_schema_of_xml_df(df, options={}):
    # Infer a Spark schema from a single-column DataFrame of XML strings
    assert len(df.columns) == 1

    scala_options = spark._jvm.org.apache.spark.api.python.PythonUtils.toScalaMap(options)
    java_xml_module = getattr(getattr(
        spark._jvm.com.databricks.spark.xml, "package$"), "MODULE$")
    java_schema = java_xml_module.schema_of_xml_df(df._jdf, scala_options)
    return _parse_datatype_json_string(java_schema.json())

on this very simple DataFrame:


data = [
    ('<root><name>John</name><surname>Doe</surname><list>    <item>1</item>    <item>2</item>    <item>3</item>   <item>4</item>  </list>  <date>2023.03.02</date></root>', 1000)
]

schema = ['xml', 'nbr']

df = spark.createDataFrame(schema=schema, data=data)

xmlSchema = ext_schema_of_xml_df(df.select("xml"))  # error line
parsed_df = df.withColumn("parsed", ext_from_xml(col("xml"), xmlSchema))  # error line

I get

TypeError: 'JavaPackage' object is not callable

---> java_schema = spark._jsparkSession.parseDataType(schema.json())

Setup:

PySpark / Spark 2.4.7, Scala 2.11.12, com.databricks:spark-xml_2.11:0.13.0

SparkSession:

I create the Spark session using SparkConf to pass in all of these config params.
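For reference, a minimal sketch of how the spark-xml dependency can be declared through SparkConf when building the session. This is an assumption about the setup described above (the app name is hypothetical); `spark.jars.packages` is the standard Spark property for resolving Maven coordinates at startup, and the coordinate here matches the versions listed in the Setup section:

```python
from pyspark import SparkConf
from pyspark.sql import SparkSession

# Resolve spark-xml from Maven at session start; the _2.11 suffix must
# match the Scala version of the Spark build (2.11.12 per the setup above).
conf = (
    SparkConf()
    .setAppName("xml-parsing")  # hypothetical app name
    .set("spark.jars.packages", "com.databricks:spark-xml_2.11:0.13.0")
)

spark = SparkSession.builder.config(conf=conf).getOrCreate()
```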

srowen commented 1 year ago

I think this says you don't actually have the spark-xml library installed. Use --packages when you submit, as shown in the README.
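A sketch of the suggested invocation, assuming a PySpark script (the script name `my_job.py` is a placeholder; the coordinate matches the reporter's setup):

```shell
# --packages resolves the artifact and its transitive dependencies from
# Maven Central, unlike --jars, which only ships the single jar you name.
spark-submit \
  --packages com.databricks:spark-xml_2.11:0.13.0 \
  my_job.py
```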

Xabitsuki commented 1 year ago

Hi, thanks for your answer. I tried submitting a job passing the jar with spark-submit --jars, but it did not work...