databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0
500 stars 226 forks source link

rowtag not recognised when using ext_from_xml #597

Closed charlottevdscheun closed 2 years ago

charlottevdscheun commented 2 years ago

I am using this function to extract from xml: https://github.com/databricks/spark-xml#pyspark-notes The problem is that it doesnt seem to get the row tag from the options. It works when i include the rowtag inside the schema. but that is what the rowtag option is for.

my code:

df_read = (
    spark.readStream.format('text')
    .options({'wholetext': True})
    .load('/src/test.xml')
)

df_xml = df_read.withColumn(
    "value",
    ext_from_xml(
        F.col("value"),
        schema=schema,
        options={'rowTag': 'article'},
    ),
)

/src/test.xml:

<xml><article><number>1</number><description>Cheese</description></article></xml>

df_read:

+--------------------+
|               value|
+--------------------+
|<xml><article><nu...|
+--------------------+

schema:

StructType(List(StructField(number,LongType,true),StructField(description,StringType,true)))

df_xml:

+--------------------+
| number| description|
+--------------------+
|  null|   null|
+--------------------+

If i add the rowtag inside the schema, df_xml will look like this:

+--------------------+
|           article|
+--------------------+
|[{1, Cheese}]|
+--------------------+

Then the rowtag is included in the table (as expected cause i included it in the schema) But i want the rowtag to be recognised from the options and not included in the table and schema. This is the table that i am expecting: df_xml:

+--------------------+
| number| description|
+--------------------+
|  1|   Cheese|
+--------------------+

Did i provide the wrong options? Or does the rowtag just not work in the ext_from_xml?

srowen commented 2 years ago

The assumption is that you already have your rows - you have a DataFrame of rows - so rowTag is not relevant. The XML in the col is all to be parsed. I could look at whether it's reasonable to provide a separate function that parses results into an array and uses rowTag, not sure if it's hard.

charlottevdscheun commented 2 years ago

Hi srowen,

Thanks for the response. I understand that it doesnt look at rowtag if it expects to get dataframe rows. For my use case I fixed it with this:

# add rowtag into schema
rowtag = 'article'
schema_xml = StructType([StructField(rowtag, ArrayType((schema), True))])

df_xml = df_read.withColumn(
    "value",
    ext_from_xml(  
        F.col("value"),
        schema=schema_xml
    ),   
)
df_xml.select(F.explode(f"value.{rowtag}")).select("col.*")