Closed charlottevdscheun closed 2 years ago
The assumption is that you already have your rows - you have a DataFrame of rows - so rowTag is not relevant. The XML in the col is all to be parsed. I could look at whether it's reasonable to provide a separate function that parses results into an array and uses rowTag, not sure if it's hard.
Hi srowen,
Thanks for the response. I understand that it doesnt look at rowtag if it expects to get dataframe rows. For my use case I fixed it with this:
# add rowtag into schema
rowtag = 'article'
schema_xml = StructType([StructField(rowtag, ArrayType((schema), True))])
df_xml = df_read.withColumn(
"value",
ext_from_xml(
F.col("value"),
schema=schema_xml
),
)
df_xml.select(F.explode(f"value.{rowtag}")).select("col.*")
I am using this function to extract from xml: https://github.com/databricks/spark-xml#pyspark-notes The problem is that it doesnt seem to get the row tag from the options. It works when i include the rowtag inside the schema. but that is what the rowtag option is for.
my code:
/src/test.xml:
df_read:
schema:
df_xml:
If i add the rowtag inside the schema, df_xml will look like this:
Then the rowtag is included in the table (as expected cause i included it in the schema) But i want the rowtag to be recognised from the options and not included in the table and schema. This is the table that i am expecting: df_xml:
Did i provide the wrong options? Or does the rowtag just not work in the ext_from_xml?