databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0
500 stars, 226 forks

Cannot parse XML node value (text), when there is an inner XML node #592

Closed: MartinChavezNC closed this issue 2 years ago

MartinChavezNC commented 2 years ago


Given the following XML:

<parent>
    <child>Child Text<childNode value="123"/></child>
</parent>

Reading the XML using Spark XML:

parents = spark.read.format('xml').options(rowTag='parent', valueTag='closeTag').load('spark_xml_bug.xml')
parents.printSchema()
parents.show()

[Image: res_schema]

This does not give the desired result:

from pyspark.sql.functions import col

parents = parents.withColumn('child_text', col('child'))
parents = parents.withColumn('child_node_value', col('child.childNode._value'))
parents.show()

[Image: res_2]

I expect the value for child_text to be 'Child Text' and not NULL.
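
To make the expected values concrete, here is a minimal sketch that reads the same XML with Python's standard-library xml.etree.ElementTree. It does not use Spark XML and is not a fix for the issue; it only shows the two values being asked for. The variable names are illustrative, and the XML is inlined rather than loaded from spark_xml_bug.xml.

```python
import xml.etree.ElementTree as ET

# Same document as above, inlined so the sketch is self-contained.
XML = """<parent>
    <child>Child Text<childNode value="123"/></child>
</parent>"""

parent = ET.fromstring(XML)
child = parent.find("child")

# .text holds the character data that precedes the first inner element.
child_text = child.text

# Attributes of the inner element are read with .get().
child_node_value = child.find("childNode").get("value")

print(child_text)        # Child Text
print(child_node_value)  # 123
```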

Any ideas on how to solve this issue? @srowen, @HyukjinKwon

PS: I tried defining a schema that forces child to be a string type (instead of a struct). However, every node after it is then ignored, although I do see the string 'Child Text'.

srowen commented 2 years ago

Duplicate of https://github.com/databricks/spark-xml/issues/516