Closed clemj21 closed 1 year ago
I don't understand the issue here. Please show the XML and result.
Sure thing.
XML
<xml>
<items>
<item>
<a>
<b>testing</b>
</a>
<item>
</item>
<a>
</a>
</item>
</items>
</xml>
When I load the XML with these options:
testdf = spark.read.format('xml') \
.option("rowTag" ,"item") \
.option("inferSchema", "false") \
.option("ignoreSurroundingSpaces", "true") \
.load('dbfs:/user/test/text.xml')
I get this unwanted result:
When I run with the "treatEmptyValuesAsNulls", "true" option I get the wanted result:
testdf = spark.read.format('xml') \
.option("rowTag" ,"item") \
.option("inferSchema", "false") \
.option("ignoreSurroundingSpaces", "true") \
.option("treatEmptyValuesAsNulls", "true") \
.load('dbfs:/user/test/text.xml')
The README states that this option is DEPRECATED: treatEmptyValuesAsNulls : (DEPRECATED: use nullValue set to "") Whether you want to treat whitespaces as a null value. Default is false
I am assuming I don't understand where to set nullValue, I set it in the options and get the unwanted result:
testdf = spark.read.format('xml') \
.option("rowTag" ,"item") \
.option("inferSchema", "false") \
.option("ignoreSurroundingSpaces", "true") \
.option("nullValue", "") \
.load('dbfs:/user/test/text.xml')
spark-xml version:2.12:0.15.0
Thanks for your help.
(Minor, there's a problem with your XML example - the item close/open tags are switched in the middle. I imagine that's just a typo in the comment)
I think this example is 'invalid' or at least ambiguous, as given. For a
to be null in the second row, it would have to not exist. For a
to contain a null value b
, <b>
would have to exist but be empty. It's not clear what an empty <a>
should mean here. Is a null? or does a exist with a null b?
The behavior of the two settings does result in two different interpretations, yeah, though I think the nullValue
one is more correct. It kind of gives up and assumes it is a string. Is this example like your real data?
For the scenario
When reading an XML file the resulting DataFrame schema show a as an array and b as a string which I would expect.
when reading with .option("treatEmptyValuesAsNulls,"true") the resulting DataFrame schema for a is an array with element : struct and b as a string which is what I want.
Since treatEmptyValueAsNulls has been DEPRECATED I used the suggested nullValue and set to "" .option("nullValue",""). The resulting DataFrame schema it is back to showing a as an array and element:string which I do not want.
Since treatEmptyValueAsNulls has been DEPRECATED how do I achieve the same result using nullValue?
Thanks.