databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0
500 stars 226 forks source link

DEPRECATED treatEmptyValueAsNulls works/suggested nullValue set to "" does not #604

Closed clemj21 closed 1 year ago

clemj21 commented 2 years ago

For the scenario

testing

When reading an XML file the resulting DataFrame schema show a as an array and b as a string which I would expect.

when reading with .option("treatEmptyValuesAsNulls,"true") the resulting DataFrame schema for a is an array with element : struct and b as a string which is what I want.

Since treatEmptyValueAsNulls has been DEPRECATED I used the suggested nullValue and set to "" .option("nullValue",""). The resulting DataFrame schema it is back to showing a as an array and element:string which I do not want.

Since treatEmptyValueAsNulls has been DEPRECATED how do I achieve the same result using nullValue?

Thanks.

srowen commented 2 years ago

I don't understand the issue here. Please show the XML and result.

clemj21 commented 2 years ago

Sure thing.

XML

<xml>
  <items>
    <item>
      <a>
        <b>testing</b>
       </a>
    <item>
    </item>
      <a>
      </a>
     </item>
  </items>
</xml>

When I load the XML with these options:

testdf = spark.read.format('xml') \
    .option("rowTag" ,"item") \
    .option("inferSchema", "false") \
    .option("ignoreSurroundingSpaces", "true") \
    .load('dbfs:/user/test/text.xml')

I get this unwanted result: image

When I run with the "treatEmptyValuesAsNulls", "true" option I get the wanted result:

testdf = spark.read.format('xml') \
    .option("rowTag" ,"item") \
    .option("inferSchema", "false") \
    .option("ignoreSurroundingSpaces", "true") \
    .option("treatEmptyValuesAsNulls", "true") \
    .load('dbfs:/user/test/text.xml')

image

The README states that this option is DEPRECATED: treatEmptyValuesAsNulls : (DEPRECATED: use nullValue set to "") Whether you want to treat whitespaces as a null value. Default is false

I am assuming I don't understand where to set nullValue, I set it in the options and get the unwanted result:

testdf = spark.read.format('xml') \
    .option("rowTag" ,"item") \
    .option("inferSchema", "false") \
    .option("ignoreSurroundingSpaces", "true") \
    .option("nullValue", "") \
    .load('dbfs:/user/test/text.xml')

image

spark-xml version:2.12:0.15.0

Thanks for your help.

srowen commented 2 years ago

(Minor, there's a problem with your XML example - the item close/open tags are switched in the middle. I imagine that's just a typo in the comment)

I think this example is 'invalid' or at least ambiguous, as given. For a to be null in the second row, it would have to not exist. For a to contain a null value b, <b> would have to exist but be empty. It's not clear what an empty <a> should mean here. Is a null? or does a exist with a null b?

The behavior of the two settings does result in two different interpretations, yeah, though I think the nullValue one is more correct. It kind of gives up and assumes it is a string. Is this example like your real data?