databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Attributes are both null and not null #578

Closed lotsahelp closed 2 years ago

lotsahelp commented 2 years ago

Sample xml:

```xml
<Invoices>
  <Invoice amountDue="33.00" discount="1.00" ...> ... </Invoice>
  <Invoice amountDue="25.00" discount="" ...> ... </Invoice>
</Invoices>
```

```python
%python
df = (spark.read.format("com.databricks.spark.xml")
      .option("rootTag", "Invoices")
      .option("rowTag", "Invoice")
      .load("/tmp/myxml.xml"))
df.createOrReplaceTempView("invoice_xml")
```

No nulls if I select just `_amountDue`. (screenshot)

If I include `_discount`, the whole row becomes null (row 5). (screenshot)

`WHERE _amountDue is null` (screenshot)

`WHERE _discount is null` / `is not null` shows the appropriate amounts.
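For context, a minimal sketch of the queries described above, assuming the `invoice_xml` temp view from the earlier snippet (the exact SQL is an assumption; the original screenshots are not preserved):

```python
# Hypothetical reconstruction of the queries behind the screenshots above.
spark.sql("SELECT _amountDue FROM invoice_xml").show()             # no nulls
spark.sql("SELECT _amountDue, _discount FROM invoice_xml").show()  # row 5 goes null
spark.sql("SELECT * FROM invoice_xml WHERE _discount IS NULL").show()
spark.sql("SELECT * FROM invoice_xml WHERE _discount IS NOT NULL").show()
```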

srowen commented 2 years ago

I can't reproduce this from the snippet; I think this might be an issue with the data (which could mean it's a known issue as well). Can we see more of the actual input?

lotsahelp commented 2 years ago

I cannot release the file, nor the complete structure. I'll see if I can come up with a file that will reproduce the issue.

srowen commented 2 years ago

I have a repro now. One quick note while I'm debugging: try setting `treatEmptyValuesAsNulls`, which seems to resolve it and may even be what you mean. Still need to figure out the underlying issue, as the second result should not happen.
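A minimal sketch of that suggestion, reusing the reader options from the original snippet:

```python
# spark-xml option: treat empty attribute values as nulls while parsing.
df = (spark.read.format("com.databricks.spark.xml")
      .option("rootTag", "Invoices")
      .option("rowTag", "Invoice")
      .option("treatEmptyValuesAsNulls", "true")
      .load("/tmp/myxml.xml"))
```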

srowen commented 2 years ago

Scratch that: instead, set the option `nullValue` to `""`. I think this is more or less the answer, having stepped through the code. The parser gets an empty string for what should be a double, and that's an error. The resulting row is a partial read, hence the extra null. You can change the parse mode to `FAILFAST` or similar to make the error more apparent.
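A minimal sketch of that fix, again reusing the options from the original snippet, with the `FAILFAST` mode shown as an optional way to surface the parse error instead of silently nulling the row:

```python
# Treat "" as null so the empty discount attribute parses as a null double.
df = (spark.read.format("com.databricks.spark.xml")
      .option("rootTag", "Invoices")
      .option("rowTag", "Invoice")
      .option("nullValue", "")
      # .option("mode", "FAILFAST")  # alternative: raise on malformed rows
      .load("/tmp/myxml.xml"))
```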

The question, though, is why this column was inferred as double if it had a value that can't be a double. That's inconsistent: it should really be inferred as string if `""` is not treated as null. I could 'fix' that, though in this case double is actually the type you want.
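One way to side-step inference entirely (an illustration, not something proposed in the thread) is to supply an explicit schema, keeping `_discount` a string so `""` survives as data:

```python
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

# Explicit schema: XML attributes get the default "_" prefix in spark-xml.
schema = StructType([
    StructField("_amountDue", DoubleType(), True),
    StructField("_discount", StringType(), True),  # keep "" as a string
])
df = (spark.read.format("com.databricks.spark.xml")
      .option("rowTag", "Invoice")
      .schema(schema)
      .load("/tmp/myxml.xml"))
```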

lotsahelp commented 2 years ago

Thanks @srowen, that fixed our issue. Very much appreciated.