Closed lotsahelp closed 2 years ago
I can't reproduce this from the snippet; I think this might be an issue with the data (meaning it could be a known issue as well). Can we see more of the actual input?
I cannot release the file, nor the complete structure. I'll see if I can come up with a file that will reproduce the issue.
I have a repro now. One quick note while I'm debugging: try setting `treatEmptyValuesAsNulls`, which seems to resolve it and may even be what you mean. I still need to figure out the underlying issue, as the second result should not happen.
Scratch that -- actually set the option `nullValue` to `""`. I think this is more or less the answer, having stepped through the code. The parser gets an empty string for what should be a double, and that's an error. The resulting row is a partial read, hence the extra null. You can change the parse mode to `FAILFAST` or something similar to make the error more apparent.
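For reference, here is a sketch of how that workaround might look applied to the reader from the snippet below. It assumes a live `spark` session and the spark-xml reader options `nullValue` and `mode`; it is not run here, and the file path is just the one from the example.

```python
# Sketch, not run: requires a Spark session with the spark-xml package.
df = (spark.read.format("com.databricks.spark.xml")
      .option("rootTag", "Invoices")
      .option("rowTag", "Invoice")
      .option("nullValue", "")        # treat "" as null instead of a failed double parse
      # .option("mode", "FAILFAST")   # uncomment to surface the parse error instead
      .load("/tmp/myxml.xml"))
```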
The question, though, is why this column was inferred as double if it had a value that can't be a double. That's inconsistent: it should really be inferred as string if `""` is not treated as null. I could 'fix' that, though in this case double is actually the type you want.
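To illustrate the mismatch in plain Python (this is a hypothetical `parse_double` helper mirroring the behavior described above, not spark-xml's actual code):

```python
def parse_double(raw, null_value=None):
    """Hypothetical sketch of the parse step discussed above:
    a field equal to `null_value` becomes null; anything else must
    parse as a double, so "" fails unless it is the null token."""
    if raw == null_value:
        return None
    return float(raw)

parse_double("33.00")            # parses fine -> 33.0
parse_double("", null_value="")  # "" mapped to null, as with nullValue=""
# parse_double("")               # would raise ValueError: "" is not a double
```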
Thanks @srowen, that fixed our issue. Very much appreciated.
Sample XML:

```xml
<Invoices>
  <Invoice amountDue="33.00" discount="1.00" ...> ... </Invoice>
  <Invoice amountDue="25.00" discount="" ...> ... </Invoice>
</Invoices>
```
```python
%python
df = (spark.read.format("com.databricks.spark.xml")
      .option("rootTag", "Invoices")
      .option("rowTag", "Invoice")
      .load("/tmp/myxml.xml"))
df.createOrReplaceTempView("invoice_xml")
```
- No nulls if I select just `_amountDue`.
- If I include `_discount`, the whole row becomes null (row 5).
- `WHERE _amountDue IS NULL`
- `WHERE _discount IS NULL` / `IS NOT NULL` shows appropriate amounts.