Closed sandeep-katta0102 closed 1 year ago
I think the answer is that the input is malformed with respect to the schema. Parsing uses the default PERMISSIVE error-handling mode unless configured otherwise, so it stops when it hits the problem and the remaining columns are null; you end up with a partial record here. You can switch to FAILFAST parsing mode to raise an error instead. You can also add a column to your schema whose name matches the configured corrupt-record column name, and the raw XML will be output in that column for you.
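As a concrete illustration of the options mentioned above, here is a hedged PySpark configuration sketch (the file name, row tag, and schema fields are assumptions for illustration; spark-xml must be on the classpath):

```python
# Hypothetical sketch, not the reporter's actual code:
# "books.xml", rowTag "book", and the field names are assumed.
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("title", StringType()),
    StructField("author", StringType()),
    # A column with the corrupt-record name captures rows that fail to parse.
    StructField("_corrupt_record", StringType()),
])

df = (spark.read.format("xml")
      .option("rowTag", "book")
      .option("mode", "FAILFAST")  # default is "PERMISSIVE"
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .schema(schema)
      .load("books.xml"))
```

With PERMISSIVE mode instead, a bad row would come through with nulls plus the original XML in `_corrupt_record`; FAILFAST throws on the first malformed record.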
Whatever mode I use, the result is the same, i.e. the author column is always null.
Hm, yeah, that's not quite the issue. I think the problem is the text alongside the child element in <title>; that does not work. For a discussion, see: https://github.com/databricks/spark-xml/issues/516
What I can't see immediately is why parsing does not fail with a custom schema. It's going to be related to the fact that it does not expect any text here.
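To make the mixed-content problem concrete, here is a minimal stdlib sketch (the element names are hypothetical, not the reporter's file) showing that a `<title>` containing both text and a child element splits its character data across the element's `.text` and the child's `.tail`:

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: <title> holds text *and* a child element ("mixed content").
xml = """<book>
  <title>Some text <b>bold part</b> tail text</title>
  <author>Jane Doe</author>
</book>"""

root = ET.fromstring(xml)
title = root.find("title")

# .text only covers the characters before the first child element;
# the characters after the child live on the child's .tail.
print(repr(title.text))            # 'Some text '
print(repr(title.find("b").text))  # 'bold part'
print(repr(title.find("b").tail))  # ' tail text'
```

A schema that expects `<title>` to be a plain string (or a plain struct) has no slot for this split, which is plausibly why spark-xml mishandles the record rather than failing cleanly.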
I'll leave it open, but I don't know how to fix it and don't have time to debug. I'd certainly look at further analysis or a pull request if anyone has bright ideas.
I used the following XML file and custom schema.
As per the output below, you can see that author is null, but it should not be null.