databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

XML parser results in wrong behaviour when custom schema is used #574

Closed sandeep-katta0102 closed 1 year ago

sandeep-katta0102 commented 2 years ago

I used the following XML file and custom schema:

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<invalidSchema>
    <book>
        <title>Five point someone
            <!-- This tag is not present in the schema provided -->
            <_ isApplicable="false" source="">test book</_>
        </title>
    </book>
    <author>
        <fname>Gambardella, Matthew</fname>
    </author>
</invalidSchema>
val schema = buildSchema(
  field("book",
    struct(field("title",
      struct(field("text_val", StringType))))),
  field("author",
    struct(field("fname",
      struct(field("text_val", StringType)))))
)

val invalid_schema_xml = resDir + "invalidSchema.xml"
val results = spark.read.format("xml")
  .option("rowTag", "invalidSchema")
  .option("valueTag", "text_val")
  .option("ignoreSurroundingSpaces", true)
  .schema(schema)
  .load(invalid_schema_xml)
results.show(false)

As you can see in the output below, the author column is null, but it should not be:

+----------------------+------+
|book                  |author|
+----------------------+------+
|{{Five point someone}}|null  |
+----------------------+------+
srowen commented 2 years ago

I think the answer is that the input is malformed with respect to the schema. This uses the default 'permissive' error-handling mode unless configured otherwise, so parsing stops when it hits the problem and the remaining columns are null; you have a partial record here. You can switch to the "FAILFAST" parsing mode to generate an error instead. You can also supply a column in your schema that matches the configured corrupt-record column name, and the raw XML will be output in that column for you too.
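The two suggestions above might be sketched as follows. The reader options (`mode`, `columnNameOfCorruptRecord`) are standard spark-xml options; the variable names and reuse of `schema` / `invalid_schema_xml` from the snippet above are illustrative:

```scala
import org.apache.spark.sql.types.{StringType, StructField}

// Option 1: fail loudly instead of silently nulling columns.
val failFast = spark.read.format("xml")
  .option("rowTag", "invalidSchema")
  .option("valueTag", "text_val")
  .option("mode", "FAILFAST") // throw on the first malformed record
  .schema(schema)
  .load(invalid_schema_xml)

// Option 2: stay permissive, but capture the raw XML of malformed
// records in a dedicated corrupt-record column.
val schemaWithCorrupt = schema.add(StructField("_corrupt_record", StringType))
val permissive = spark.read.format("xml")
  .option("rowTag", "invalidSchema")
  .option("valueTag", "text_val")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schemaWithCorrupt)
  .load(invalid_schema_xml)
```

This is a reader-configuration sketch and needs a running SparkSession with the spark-xml package on the classpath to execute.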

sandeep-katta0102 commented 2 years ago

Whichever mode I use, the result is the same: the author column is always null.

srowen commented 2 years ago

Hm, yeah, that's not quite the issue. I think the issue is the text alongside the child element in <title>. That does not work; for a discussion, see: https://github.com/databricks/spark-xml/issues/516
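For reference, if the mixed content in <title> were to be declared at all, the schema would need both the element's own text (via `valueTag`) and the unexpected `<_>` child. This is a hypothetical sketch using the `buildSchema`/`field`/`struct` test helpers from the snippet above; the nested field names are assumptions based on spark-xml's default attribute prefix `_`, not a confirmed fix for this issue:

```scala
// Hypothetical: declare both the character data of <title> and its <_> child.
val mixedSchema = buildSchema(
  field("book",
    struct(field("title",
      struct(
        field("text_val", StringType),          // "Five point someone"
        field("_", struct(
          field("_isApplicable", StringType),   // attribute isApplicable
          field("_source", StringType),         // attribute source
          field("text_val", StringType))))))),  // "test book"
  field("author",
    struct(field("fname",
      struct(field("text_val", StringType)))))
)
```

Whether the parser actually recovers the surrounding text in this layout is exactly what issue #516 discusses.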

What I can't see immediately is why parsing does not fail with a custom schema. It's going to be related to the fact that it does not expect any text here.

I'll leave it open, but I don't know how to fix it and don't have time to debug. I'd certainly look at more analysis or a pull request if anyone has bright ideas.