databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

XML parser behaves differently for StringType field when custom schema is used #608

Closed: atomobianco closed this issue 1 year ago

atomobianco commented 1 year ago

I'm seeing unexplained differences in read behavior depending on whether a schema is supplied. It may have something in common with this issue. My understanding is that the method StaxXmlParserUtils#currentStructureAsString should be called whenever the parser converts a field declared as StringType. The problem is that the method is not called when a schema is provided.

I took the first test in XmlSuite that calls this method and modified it to pass an explicit schema. With the schema enforced, the method is no longer called:

  test("DSL test with mixed elements (struct, string)") {
    val schema = buildSchema(
      field("age", IntegerType),
      struct("name", field("firstName"))
    )
    val results = spark.read
      .option("rowTag", "person")
      .schema(schema)
      .xml(resDir + "ages-mixed-types.xml")
      .collect()
    assert(results.length === 3)
  }

It seems that when a schema is used, we never enter this case within convertComplicatedType:

...
case _: StringType => StaxXmlParserUtils.currentStructureAsString(parser)

Maybe I defined the schema incorrectly?

srowen commented 1 year ago

Can you give an example? What you're saying sounds right, but what isn't working as you expect?
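
A reproduction along the following lines might make the difference concrete. This is only a sketch under stated assumptions, not a confirmed repro: it reads the same file once with an inferred schema and once with an explicit schema, so the two outputs can be compared side by side. The file path and the object name `SchemaComparison` are hypothetical, and declaring `name` as a plain StringType (rather than the struct used in the test above) is an assumption about which schema shape is meant to reach the StringType case. It needs spark-xml on the classpath.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SchemaComparison {
  // Assumption: `name` is declared as a plain string instead of a struct,
  // so that a nested element has to be converted into a StringType field.
  val explicitSchema: StructType = StructType(Seq(
    StructField("age", IntegerType, nullable = true),
    StructField("name", StringType, nullable = true)
  ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-xml-schema-comparison")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical path: point this at a local copy of ages-mixed-types.xml.
    val path = "src/test/resources/ages-mixed-types.xml"

    // 1. Read with schema inference.
    val inferred = spark.read
      .format("xml")
      .option("rowTag", "person")
      .load(path)
    inferred.printSchema()
    inferred.show(truncate = false)

    // 2. Read the same file with the explicit schema.
    val explicit = spark.read
      .format("xml")
      .option("rowTag", "person")
      .schema(explicitSchema)
      .load(path)
    explicit.printSchema()
    explicit.show(truncate = false)

    spark.stop()
  }
}
```

Posting the two printed schemas and row dumps, together with the output that was expected, would show exactly where the behavior diverges.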