databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Attribute values of nested fields are lost if option "attributePrefix" has empty value #651

Closed voban closed 11 months ago

voban commented 1 year ago

Hello! I am using Spark 3.2.1 (Scala 2.12) and the latest version of Spark XML, 0.16.0.

I am attempting to read this XML file:

<catalog>
  <book id="100">
    <author>
      <info name="Jack"/>
    </author>
  </book>
</catalog>

I am using this code to do so:

Dataset<Row> dataset = spark.read().format("xml")
      .option("rootTag", "catalog")
      .option("rowTag", "book")
      .option("attributePrefix", "")
      .load("somefile.xml");
dataset.printSchema();
dataset.show(false);

I see the following result:

root
 |-- author: struct (nullable = true)
 |    |-- info: struct (nullable = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- name: string (nullable = true)
 |-- id: long (nullable = true)
+------+---+
|author|id |
+------+---+
|{null}|100|
+------+---+

The schema of the dataset is correct, but the content is wrong: the value "Jack" of the attribute "name" is lost, and the "author" struct shows up as {null}.

If I run the same code without setting the option "attributePrefix" to the empty string "", both the schema and the content are correct:

Dataset<Row> dataset = spark.read().format("xml")
      .option("rootTag", "catalog")
      .option("rowTag", "book")
      .load("somefile.xml");
dataset.printSchema();
dataset.show(false);

root
 |-- _id: long (nullable = true)
 |-- author: struct (nullable = true)
 |    |-- info: struct (nullable = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _name: string (nullable = true)

+---+--------------+
|_id|author        |
+---+--------------+
|100|{{null, Jack}}|
+---+--------------+

I need to read the XML file with an empty attribute prefix. After studying the source code of your project, I suggest changing lines 160-162 of StaxXmlParser.scala. Before:

val attributesOnly = st.fields.forall { f =>
  f.name == options.valueTag || f.name.startsWith(options.attributePrefix)
}

After:

val isPrefixEmpty = options.attributePrefix.isEmpty
val attributesOnly = st.fields.forall { f =>
  f.name == options.valueTag || (!isPrefixEmpty && f.name.startsWith(options.attributePrefix))
}
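
To illustrate why the empty prefix trips the original condition: `startsWith("")` is true for every string, so with an empty prefix every field name looks like an attribute, including the child element "info" of "author". The standalone Java sketch below mirrors the two conditions on plain lists of field names. It does not use spark-xml; the class and method names are hypothetical and exist only for this illustration.

```java
import java.util.Arrays;
import java.util.List;

public class AttributesOnlyCheck {

    // Mirrors the original condition: every field is either the value tag
    // or starts with the attribute prefix.
    static boolean originalCheck(List<String> fieldNames, String valueTag, String attributePrefix) {
        return fieldNames.stream()
                .allMatch(n -> n.equals(valueTag) || n.startsWith(attributePrefix));
    }

    // Mirrors the proposed condition: an empty prefix never marks a field
    // as an attribute.
    static boolean proposedCheck(List<String> fieldNames, String valueTag, String attributePrefix) {
        boolean isPrefixEmpty = attributePrefix.isEmpty();
        return fieldNames.stream()
                .allMatch(n -> n.equals(valueTag)
                        || (!isPrefixEmpty && n.startsWith(attributePrefix)));
    }

    public static void main(String[] args) {
        // The "author" struct from the example has a single child element, "info".
        List<String> authorFields = Arrays.asList("info");

        // Empty prefix, original condition: "info".startsWith("") is true.
        System.out.println(originalCheck(authorFields, "_VALUE", ""));  // true
        // Default prefix "_", original condition: "info" is not an attribute.
        System.out.println(originalCheck(authorFields, "_VALUE", "_")); // false
        // Empty prefix, proposed condition: "info" is no longer misclassified.
        System.out.println(proposedCheck(authorFields, "_VALUE", ""));  // false
    }
}
```

With a non-empty prefix the two conditions agree, so the change only affects the empty-prefix case.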

This fix helps me and at the same time doesn't break any of your unit tests.

Thanks and regards!

srowen commented 1 year ago

Just don't set the attribute prefix? I don't see why that is necessary. You can rename the col as you like. Does your change make all tests pass though?

voban commented 1 year ago

In our project, we load big data files with a complex structure (arrays and structs nested up to 10 levels deep). It seems to me that renaming nested fields in a dataset with such a structure is technically harder than simply not adding the prefix when reading. Yes, all tests pass with my change.

srowen commented 1 year ago

It may lead to some ambiguities, but I suppose it's fine to try to allow it. Open a PR with the change.
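
One ambiguity that could arise, as an assumption about what is meant here, is a name clash: without a prefix, an attribute and a child element with the same name map to the same field name. The standalone Java sketch below uses only the JDK's DOM parser to list the field names an element would produce under a given prefix and to report duplicates. It is not spark-xml's logic, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class PrefixAmbiguity {

    // Parses an XML string and returns its root element.
    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Field names for one element: prefixed attribute names plus child element names.
    static List<String> fieldNames(Element element, String attributePrefix) {
        List<String> names = new ArrayList<>();
        NamedNodeMap attrs = element.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            names.add(attributePrefix + attrs.item(i).getNodeName());
        }
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.ELEMENT_NODE) {
                names.add(child.getNodeName());
            }
        }
        return names;
    }

    // Names that occur more than once, in sorted order.
    static Set<String> collisions(List<String> names) {
        Set<String> seen = new HashSet<>();
        Set<String> duplicates = new TreeSet<>();
        for (String name : names) {
            if (!seen.add(name)) {
                duplicates.add(name);
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // A hypothetical row with both an "id" attribute and an "id" child element.
        Element book = parse("<book id=\"100\"><id>abc</id></book>");
        System.out.println(collisions(fieldNames(book, "_"))); // []
        System.out.println(collisions(fieldNames(book, "")));  // [id]
    }
}
```

Input like the example in this issue, where attribute and element names never overlap, would not hit this clash.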