databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0
499 stars 226 forks source link

Reader can't read XML file if the rootTag and rowTag are the same #645

Closed irajhedayati closed 1 year ago

irajhedayati commented 1 year ago

Sample file

<?xml version="1.0" encoding="UTF-8"?>
<item>
    <item>
        <id>848793049376</id>
        <item_group_id>3409927602276</item_group_id>
    </item>
    <item>
        <id>190738186012</id>
        <item_group_id>3409927602276</item_group_id>
    </item>
</item>

Code to read:

spark
      .read
      .option("mode", "FAILFAST")
      .option("inferSchema", false)
      .option("rootTag", "item")
      .option("rowTag", "item")
      .xml("test-same-root-row.xml")

Output:

struct<item:array<struct<id:string,item_group_id:string>>>
[ArrayBuffer([848793049376,3409927602276], [190738186012,3409927602276])]

So, it sees the root item tag as the row tag and parses one row where it has a single field of item, which is an array.

Expected:

struct<id:string,item_group_id:string>
[3409927602276,3409927602276]

this file is actually two records contained in the rootTag of item.

Having the same name as rootTag and rowTag seems to be a problem

srowen commented 1 year ago

rootTag isn't used for reading, only writing. But yes that is your error, that you have two different elements with the same name, and rowTag of 'item' is ambiguous.

irajhedayati commented 1 year ago

I see; thanks for the response!

Wouldn't it also be helpful to leverage rootTag in reading? The file may have a structure with different ROWs in different root tags. The same mindset that introduced rootTag in writing can also apply to reading.

Say that if I write a set of book records in an XML file using spark-xml, it has the following format because I asked for it

<books>
  <book></book>
  <book></book>
</books>

But I can't provide the root tag when I'm reading the same file using the same library.

srowen commented 1 year ago

It's not impossible, just more complicated to support (the first phase is to detect the relevant row elements without parsing XML for efficiency, so, this gets hard to handle, looking for nested tags). In contrast I think it's pretty unusual to have the same tags used at different levels to mean different things

irajhedayati commented 1 year ago

I agree with the unusual part. The client is sending this file to us, and we don't have control over it. It may get very complex when parsing a file. I appreciate you elaborating on why this feature is not implemented. However, I see that there are different use cases that include rootTag in reading.

e.g.

do you think that might come into the roadmap any time soon?

srowen commented 1 year ago

I don't think those are actually use cases for rootTag in reading.

I don't think rootTag is the right idea here anyway; imagine you have <item> somewhere besides the root. Or it's triply-nested. To actually resolve this you'd need some path syntax or something and a new tag.

No roadmap here, I don't think anyone would work on it anyway.

You can accept that it will read the parent <item> tag, which means it'll read a lot more into the resulting schema than you want, but, if it's not huge then maybe that's fine, you take the overhead and ignore the rest and pull out the actual item array field you care about.

Is it feasible to transform the XML to remove this ambiguity before parsing?

irajhedayati commented 1 year ago

Thanks for explaining. I went with a hack 😆

    if (conf.fileFormatAttributes.topRoot == conf.fileFormatAttributes.rowRoot) {
      original
        .select(explode(col(conf.fileFormatAttributes.rowRoot)))
        .select(col("col.*"))
    } else
      original