databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Allow xpath for rowTag #618

Closed: singlewind closed 1 year ago

singlewind commented 1 year ago

We are trying to read elements whose tag name may also appear on a nested element. Below is a sample.

<books>
  <book>
    <id>1</id>
    <name>book 1</name>
  </book>
  <book>
    <id>2</id>
    <name>book 2</name>
    <ref>
       <book>1</book>
    </ref>
  </book>
</books>

When we set rowTag to book, it fetches 3 rows instead of the 2 we expect. Is there a solution for this?
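
To illustrate where a count of 3 could come from, here is a minimal sketch using Python's standard xml.etree.ElementTree. It is not spark-xml and says nothing about how spark-xml matches rowTag; the names SAMPLE, all_books and top_level_books are made up for this illustration. Matching every element named book at any depth finds three elements, while matching only the direct children of books finds two.

```python
import xml.etree.ElementTree as ET

# The sample document from this issue.
SAMPLE = """<books>
  <book>
    <id>1</id>
    <name>book 1</name>
  </book>
  <book>
    <id>2</id>
    <name>book 2</name>
    <ref>
       <book>1</book>
    </ref>
  </book>
</books>"""

root = ET.fromstring(SAMPLE)

# Every <book> element at any depth, including the one nested under <ref>.
all_books = list(root.iter("book"))

# Only the <book> elements that are direct children of <books>.
top_level_books = root.findall("book")

print(len(all_books), len(top_level_books))
```

The second count is the one we expect as rows.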

srowen commented 1 year ago

That shouldn't be a problem. The nested book becomes a field inside the ref struct column, not a row of its own. I tried it just now and it works fine: I get two rows, and a schema like this:

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- ref: struct (nullable = true)
 |    |-- book: long (nullable = true)
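
As a rough picture of that shape, the sketch below builds one nested record per top-level book using only Python's standard library. It is an illustration under stated assumptions, not spark-xml's implementation: to_record is a hypothetical helper, and unlike the inferred schema above it does no type inference, so every value stays a string.

```python
import xml.etree.ElementTree as ET

# The sample document from this issue.
SAMPLE = """<books>
  <book>
    <id>1</id>
    <name>book 1</name>
  </book>
  <book>
    <id>2</id>
    <name>book 2</name>
    <ref>
       <book>1</book>
    </ref>
  </book>
</books>"""


def to_record(elem):
    """Hypothetical helper: leaf elements become their text,
    elements with children become a dict keyed by child tag."""
    children = list(elem)
    if not children:
        return elem.text
    return {child.tag: to_record(child) for child in children}


# One record per top-level <book>; the nested <book> lands inside "ref".
rows = [to_record(book) for book in ET.fromstring(SAMPLE).findall("book")]

for row in rows:
    print(row)
```

The first record has only id and name, which corresponds to ref being nullable in the schema.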

singlewind commented 1 year ago

@srowen, you are right. I may be able to use a schema to remove the unwanted element. I will try that. Thank you.

srowen commented 1 year ago

No, I'm saying it works as-is. I do not see the behavior you observe.