databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Extract multiple tables from the same XML file #668

Closed vwiencek closed 10 months ago

vwiencek commented 10 months ago

Hello,

I would like to parse the following XML file and obtain two distinct Spark DataFrames, one per table, but I'm having trouble reading it:

<Root>
  <Report>
      <Table1>
          <Obs field1="a" field2="b"></Obs>
          <Obs field1="a" field2="b"></Obs>
          <Obs field1="a" field2="b"></Obs>
          <Obs field1="a" field2="b"></Obs>
      </Table1>
      <Table2>
          <Obs field3="a" field4="b"></Obs>
          <Obs field3="a" field4="b"></Obs>
          <Obs field3="a" field4="b"></Obs>
          <Obs field3="a" field4="b"></Obs>
      </Table2>
  </Report>
</Root>

My code currently looks like this:
  val table1 = session.read
    .format("com.databricks.spark.xml").option("rootTag","Table1").option("rowTag","Obs")
    .load("file.xml")

  val table2 = session.read
    .format("com.databricks.spark.xml").option("rootTag","Table2").option("rowTag","Obs")
    .load("file.xml")


But instead of 2 distinct tables, each with 4 records and 2 columns, I get the same table each time, with 8 records and 4 columns, as if the reader were not restricting itself to the rows under the root tag.

srowen commented 10 months ago

rootTag does nothing here. You want to set rowTag to Table1 in one case, and Table2 in the other. You then get an array-valued Obs column. You can also set rowTag to Root and get both at once.
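
A minimal sketch of that suggestion, assuming the sample file from the question with its closing tags balanced. The `ExtractTables` object and the `readTable` helper are hypothetical names used for illustration only. The sketch relies on spark-xml's default attribute prefix (`_`) and on the `Obs` column being inferred as an array, which should hold when a table contains more than one `Obs` element.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExtractTables {
  // Sample document from the question, with the closing tags balanced.
  val sampleXml: String =
    """<Root>
      |  <Report>
      |    <Table1>
      |      <Obs field1="a" field2="b"></Obs>
      |      <Obs field1="a" field2="b"></Obs>
      |      <Obs field1="a" field2="b"></Obs>
      |      <Obs field1="a" field2="b"></Obs>
      |    </Table1>
      |    <Table2>
      |      <Obs field3="a" field4="b"></Obs>
      |      <Obs field3="a" field4="b"></Obs>
      |      <Obs field3="a" field4="b"></Obs>
      |      <Obs field3="a" field4="b"></Obs>
      |    </Table2>
      |  </Report>
      |</Root>""".stripMargin

  // Hypothetical helper: read each <tableTag> element as one row, then
  // explode its array-valued Obs column into one row per <Obs> element.
  def readTable(spark: SparkSession, path: String, tableTag: String,
                fields: Seq[String]): DataFrame = {
    val obs = spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", tableTag) // rootTag is not needed for reading here
      .load(path)
      .select(explode(col("Obs")).as("obs"))
    // XML attributes are exposed with the attribute prefix ("_" by default),
    // so field1 is read from obs._field1 and renamed back to field1.
    obs.select(fields.map(f => col(s"obs._$f").as(f)): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("extract-tables")
      .getOrCreate()

    // Write the sample to a temporary file so the sketch is self-contained.
    val path = Files.createTempFile("tables", ".xml")
    Files.write(path, sampleXml.getBytes("UTF-8"))

    val table1 = readTable(spark, path.toUri.toString, "Table1", Seq("field1", "field2"))
    val table2 = readTable(spark, path.toUri.toString, "Table2", Seq("field3", "field4"))

    table1.show()
    table2.show()
    spark.stop()
  }
}
```

If a table held only a single `Obs` element, the column would likely be inferred as a struct rather than an array and `explode` would fail, so supplying an explicit schema may be safer for real data. Reading once with rowTag set to Root should instead give a single row whose nested Report struct holds both tables, from which each one can be selected and exploded in the same way.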