databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Splitting XML into single-column rows #672

Closed rjrudin closed 8 months ago

rjrudin commented 9 months ago

I'm wondering if the equivalent of the wholetext feature in Spark's text support could be made available when reading XML files. The use case would be:

  1. I have a large XML file with N child elements, where each child element should become a row.
  2. I want a single-column row with a "value" column that contains the unmodified XML for each child element.
  3. My writer can then do XML-ish things on the XML element in each row.

Basically, I'm looking to take advantage of spark-xml's splitting capabilities, but without converting each split element into a structured row.
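For reference, here's roughly how the wholetext behavior I mentioned looks with the built-in text source (just a sketch; the SparkSession and path are stand-ins from my test setup):

Dataset<Row> wholeFiles = spark.read()
        .format("text")
        .option("wholetext", "true")  // one row per file instead of one row per line
        .load("src/test/resources/aggxml/*.xml");  // example path

// Each row has a single "value" column holding the entire file's contents;
// what I'd like is the same shape, but one row per split child element.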

srowen commented 9 months ago

You can always read the XML files as text already, and then parse them by hand however you like. But you can also declare any parent element's schema to be type 'string', and then you'll get the XML body as a string value rather than having it parsed.

rjrudin commented 9 months ago

Thanks @srowen - I verified I could use "text" and get the contents of the file as-is. However, I'd like to do the splitting while reading the file, as spark-xml does today.

Thanks for the suggestion too, though I'm having difficulty applying what you said and I'm likely doing something wrong. I have this as a simple test file:

<parent>
  <child>
    <text>first</text>
    <age>10</age>
  </child>
  <child>
    <text>second</text>
    <age>20</age>
  </child>
</parent>

And a simple test program:

Dataset<Row> dataset = newSparkSession().read()
            .format("xml")
            .schema(new StructType().add("child", DataTypes.StringType))
            .option("path", "src/test/resources/aggxml/*.xml")
            .option("rowTag", "child")
            .load();

System.out.println(dataset.schema());
dataset.collectAsList().forEach(row -> System.out.println(row.prettyJson()));

I'm wondering if by "schema" you meant the Spark schema or an XML schema? What you describe - getting the XML body as a string - is exactly what I want, but I'm trying to do that for each "child" in the test file such that I'd have a row with an e.g. "value" column with a value of <child><text>first</text><age>10</age></child>.

srowen commented 9 months ago

If you want "child", then rowTag has to be "parent", I believe. Here, the type would be an array of strings, since there are multiple child elements.

rjrudin commented 9 months ago

Unfortunately no luck so far - I tried:

Dataset<Row> dataset = newSparkSession().read()
            .format("xml")
            .schema(new StructType().add("child", new ArrayType(DataTypes.StringType, false)))
            .option("path", "src/test/resources/aggxml/*.xml")
            .option("rowTag", "parent")
            .load();

But that results in the following:

{
  "child" : [ "\n    <text>first</text>\n    <age>10</age>\n  ", "\n    <text>second</text>\n    <age>20</age>\n  " ]
}
{
  "child" : null
}

I tried replacing "child" with "parent" in the schema definition, but that results in:

{
  "parent" : null
}
{
  "parent" : null
}

Any other ideas? I'm basically looking for an optimization over using the existing "text" data source and then applying StAX to the in-memory string contents of an XML file. It seems a lot cleaner to me - and presumably at least slightly more efficient - to read an XML file with N elements in it and produce a dataset containing N rows, with each row having a single column of type string.

srowen commented 9 months ago

That works correctly for me and yields:

+---------------------------------------------------------------------------------------------------+
|child                                                                                              |
+---------------------------------------------------------------------------------------------------+
|[\n    <text>first</text>\n    <age>10</age>\n  , \n    <text>second</text>\n    <age>20</age>\n  ]|
+---------------------------------------------------------------------------------------------------+

Two array elements with the body of the child tags. I'm not exactly sure what you're printing there. Make sure you're using the latest version, I guess; that's what I'm using.

rjrudin commented 8 months ago

Closing, as what I really want to do is have each child element become its own row.

My team is looking at how spark-xml is implemented, as we are trying to do something similar: using StAX to process an XML file and emit a Spark row for each user-defined child element. If you happen to know, is this the code from XmlRelation that does all the magic allowing local/S3/etc. paths to be accessed easily?

    val filesystemPath = location match {
      case Some(p) => new Path(p)
      case None =>
        throw new IOException(s"Cannot INSERT into table with no path defined")
    }

    val fs = filesystemPath.getFileSystem(sqlContext.sparkContext.hadoopConfiguration)

We basically want to reuse Spark's support for reading files from various filesystems, and then, for each file, run it through our own XML splitting code to generate a row per child element.
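For context, this is the rough shape of what we have in mind (a sketch only; the path and SparkSession variable are examples, and the StAX loop is elided):

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hadoop's FileSystem abstraction resolves local, HDFS, S3, etc. paths,
// which is the reuse we're after.
Path path = new Path("s3a://some-bucket/big.xml");  // example path
Configuration conf = sparkSession.sparkContext().hadoopConfiguration();
FileSystem fs = path.getFileSystem(conf);
try (FSDataInputStream in = fs.open(path)) {
    XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
    // ... advance the reader to each user-defined child element and emit a row ...
}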

srowen commented 8 months ago

If you just want to turn child, which is an array of N strings of XML, into N rows, then just use the explode function on child.
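Something like this (untested sketch, reusing the dataset from your earlier snippet):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.explode;

// One output row per array element, in a single string column named "value".
Dataset<Row> rows = dataset.select(explode(col("child")).as("value"));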

rjrudin commented 8 months ago

Thanks @srowen - sorry to bother you on this ticket, but I'm hoping to ask one last question: do you know of a way to extract the child elements such that the array of N strings includes the enclosing <child> and </child> tags? I believe there's a way I can prepend/append those strings - possibly as part of the explode - but I'm curious whether it's possible via the rowTag and schema configuration.

Also, is there anything to be concerned about in terms of performance here for something like a 1 GB file with 100k matching elements in it? I'm not experienced enough with Spark yet to know the costs associated with explode - i.e., whether Spark has to first construct an array with 100k strings in it and then do work to create 100k rows from that array. I'm just not sure whether that's considered a scalable / efficient approach.

srowen commented 8 months ago

No, that is not part of the 'value' of the element, but yes, you can just put the tags back on for your purposes: explode and then concat the values onto the resulting column, sure. I don't think it's a particularly expensive operation here, no.
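Roughly (untested sketch):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.explode;
import static org.apache.spark.sql.functions.lit;

// Explode to one row per child, then wrap the tags back around each value.
Dataset<Row> rows = dataset
        .select(explode(col("child")).as("value"))
        .select(concat(lit("<child>"), col("value"), lit("</child>")).as("value"));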