databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Feature Request - flatten nested xml and use fully qualified element name as column name #444

Closed mfawcett closed 4 years ago

mfawcett commented 4 years ago

Input XML:

<root>
  <one>
    <two>two</two>
  </one>
  <a>
    <two>2</two>
  </a>
  <b>
    <c>
      <two>dos</two>
    </c>
  </b>
</root>

Output dataset:

| root_one_two | root_a_two | root_b_c_two |
| --- | --- | --- |
| two | 2 | dos |

srowen commented 4 years ago

You could already do this programmatically by inspecting the schema and selecting one nested column under a new name. I'm not sure how generally useful this would be?
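For illustration, the schema walk srowen alludes to can be sketched without Spark at all: recurse through a nested structure and emit one value per leaf, keyed by its fully qualified path. This is a toy sketch only (the `Flatten` name and the `Map`-based input are assumptions for illustration); against a real DataFrame you would walk the `StructType` and build `col("one.two").alias("one_two")`-style selections instead.

```scala
// Library-free sketch of the flattening idea: recurse through a nested
// structure and emit one "column" per leaf value, named by its fully
// qualified path with components joined by underscores.
object Flatten {
  def flatten(node: Any, path: String = ""): Map[String, String] = node match {
    case m: Map[_, _] =>
      m.asInstanceOf[Map[String, Any]].flatMap { case (key, value) =>
        flatten(value, if (path.isEmpty) key else s"${path}_$key")
      }
    case leaf => Map(path -> leaf.toString)
  }
}
```

Applied to a `Map` mirroring the input XML above, this yields `Map("root_one_two" -> "two", "root_a_two" -> "2", "root_b_c_two" -> "dos")`.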

mfawcett commented 4 years ago

This would be for the general case where you don't know the schema ahead of time and it might vary between rows/files.

srowen commented 4 years ago

Hm, how is that related? You'd be inferring the schema in that case, but this seems to be about modifying the natural schema into something flatter.

mfawcett commented 4 years ago

Apologies, I guess I'm not following you. Are you saying I could do this programmatically without spark-xml, or is there already a way with spark-xml to do this?

srowen commented 4 years ago

I mean something like `.select($"foo.bar".alias("foo_bar"))`; that works, right?

mfawcett commented 4 years ago

Is that possible via the Java API, or just the Scala API using `schema_of_xml`/`from_xml`?

When I use `new XmlReader().xmlDataset(context.sparkSession(), xml)` I end up with

| one | a |
| --- | --- |
| `<two>two</two>` | `<two>2</two>` |

srowen commented 4 years ago

What I'm suggesting is just a function of Spark, not this library, so works in any of its languages. It does not require specifying the schema. Are you specifying the schema? Otherwise I am not sure how you get that result, where the columns have XML.

mfawcett commented 4 years ago

I am not specifying the schema. Here is the full code:

```java
final Dataset<String> xml = data.select("XML")
        .map((Row row) -> row.getString(0), Encoders.STRING());

Dataset<Row> dataset = new XmlReader().xmlDataset(context.sparkSession(), xml);
```
Input:

| date | XML |
| --- | --- |
| 1/1/2020 | `<root><child><nested>value</nested></child></root>` |

Output:

| child |
| --- |
| `<nested>value</nested>` |

srowen commented 4 years ago

Hm, I'm actually not sure how that can work, as you haven't set the `rowTag` or `rootTag`, which default to `ROW` and `ROWS`. Are you sure that isn't set, and/or can you try setting them appropriately to see if it changes the result? The output here puzzles me; it should be a nested struct, not leave the XML unparsed, in any event. `child` isn't somehow a string field that contains escaped XML or something, right?

mfawcett commented 4 years ago

I don't see how to set the root tag using the Java API. Setting the row tag to "root" didn't have any effect on the output, nor did setting it to "child", although in this scenario it seems like "root" would be more correct since I don't know the schema ahead of time.

srowen commented 4 years ago

Ah yeah, there is no method for it. But I'm mistaken anyway: `rootTag` won't matter here (it only affects writing). `rowTag` should be 'root' or something, yeah.

And I actually also forgot that `rowTag` won't matter here either, as you are passing the XML that constitutes each row directly.

But I don't get the result you do. When I run that I get a value and schema of...

+-------+
|  child|
+-------+
|[value]|
+-------+

root
 |-- child: struct (nullable = true)
 |    |-- nested: string (nullable = true)

That looks correct. Are you definitely running it just like that on the data `<root><child><nested>value</nested></child></root>`?

mfawcett commented 4 years ago

Yes, however there is a second row that doesn't match that schema, like:

`<root><child>value</child></root>`
srowen commented 4 years ago

Oh, that makes sense then. The only schema that works for both is `child` as string. But is this connected to your original question about flattening? Here there is no `nested` to flatten.

mfawcett commented 4 years ago

I should have put two rows of input XML in my OP.

Input:

| row | XML |
| --- | --- |
| 1 | `<root><child>value</child></root>` |
| 2 | `<root><child><nested>value</nested></child></root>` |

Output could be flattened using fully qualified element name as the column name, or it could be a single column "root" with struct type with null values for missing elements. Either way would be okay with me.

srowen commented 4 years ago

Hm, the only thing I think might work is to manually specify the schema and make `child` a struct with an optional `_VALUE` string field and an optional `nested` field.
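In `printSchema` form (the same notation as the output earlier in the thread), the hand-specified schema srowen describes might look like the following sketch. For the two example rows, `nested` only ever holds text, so it can be a plain string field rather than a struct:

```
root
 |-- child: struct (nullable = true)
 |    |-- _VALUE: string (nullable = true)
 |    |-- nested: string (nullable = true)
```

With such a schema, row 1's text would land in `child._VALUE` and row 2's in `child.nested`, with the other field null in each row.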

mfawcett commented 4 years ago

We're not going to know the schema ahead of time. What if a new column was made per element with a value, using the element's fully qualified path as the column name?

srowen commented 4 years ago

I don't think the naming is the issue. The library has to make sense of the schema here if asked to infer it automatically, and the default is to resolve this as one column containing a string. Resolving it as two columns, regardless of their names, is different, and you'd have to opt into that by controlling the schema.

mfawcett commented 4 years ago

What if you exposed these Scala APIs to Java?

```scala
val payloadSchema = schema_of_xml(df.select("payload").as[String])
val parsed = df.withColumn("parsed", from_xml($"payload", payloadSchema))
```

Then I could specify the schema using the inferred one.

srowen commented 4 years ago

You can already call these from Java; it's all in the JVM. You don't need these particular functions, you just need to specify the schema.

mfawcett commented 4 years ago

Do you have an example of using `schema_of_xml` from Java with 0.9.0? I am unable to figure out how to import it. All of the classes import correctly, but `schema_of_xml` appears inaccessible.

Also, the whole point of attempting this route was because we are unable to specify the schema.

mfawcett commented 4 years ago

I was able to access them using `package$.MODULE$`, because the compiled code ends up with a `public final class package`, which we obviously can't import from Java.

srowen commented 4 years ago

Yes that's right. You'll find sometimes you have to look in a synthetic subclass that Scala generates to access some things. package objects are one of them.

mfawcett commented 4 years ago

So after messing around with those functions as well, it looks like there isn't any way to get the desired behavior. Can this be a feature request?

Input:

| row | XML |
| --- | --- |
| 1 | `<root><child>value1</child></root>` |
| 2 | `<root><child><nested>value2</nested></child></root>` |

Output:

| root_child | root_child_nested |
| --- | --- |
| value1 | value2 |

The alternative would be to make schema_of_xml return the superset of the inferred schemas of the column values.

srowen commented 4 years ago

That's already how it works. The problem is that the 'superset' type is inherently ambiguous, because XML doesn't necessarily follow a tabular schema. The superset of `string` and nested XML is `string`, with the XML left as a string. That's a fine answer. It's not the only answer, and not the one you're looking for here, which is a struct with an optional string field and an optional struct field.

The problem is that these aren't even the only two possibilities, so a 'feature' that selected one or the other behavior doesn't really help the general issue.
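The ambiguity can be made concrete with a toy model of the type-merge step in schema inference. This is a hedged sketch, not spark-xml's actual implementation: structs merge field-by-field, and any mismatch, such as struct vs. string, widens to plain string, which is why the two rows in this thread infer `child` as a string column.

```scala
// Toy model of schema-inference type merging (an illustrative sketch,
// not spark-xml's real code). Two structs merge field-by-field; a struct
// merged with a string (or any other mismatch) widens to string.
sealed trait Inferred
case object Str extends Inferred
final case class Struct(fields: Map[String, Inferred]) extends Inferred

def merge(a: Inferred, b: Inferred): Inferred = (a, b) match {
  case (Struct(x), Struct(y)) =>
    Struct((x.keySet ++ y.keySet).map { k =>
      k -> ((x.get(k), y.get(k)) match {
        case (Some(l), Some(r)) => merge(l, r)
        case (Some(l), None)    => l
        case (None, Some(r))    => r
        case (None, None)       => Str // unreachable: k came from x or y
      })
    }.toMap)
  case _ => Str // struct vs. string, or two leaves: widen to string
}
```

Row 1 infers `child` as `Str`, row 2 as `Struct(Map("nested" -> Str))`; merging the two yields `Str`, i.e. the unparsed-XML-as-string column observed above.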

The good news is that the library should be able to make sense of several valid schemas for the same data, hence I'm hoping that it's a matter of specifying the desired schema. The above is only a question of what schema is automatically inferred, not the one schema that can be applied.

I don't know that for certain; I can try a simple experiment later to see whether it behaves that way in this case, i.e. that this is doable with a custom schema. If it is, I'd probably say that's how it should be done. If not, then maybe this can be a feature request.