Closed: mfawcett closed this issue 4 years ago
You could already do this programmatically by inspecting the schema and selecting one nested column under a new name. I'm not sure how generally useful this would be?
This would be for the general case where you don't know the schema ahead of time and it might vary between rows/files.
Hm, how is that related? You'd be inferring the schema in that case, but this seems to be about modifying the natural schema into something flatter.
Apologies, I guess I'm not following you. Are you saying I could do this programmatically without spark-xml, or is there already a way with spark-xml to do this?
I mean something like `.select($"foo.bar".alias("foo_bar"))`; that works, right?
Is that possible via the Java API, or just the Scala API using `schema_of_xml`/`from_xml`?
When I use `new XmlReader().xmlDataset(context.sparkSession(), xml);` I end up with:

| one | a |
|---|---|
| `<two>two</two>` | |
| `<two>2</two>` | |
What I'm suggesting is just a function of Spark, not this library, so works in any of its languages. It does not require specifying the schema. Are you specifying the schema? Otherwise I am not sure how you get that result, where the columns have XML.
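For illustration, a minimal sketch of that select in the Java API (column names `foo`/`bar` are placeholders; `df` is assumed to have a nested `foo.bar` struct column):

```java
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Plain Spark, no spark-xml involved: dot notation reaches into a nested
// struct field, and alias() gives it a new flat top-level column name.
Dataset<Row> flattened = df.select(col("foo.bar").alias("foo_bar"));
```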
I am not specifying the schema. Here is the full code:
```java
final Dataset<String> xml = data.select("XML").map((Row row) -> {
    return row.getString(0);
}, Encoders.STRING());
Dataset<Row> dataset = new XmlReader().xmlDataset(context.sparkSession(), xml);
```
Input:

| date | XML |
|---|---|
| 1/1/2020 | `<root><child><nested>value</nested></child></root>` |

Output:

| child |
|---|
| `<nested>value</nested>` |
Hm, I'm actually not sure how that can work, as you haven't set the rowTag or rootTag, which default to ROW and ROWS. Are you sure that isn't set, and/or can you try setting them appropriately to see if it changes the result? The output here puzzles me; in any event it should be a nested struct, not leave the XML unparsed. `child` isn't somehow a string field that contains escaped XML or something, right?
I don't see how to set the root tag using the Java API. Setting the row tag to "root" didn't have any effect on the output, nor did setting it to "child", although in this scenario it seems like "root" would be more correct since I don't know the schema ahead of time.
Ah yeah there is no method for it. But I'm mistaken anyway, and root tag won't matter here (only affects writing). Row tag should be 'root' or something, yeah.
And I actually also forgot that row tag won't matter here either as you are passing the XML that constitutes each row directly.
But I don't get the result you do. When I run that I get a value and schema of...

```
+-------+
|  child|
+-------+
|[value]|
+-------+

root
 |-- child: struct (nullable = true)
 |    |-- nested: string (nullable = true)
```

That looks correct. Are you definitely running just like that on the data `<root><child><nested>value</nested></child></root>`?
Yes, however there is a second row that doesn't match that schema, like `<root><child>value</child></root>`.
Oh, that makes sense then. The only schema that works for both is `child` as string. But is this connected to your original question about flattening? Here there is no `nested` to flatten.
I should have put two rows of input XML in my OP.
Input:

| row | XML |
|---|---|
| 1 | `<root><child>value</child></root>` |
| 2 | `<root><child><nested>value</nested></child></root>` |
Output could be flattened using fully qualified element name as the column name, or it could be a single column "root" with struct type with null values for missing elements. Either way would be okay with me.
Hm, the only thing I think might work is to manually specify the schema and make `child` a struct with an optional `_VALUE` string field and an optional `nested` struct field.
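A Java sketch of that hand-specified schema (this assumes spark-xml's default value tag `_VALUE` and that `XmlReader` exposes a `withSchema` method; verify both against your 0.9.0 build):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import com.databricks.spark.xml.XmlReader;

// child may be bare text (captured by the _VALUE value tag) or contain a
// <nested> element; both fields are nullable so either row shape parses.
StructType childSchema = new StructType()
    .add("_VALUE", DataTypes.StringType, true)
    .add("nested", DataTypes.StringType, true);
StructType schema = new StructType()
    .add("child", childSchema, true);

// xml is the Dataset<String> of <root>...</root> payloads, as in the
// earlier snippet
Dataset<Row> parsed = new XmlReader()
    .withSchema(schema)
    .xmlDataset(context.sparkSession(), xml);
```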
We're not going to know the schema ahead of time. What if a new column was made per element with a value, using the element's fully qualified path as the column name?
I don't think the naming is the issue. It has to make sense out of the schema here if asked to do so automatically and the default is to resolve this as one column containing a string. Resolving as two cols, regardless of name, is different and you'd have to select that by controlling the schema.
What if you introduced the Scala APIs to Java?
```scala
val payloadSchema = schema_of_xml(df.select("payload").as[String])
val parsed = df.withColumn("parsed", from_xml($"payload", payloadSchema))
```

and I could specify the schema using the inferred schema.
You can already call these from Java; it's all in the JVM. You don't need these particular functions, you just need to specify the schema.
Do you have an example of using `schema_of_xml` from Java using 0.9.0? I am unable to figure out how to import it. All of the classes import correctly, but `schema_of_xml` appears inaccessible.
Also, the whole point of attempting this route was because we are unable to specify the schema.
I was able to access them using `package$.MODULE$`, because the compiled code ends up with a `public final class package`, which we obviously can't import.
Yes that's right. You'll find sometimes you have to look in a synthetic subclass that Scala generates to access some things. package objects are one of them.
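A hedged Java sketch of that access pattern (the exact member names and signatures in 0.9.0's package object are assumptions to check; the `package$.MODULE$` indirection is the point):

```java
import org.apache.spark.sql.types.StructType;

// Scala compiles a `package object` to a class literally named `package`,
// with its singleton instance in the static field MODULE$. Java cannot
// import a class named `package`, so it must be referenced fully qualified.
scala.collection.immutable.Map<String, String> noOptions =
    scala.collection.immutable.Map$.MODULE$.empty();

StructType inferred = com.databricks.spark.xml.package$.MODULE$
    .schema_of_xml(xml, noOptions); // xml: the Dataset<String> of payloads
```

Scala default arguments likewise surface in Java as synthetic `...$default$N()` methods, another place these generated names appear.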
So after messing around with those functions as well, it looks like there isn't any way to get the desired behavior. Can this be a feature request?
Input:

| row | XML |
|---|---|
| 1 | `<root><child>value1</child></root>` |
| 2 | `<root><child><nested>value2</nested></child></root>` |

Output:

| root_child | root_child_nested |
|---|---|
| value1 | |
| | value2 |
The alternative would be to make `schema_of_xml` return the superset of the inferred schemas of the column values.
That's already how it works. The problem is that the 'superset' type is inherently ambiguous, because XML doesn't necessarily follow a tabular schema. The superset of 'string' and 'nested XML' is 'string', with the XML as string. That's a fine answer. It's not the only answer and not the one you're looking for here, which is 'struct, with option string field, and optional struct field'.
The problem is that these aren't even the only two possibilities, so a 'feature' that selected one or the other behavior doesn't really help the general issue.
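To make the ambiguity concrete, the two candidate "superset" resolutions could be written as Spark DDL-style schema fragments (illustrative only):

```
-- resolution 1 (current behavior): keep the conflicting element as a string
child STRING

-- resolution 2 (requested here): widen to a struct with optional fields
child STRUCT<_VALUE: STRING, nested: STRING>
```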
The good news is that the library should be able to make sense of several valid schemas for the same data, hence I'm hoping that it's a matter of specifying the desired schema. The above is only a question of what schema is automatically inferred, not the one schema that can be applied.
I don't know that 100%, but I can try a simple experiment later to see if that is how it would behave in this case, i.e. that it's doable with a custom schema. If it is, I'd probably say that's how it should be. If not, then maybe.
Input XML:
Output dataset: