databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0
500 stars 226 forks source link

Allow empty strings for attributePrefix when reading XML #570

Closed dhimmel closed 2 years ago

dhimmel commented 2 years ago

Setting .option("attributePrefix", "") when reading XML causes the following error:

IllegalArgumentException: requirement failed: 'attributePrefix' option should not be empty string.

This check was added in https://github.com/databricks/spark-xml/pull/170 in response to https://github.com/databricks/spark-xml/issues/153 with the following comment:

Since those options are used to differentiate attributes, attributePrefix should not be just a empty string

I'm not sure I understand what is bad about about having an empty string for attributePrefix. For example, I'm reading XML where I know that none of the attribute names collide with element names. And providing a prefix for attributes doesn't absolutely prevent collisions with elements, right?

srowen commented 2 years ago

Hm. Maybe. It would make it impossible to distinguish when writing rather than reading, so XML wouldn't round trip correctly. But you're just reading. You can easily rename the cols in a one liner to remove the prefix. I'm not against it as it would have to be an explicit choice just not sure if it causes something else to break...

dhimmel commented 2 years ago

It would make it impossible to distinguish when writing rather than reading, so XML wouldn't round trip correctly

Yeah, but there'd still be the attributePrefix can't be empty error upon writing, so it would be obvious that it's not roundtrippable. For my application, I'm solely interested in the getting the data out of XML!

You can easily rename the cols in a one liner to remove the prefix.

I don't think it's that easy with deeply nested structs as per these posts.

srowen commented 2 years ago

Fair point about structs and being able to control this differently on write. I'm OK with allowing empty on read. Do you have a PR? I can try it next week otherwise

srowen commented 2 years ago

Resolved in #571

dhimmel commented 2 years ago

Thanks @srowen for https://github.com/databricks/spark-xml/pull/571. Sorry I didn't get around to responding to your question about submitting a PR. Turns out I don't really know Scala, so it wasn't the easiest for me.

Your help is much appreciated!