databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0
499 stars 226 forks source link

Fix for xml expression to not parse arbitrary strings #679

Closed xanderbailey closed 4 months ago

xanderbailey commented 4 months ago

Previously it was the case that we would parse the string of the column as the column expression argument for the expression. This leads to being able to execute arbitrary spark SQL if you parse the expression a string literal. It also means that string literals don't work with this expression and it doesn't work within an array transform expression either.

srowen commented 4 months ago

This project is now incorporated into Apache Spark. Could you open the pull request there? if it's accepted, I will back-port it here just in case.

Can you give an example of what no longer works (and should not), and/or what didn't work before and does now?

xanderbailey commented 4 months ago

Thanks for the quick response @srowen! Looks like it's being added to spark 4.0 here https://github.com/apache/spark/blame/29d077fbbd5464f64e0eeb495f7a955850915cc5/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L7146? Reading this path it seems like it isn't using the same constructor as this library so I think we should be good to just back port this here?

I can't think of an example of something that doesn't work now that would have worked before...

Cases that previously didn't work and do now:


from_xml(new Literal("<root><name>foo</name></root>"), schema) <- string literal is read by the SQL parser as sql which is clearly isn't.
functions.transform(new Column("array"), (element, _index) -> from_xml(element, schema)) <- the variable reference is removed here when you use the SQL parser so this throws when planning.
xanderbailey commented 4 months ago

@srowen is there a way for us to tag a new version for this fix?

srowen commented 4 months ago

Possibly, but does this block anything? it seems like the issues it avoids are things the caller can just not do, or do I misunderstand

xanderbailey commented 4 months ago

I got around this by under the underlying spark expression rather than using functions so yes there's a work around but it's causing unexpected errors at the moment when using string literals and when used within an array transform.

srowen commented 4 months ago

I just released 0.18.0 with this change: https://github.com/databricks/spark-xml/releases/tag/v0.18.0