databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Error using from_xml with StructType for schema #686

Closed ianepreston closed 3 months ago

ianepreston commented 3 months ago

I've got a DataFrame with a column containing strings of XML-formatted data. There's quite a bit of variability in the schema of any individual record, but I have an XSD file that should be valid for all of them. I'm trying to use the StructType that I parsed out of the XSD as the schema argument in from_xml, but I'm getting an error. The actual schema I'm using is huge, but I can reproduce the issue with the smaller example below:

First I make a sample dataframe that looks kind of like what I'm actually trying to work with in that it has a string column containing XML:

b1xml = """
<book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
</book>
"""
b2xml = """
<book id="bk104">
    <author>Corets, Eva</author>
    <title>Oberon's Legacy</title>
</book>
"""
bookslist = [
    ("bk103", b1xml),
    ("bk104", b2xml),
]

raw_df = spark.createDataFrame(bookslist, ["bookid", "bookxmlstr"])
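As a quick sanity check that the sample XML is well-formed and carries the fields a parsed schema should surface, it can be read with the Python standard library alone (this is just an illustration independent of Spark, not part of the repro):

```python
import xml.etree.ElementTree as ET

b1xml = """
<book id="bk103">
      <author>Corets, Eva</author>
      <title>Maeve Ascendant</title>
</book>
"""

root = ET.fromstring(b1xml.strip())

# The "id" attribute is what spark-xml exposes as the "_id" field.
book = {
    "_id": root.get("id"),
    "author": root.findtext("author"),
    "title": root.findtext("title"),
}
print(book)
```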

Next I make sure I can parse it by inferring the schema from a specific record:

from pyspark.sql.functions import from_xml, schema_of_xml

strparsed_df = (
    raw_df
    .withColumn("parsedxml", from_xml(raw_df.bookxmlstr, schema_of_xml(b1xml)))
)
strparsed_df.show()

This works as expected.

Finally, I try to pass a StructType as the schema parameter for from_xml:

from pyspark.sql.types import StructType, StructField, StringType

bookschema = StructType([
    StructField("_id", StringType(), True),
    StructField("author", StringType(), True),
    StructField("title", StringType(), True),
])

parsed_df = (
    raw_df
    .withColumn("parsedxml", from_xml(raw_df.bookxmlstr, bookschema))
)
parsed_df.show()

This fails with the following error:

AnalysisException: [PARSE_SYNTAX_ERROR] Syntax error at or near '{'. SQLSTATE: 42601

The StructType exactly matches what I see if I look at the schema of strparsed_df so I don't think it's an issue with how I'm describing the struct.

I've tried this on Databricks Runtime 14.3, 15.1, and 15.2 and get the same result on all of them.

ianepreston commented 3 months ago

Update: this only seems to happen on Databricks clusters using shared access mode. If I create a cluster in single-user access mode, I do not encounter the error.

srowen commented 3 months ago

If you're on Databricks, then spark-xml is already integrated into Spark there, and you do not need to install a library. That could in fact be the problem, if that's what you're doing: shared access mode is not going to allow libraries to call JVM code unless they're whitelisted.

In any event, this wouldn't be an issue with this library, but something related to Databricks and the port of this code into Spark.