databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

How to pass a Dataset[String] to spark xml #527

Closed omjcnkn closed 3 years ago

omjcnkn commented 3 years ago

We're trying to parse an XML column from a table using the latest version of spark-xml, so we extract that column into a Dataset[String] and pass it to the spark.read.xml() method. However, the options we pass don't seem to be used when the input is a Dataset[String]: they are ignored and the reader falls back to the defaults.

Here's how we convert the extracted column to a Dataset[String]:

val xmlCol = srcDf.select("XMLRECORD").as[String]

Here's how we call the spark-xml read() method:

val df = spark.read
      .option("inferSchema", "false")
      .option("rowTag", "row")
      .option("columnNameOfCorruptRecord", "bad_record")
      .schema(schema)
      .xml(xmlCol)
srowen commented 3 years ago

@HyukjinKwon sorry to bother but could you check my logic?

I'll admit, I didn't know this worked! There is an implicit XmlDataFrameReader that basically causes this to call new XmlReader().xmlDataset(spark, xmlDataset). I think that's actually problematic, because it indeed passes no options through. There isn't a way to pull the options out of Spark's DataFrameReader to pass along, not that I can see.

As a result, I wonder if we should deprecate this implicit, as it's confusing.

You can, however, just use new XmlReader().xmlDataset(spark, xmlDataset) directly. XmlReader exposes a number of withX setters for setting options. But they're not exhaustive there either, and they're somewhat redundant with simply passing a Map of options. (They do cover the options you're trying to set, though.) But there's no way to pass that Map!
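As a sketch of that direct route (assuming the withRowTag, withSchema, and withColumnNameOfCorruptRecord setters are available in the spark-xml version in use, and reusing the spark, schema, and xmlCol values from the question):

```scala
import com.databricks.spark.xml.XmlReader

// Direct use of XmlReader, bypassing spark.read -- a sketch assuming the
// setters below exist in the spark-xml version in use.
val df = new XmlReader()
  .withRowTag("row")
  .withSchema(schema)                           // schema from the question
  .withColumnNameOfCorruptRecord("bad_record")
  .xmlDataset(spark, xmlCol)                    // xmlCol: Dataset[String]
```

Note this sidesteps the DataFrameReader options entirely, so nothing is silently dropped.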

So I also wonder if we should deprecate the withX methods and add a Map argument to the constructor for options.

Note that each XMLRECORD value would have to be an individual "row" XML string, not a line of an XML doc or an entire XML doc, for this to work. That may already be what you have, so you may have a way forward right now.
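To illustrate the required shape (with hypothetical data, not from the original question): each element of the Dataset[String] should be one complete, self-contained row element, such as:

```scala
// Hypothetical sample of what each XMLRECORD value must look like:
// one self-contained <row> element per string.
val records = Seq(
  "<row><id>1</id><name>alpha</name></row>",
  "<row><id>2</id><name>beta</name></row>"
)
// Each string parses as exactly one row; none is a fragment of a larger document.
records.foreach(r => println(r))
```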

But I think we need to clean up this path.

omjcnkn commented 3 years ago

Thank you so much @srowen, your efforts are highly appreciated :)

> You can however just use new XmlReader().xmlDataset(spark, xmlDataset) directly. XmlReader exposes a number of withX setters to set options. But there again they're not exhaustive, and kind of redundant with just passing a Map of options. (They do however cover the options you're trying to set). But there's no way to pass that Map!

Yes, we tried that, but the problem is we wanted to pass a dateFormat as well, and there was no way of doing that with the current implementation. I agree with you that passing a Map makes a lot more sense :)

> Note that XMLRECORD would have to be the individual "row" XML strings, not lines of an XML doc or entire XML docs, for this work. That may already be what you have, so, you may have a way forward right now.

Yup, we'll stick with reading via XmlReader for now, then update once the new version is released :)

Once again, thank you so much for your efforts :)