Closed omjcnkn closed 3 years ago
@HyukjinKwon sorry to bother but could you check my logic?
I'll admit, I didn't know this worked! There is an implicit `XmlDataFrameReader` that basically causes this to call `new XmlReader().xmlDataset(spark, xmlDataset)`. I think that's actually problematic because indeed it passes no options through. There isn't a way to pull the options from the Spark `DataFrameReader` to pass along, not that I can see.
As a result, I wonder if we should deprecate this implicit, as it's confusing.
You can, however, just use `new XmlReader().xmlDataset(spark, xmlDataset)` directly. `XmlReader` exposes a number of `withX` setters to set options. But there again they're not exhaustive, and kind of redundant with just passing a `Map` of options. (They do, however, cover the options you're trying to set.) But there's no way to pass that `Map`!
So I also wonder if we should deprecate the `withX` methods and add a `Map` argument to the constructor for options.
Note that `XMLRECORD` would have to be the individual "row" XML strings, not lines of an XML doc or entire XML docs, for this to work. That may already be what you have, so you may have a way forward right now.
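The workaround described above can be sketched roughly as follows. This is not the issue author's code; the `withRowTag` setter name and the `"record"` row tag are assumptions for illustration, so check the setters your spark-xml version actually exposes:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import com.databricks.spark.xml.XmlReader

val spark = SparkSession.builder().appName("xml-reader-sketch").getOrCreate()
import spark.implicits._

// Each element must be one complete "row" XML string, not a line of a
// larger document.
val xmlDataset: Dataset[String] =
  spark.table("src_table").select("XMLRECORD").as[String]

// Parse the strings directly with XmlReader, setting options through its
// withX setters rather than through DataFrameReader options (which the
// implicit path drops).
val parsed = new XmlReader()
  .withRowTag("record") // assumed setter and tag name; adjust to your data
  .xmlDataset(spark, xmlDataset)
```

Note that, as discussed below, the `withX` setters are not exhaustive, so options without a setter (such as `dateFormat`) cannot be passed this way.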
But I think we need to clean up this path.
Thank you so much @srowen, your efforts are highly appreciated :)
> You can however just use `new XmlReader().xmlDataset(spark, xmlDataset)` directly. `XmlReader` exposes a number of `withX` setters to set options. But there again they're not exhaustive, and kind of redundant with just passing a `Map` of options. (They do however cover the options you're trying to set). But there's no way to pass that `Map`!
Yes, we tried that, but the problem is we wanted to pass a `dateFormat` as well, and there was no way of doing that with the current implementation. I agree with you, passing a `Map` makes a lot more sense :)
> Note that `XMLRECORD` would have to be the individual "row" XML strings, not lines of an XML doc or entire XML docs, for this work. That may already be what you have, so, you may have a way forward right now.
Yup, we'll stick with reading using the `XmlReader` for now, then update once the new version gets released :)
Once again, thank you so much for your efforts :)
We're trying to parse an XML column from a table using the latest version of spark-xml, so we extract that column into a `Dataset[String]`. We pass the converted dataset to the `spark.read.xml()` method; however, our passed options don't seem to be used whenever we pass a `Dataset[String]`, meaning that it ignores our passed options and falls back to the default ones.

Here's how we convert the extracted column to a `Dataset[String]`:

```scala
val xmlCol = srcDf.select("XMLRECORD").as[String]
```
Here's how we call the spark-xml `read()` method:
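A minimal sketch of the call pattern being described, assuming the implicit `.xml(Dataset[String])` reader brought into scope by `import com.databricks.spark.xml._`; the option values here are illustrative, not the issue author's actual ones:

```scala
import org.apache.spark.sql.SparkSession
import com.databricks.spark.xml._ // implicit that adds .xml(...) to DataFrameReader

val spark = SparkSession.builder().appName("xml-column-sketch").getOrCreate()
import spark.implicits._

val srcDf = spark.table("src_table")
val xmlCol = srcDf.select("XMLRECORD").as[String]

// The options set here are silently dropped by the Dataset[String] overload,
// which is the behavior reported in this issue: parsing falls back to the
// default options.
val parsed = spark.read
  .option("rowTag", "record")          // illustrative
  .option("dateFormat", "yyyy-MM-dd")  // illustrative
  .xml(xmlCol)
```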