databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Can't Parse Encoding With Non Ascii Characters #510

Closed xshenigx1 closed 3 years ago

xshenigx1 commented 3 years ago

Hello Spark-xml team,

We have been using spark-xml a lot, thanks for the package!

Recently we have been trying spark-xml with ISO_8859-1 encoded XML that contains Nordic characters, e.g.:

name="Företagsinteckningar"

Could not get the Nordic characters to parse correctly, despite setting the charset to ISO_8859-1.

Traced the issue to this method in com.databricks.spark.xml.util.XmlFile:

def withCharset(
    context: SparkContext,
    location: String,
    charset: String,
    rowTag: String): RDD[String] = {
  // This just checks the charset's validity early, to keep behavior
  Charset.forName(charset)
  context.hadoopConfiguration.set(XmlInputFormat.START_TAG_KEY, s"<$rowTag>")
  context.hadoopConfiguration.set(XmlInputFormat.END_TAG_KEY, s"</$rowTag>")
  context.hadoopConfiguration.set(XmlInputFormat.ENCODING_KEY, charset)
  context.newAPIHadoopFile(location,
    classOf[XmlInputFormat],
    classOf[LongWritable],
    classOf[Text]).map { case (_, text) =>
      new String(text.getBytes, 0, text.getLength, charset)
  }
}

The problem with (_, text) => new String(text.getBytes, 0, text.getLength, charset) seems to be that the Text already contains UTF-8 encoded bytes, populated by the XML parser; it is not charset encoded.

After changing the above code to (_, text) => text.toString(), the Nordic characters parsed correctly.
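
To make the failure mode concrete, here is a minimal sketch of that double decode with no Spark involved (the sample value is the name attribute from above; this is an illustration, not code from the package):

import java.nio.charset.StandardCharsets

object MojibakeDemo {
  def main(args: Array[String]): Unit = {
    val original = "Företagsinteckningar"
    // What XmlInputFormat effectively hands back: UTF-8 bytes in a Text
    val utf8Bytes = original.getBytes(StandardCharsets.UTF_8)
    // Current XmlFile behavior: re-decode those UTF-8 bytes as the charset option
    val wrong = new String(utf8Bytes, 0, utf8Bytes.length, "ISO-8859-1")
    // The fix: decode them as UTF-8, which is what Text.toString does
    val right = new String(utf8Bytes, 0, utf8Bytes.length, StandardCharsets.UTF_8)
    println(wrong) // Företagsinteckningar  <- the garbling we see
    println(right) // Företagsinteckningar
  }
}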

This seems to be a common problem when parsing XML with non-ASCII characters; it would be great to get the fix into the package.

Thanks

srowen commented 3 years ago

What characters are you talking about? I don't think those are in ISO-8859-1

xshenigx1 commented 3 years ago

It seems those characters got truncated on posting; trying again:

name="Allmänna företagsuppgifter"

srowen commented 3 years ago

You'll need to use UTF-8 - this and much of Hadoop/Spark assume that pretty deeply for text files. The charset param is actually pretty misleading here - it just affects things like how the row tag and line separators are interpreted as bytes.
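
For reference, this is roughly how that option flows in from the read path (the rowTag value and path here are just placeholders):

val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "record")        // placeholder row tag
  .option("charset", "ISO-8859-1")   // only affects byte-level matching of tags/separators
  .load("example.xml")               // placeholder path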

xshenigx1 commented 3 years ago

In this case, though, the XML input reader has already converted the XML to UTF-8 text, while the code above, (_, text) => new String(text.getBytes, 0, text.getLength, charset), is written as though the Text could contain charset-encoded bytes. That seems inconsistent with the assumption above. So what issues do you see with the fix?

srowen commented 3 years ago

I'm not sure it has. It read something as UTF-8, gave you the bytes, and you tried to interpret it as a different charset. That won't necessarily work. Are you sure the text is coming through as you intend there?

xshenigx1 commented 3 years ago

Not sure which code you are referring to now?

This is the code in Spark Xml source today:

classOf[Text]).map { case (_, text) =>
  new String(text.getBytes, 0, text.getLength, charset)
}

Like you say, the text is UTF-8, but this code "tried to interpret it as a different charset. That won't necessarily work".

My fix is:

"After changing above code to (_, text) => text.toString(), the Nordic characters parsed correctly."

Please see initial description of the case.
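
Spelled out in context, the proposed change is one line in withCharset (everything else stays as in the snippet at the top of this issue):

context.newAPIHadoopFile(location,
  classOf[XmlInputFormat],
  classOf[LongWritable],
  classOf[Text]).map { case (_, text) =>
    // Text holds UTF-8 bytes; toString decodes them as UTF-8
    text.toString
}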

srowen commented 3 years ago

Ah right, I had forgotten this is what XmlFile does. Hm, I'm actually not sure where this new String(..., charset) construction originated, as it should not work unless charset is UTF-8 (except in special cases like ASCII). Similar Spark code just calls .toString too, which seems more natural.

I guess I'm still surprised if it works, if the Hadoop TextInputFormat is reading your data as UTF-8, but maybe it is otherwise converted by those APIs or something. OK, I will make that change - it seems better in any event, and if it fixes this case, good.

xshenigx1 commented 3 years ago

XmlFile is using XmlInputFormat here, and XmlInputFormat seems to be converting the data from the configured charset to UTF-8.
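
Conceptually (these names are illustrative, not the actual XmlInputFormat internals), the record reader does something like this:

import java.nio.charset.Charset
import org.apache.hadoop.io.Text

// Illustration only: decode the raw record bytes using the configured
// charset, then store the String in a Text, which re-encodes it as UTF-8.
def recordToText(rawBytes: Array[Byte], charset: Charset): Text = {
  val decoded = new String(rawBytes, charset) // charset bytes -> String
  new Text(decoded)                           // String -> UTF-8 bytes
}

So by the time XmlFile sees the Text, the bytes are UTF-8 regardless of the input charset.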

srowen commented 3 years ago

Ah right, I've forgotten so much about how this works. Of course, right. Then that is definitely the right fix; I'll commit it after tests pass.

xshenigx1 commented 3 years ago

Thank you!