databricks / spark-xml

XML data source for Spark SQL and DataFrames
Apache License 2.0

Problem with reading cp1251 file #655

Closed VladIsLuve closed 1 year ago

VladIsLuve commented 1 year ago

Hi. I have a problem parsing an XML document in PySpark using the spark-xml API (pyspark 2.4.0). I have a file with Cyrillic content and the following opening tag:

<?xml version='1.0' encoding='WINDOWS-1251'?>

So when I open it in a text editor with the windows-1251 encoding, I can see the Cyrillic text. But when I read the XML file into a Spark DataFrame with this command:

df = spark.read.format('xml').options(rootTag='products', rowTag='product', charset="cp1251").load('./cp1251')

I see strange symbols instead of Cyrillic. But when I convert the file to UTF-8, the same command works correctly and I get a Spark DataFrame with the Cyrillic symbols. This seems strange to me, because the opening tag declares cp1251 and the content really is written in that encoding, yet it is read correctly only when interpreted as UTF-8.

My question: any ideas why this happens? Is there a way to read a cp1251 XML file directly into a DataFrame without converting it to UTF-8? If there is no way, I would like to understand the reason for this behaviour.

P. S. If it matters, the XML is nested; the Cyrillic content sits several levels deep inside the product blocks.

P. P. S. If I give no encoding information in the DataFrameReader options, the file is read as UTF-8, without taking the declaration in the opening tag into account. That also seems strange; if you know anything about it, please share.

srowen commented 1 year ago

I think only UTF-8 is supported. The underlying text libraries used here from Hadoop only do UTF-8. If you can convert the encoding, yeah that should work. cp1251 declarations, etc are ignored.
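
If converting is acceptable, a minimal sketch of that workaround in PySpark might look like the following (the file names are hypothetical): the file is re-encoded to UTF-8 on the driver before spark-xml reads it, and the XML declaration is rewritten so it no longer claims WINDOWS-1251.

# Hypothetical paths; re-encode the cp1251 file to UTF-8 before handing it to spark-xml.
with open('./cp1251/products.xml', 'r', encoding='cp1251') as src, \
        open('./products_utf8.xml', 'w', encoding='utf-8') as dst:
    for line in src:
        # Rewrite the declaration seen in the original file's opening tag.
        dst.write(line.replace("encoding='WINDOWS-1251'", "encoding='UTF-8'"))

df = (spark.read.format('xml')
      .options(rootTag='products', rowTag='product')
      .load('./products_utf8.xml'))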

VladIsLuve commented 1 year ago

I think only UTF-8 is supported. The underlying text libraries used here from Hadoop only do UTF-8. If you can convert the encoding, yeah that should work. cp1251 declarations, etc are ignored.

Oh, thanks a lot! Is that because Java's StandardCharsets is used? It contains only UTF encodings, and as far as I understood from the source code, no other encodings are imported anywhere.

srowen commented 1 year ago

No, it's because Hadoop's TextFormat assumes UTF-8.
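
For what it's worth, that assumption alone explains the garbled output: Cyrillic bytes in cp1251 are not valid UTF-8, so decoding them as UTF-8 produces replacement characters. A small illustration (the sample word is arbitrary):

original = 'товар'                  # an arbitrary Cyrillic word
raw = original.encode('cp1251')     # b'\xf2\xee\xe2\xe0\xf0'
# Decoding those bytes as UTF-8, as the Hadoop text reader effectively does,
# yields replacement characters instead of the original Cyrillic.
print(raw.decode('utf-8', errors='replace'))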

VladIsLuve commented 1 year ago

Sorry, didn't notice. Yeah, the documentation of Hadoop's Text class mentions that it supports UTF-8 only. Thanks again, have a nice day :)

VladIsLuve commented 1 year ago

Excuse me, one more question, please. I would like to clarify at which point the encoding is ignored: when parsing the opening tag, or when handling the options? As I understand it, the latter, but there is an example of using this option with a non-UTF encoding: https://kb.databricks.com/special-characters-in-xml

srowen commented 1 year ago

Hm, actually maybe it works if you set the 'charset' in the options. But it would not pay any attention to the xml directive's encoding. So I think the problem is Java support for the character encoding. I thought 1251 was built in though: https://docs.oracle.com/javase/8/docs/technotes/guides/intl/encoding.doc.html

Are you sure the DF contents are wrong, and it's not just a display issue in your terminal?
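
One way to check that, rather than relying on df.show() in the terminal, is to pull a row back to the driver and look at the raw value (the column name below is hypothetical):

row = df.select('name').first()
print(repr(row['name']))             # the escaped form makes mojibake obvious
print(row['name'].encode('utf-8'))   # the underlying bytes of the parsed string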

VladIsLuve commented 1 year ago

As far as I can see, reading and writing XML in the cp1251 encoding produces a file with mis-encoded symbols. I guess the file was written in UTF-8 despite my declaring an output encoding, because switching the text editor to cp1251 turned the non-Latin symbols into something else again. So I think the content of the DataFrame is also wrong. There is a parameter in ~/.sparkmagic/config.json called "pyspark_dataframe_encoding"; I tried both values, "utf-8" and "cp1251", and the result was the same. Unfortunately, I don't have access to the Java environment variables, so I can't change the encodings on the Java side.
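
As a sanity check of what was actually written, the raw bytes of one of the output part files could be decoded both ways to see which reading yields legible Cyrillic (the path is hypothetical):

with open('./out/part-00000', 'rb') as f:
    raw = f.read()

for enc in ('utf-8', 'cp1251'):
    # cp1251 will decode almost any byte sequence, so judge by which result is readable.
    print(enc, '->', raw.decode(enc, errors='replace')[:200])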