ashtuchkin / iconv-lite

Convert character encodings in pure javascript.
MIT License

Wrong maccyrillic decoding #297

Open batyshkaLenin opened 1 year ago

batyshkaLenin commented 1 year ago

In this encoding, the character ю is followed by the symbol ¤. Because of this, wherever the letter "я" should appear, the symbol "€" (the last symbol) is decoded instead. [image]

ashtuchkin commented 1 year ago

Hmm I see the letter я at 0xDF, could it be intentional?

ashtuchkin commented 1 year ago

Also ¤ is the "current currency" symbol AFAIK, so I think it should be converted to Euro as expected. Let me know if it's a wrong assumption.

batyshkaLenin commented 1 year ago

The problem is that the decoding goes wrong. If you write a maccyrillic decoding test, you get ¤ instead of the letter я. The letter я is not the ¤ symbol. You are correct that ¤ is a currency symbol.

ashtuchkin commented 1 year ago

Note that iconv-lite here uses generated data from the low-level iconv library, which is an informal standard for character encoding conversion, so I tend to trust it unless there's compelling data that it's wrong.

ashtuchkin commented 1 year ago

Wait, what do you expect the code for this letter to be: 0xFF or 0xDF?

batyshkaLenin commented 1 year ago

The code for this letter should be 0xDF, but it is decoded as if it were 0xFF. I don't know how to prove this, except that when I enter the letter я in Numbers on macOS, it turns into ¤ after decoding, even though it should remain я.

batyshkaLenin commented 1 year ago

To check this, you could write a test for this encoding, as well as for the other Cyrillic encodings.

ashtuchkin commented 1 year ago

Well, if you can debug print the Buffer that is sent to the decode() method, we can check which byte corresponds to я there and potentially add a test. Iconv-lite is pretty thoroughly tested already, but it uses either the iconv library or WHATWG as the "ground truth". These sources might be wrong, but that's pretty rare.
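One way to produce the debug print requested above is a small helper that renders a Buffer as `\xNN` escapes, so the exact bytes can be pasted into the issue. This is only a sketch: `toHexEscapes` is a hypothetical name and is not part of iconv-lite, and the sample bytes are simply the two values discussed in this thread.

```javascript
// Hypothetical helper (not part of iconv-lite): format every byte of a
// Buffer as a \xNN escape so the raw input to decode() can be inspected.
function toHexEscapes(buf) {
  return Array.from(buf, (b) => "\\x" + b.toString(16).padStart(2, "0")).join("");
}

// Sample bytes taken from this thread: 0xFE and 0xDF.
const sample = Buffer.from([0xfe, 0xdf]);
console.log(toHexEscapes(sample));
```

Calling the helper on the Buffer right before it is passed to `decode()` would show whether я arrives as 0xDF or as 0xFF.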

batyshkaLenin commented 1 year ago

\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xdf must be equivalent to АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя, but it's not. Or am I misunderstanding something?
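The layout claimed in this comment can be written down as a small lookup, which makes it easier to compare against whatever a decoder returns. The sketch below is not iconv-lite's table; `expectedCyrillicLetter` is a hypothetical helper that encodes only the three ranges named above (0x80..0x9F, 0xE0..0xFE, and 0xDF) and assumes they map onto the contiguous Unicode ranges for А..Я and а..я.

```javascript
// Hypothetical helper encoding the layout described in this comment:
//   0x80..0x9F -> А..Я (U+0410..U+042F)
//   0xE0..0xFE -> а..ю (U+0430..U+044E)
//   0xDF       -> я    (U+044F)
function expectedCyrillicLetter(byte) {
  if (byte >= 0x80 && byte <= 0x9f) return String.fromCharCode(0x0410 + (byte - 0x80));
  if (byte >= 0xe0 && byte <= 0xfe) return String.fromCharCode(0x0430 + (byte - 0xe0));
  if (byte === 0xdf) return "я";
  return null; // outside the letter ranges discussed here
}

// Build the same byte sequence as in the comment, in the same order.
const bytes = [];
for (let b = 0x80; b <= 0x9f; b++) bytes.push(b);
for (let b = 0xe0; b <= 0xfe; b++) bytes.push(b);
bytes.push(0xdf);

const expected = bytes.map(expectedCyrillicLetter).join("");
console.log(expected);
```

With iconv-lite installed, `expected` could then be compared against `iconv.decode(Buffer.from(bytes), "maccyrillic")` to locate any byte that decodes differently.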

ashtuchkin commented 1 year ago

Just checked it and looks correct:

$ node
> iconv = require("iconv-lite")
> iconv.decode(Buffer.from("\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8A\x8B\x8C\x8D\x8E\x8F\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xdf", "binary"), "maccyrillic")
'АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюя'

ashtuchkin commented 1 year ago

Where are you getting the wrong results?