ashtuchkin / iconv-lite

Convert character encodings in pure javascript.
MIT License

Encode with Shift JIS but receive EUCJP #270

Closed QuocNguyen799 closed 2 years ago

QuocNguyen799 commented 2 years ago

I want to encode this character to Shift JIS: 髙

const encoded = iconv.encode('髙', "Shift_JIS")

But when I run detection on "encoded", I receive EUCJP instead of SJIS:

const detected = Encoding.detect(encoded);

The "encoded" value that I receive is 8de8, but it should be 3f: https://www.skandissystems.com/testCharset.pl

ashtuchkin commented 2 years ago

Encoding detection is not precise, especially given a single character. Do you have any specific question to iconv-lite here?
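
To illustrate why a single character gives a detector so little to work with, here is a minimal sketch that checks whether one byte pair is structurally valid in Shift_JIS and in EUC-JP. The helper names isShiftJisPair and isEucJpPair are hypothetical and are not part of iconv-lite or of any detection library. The ranges are the usual two-byte lead/trail ranges for each encoding. The sample bytes FB FC are the ones iconv-lite reports for 髙 later in this thread.

```javascript
// Structural checks only: they test byte ranges, not whether a character
// is actually assigned at that position in the encoding's table.
function isShiftJisPair(lead, trail) {
  const leadOk = (lead >= 0x81 && lead <= 0x9f) || (lead >= 0xe0 && lead <= 0xfc);
  const trailOk = (trail >= 0x40 && trail <= 0x7e) || (trail >= 0x80 && trail <= 0xfc);
  return leadOk && trailOk;
}

function isEucJpPair(lead, trail) {
  // Two-byte EUC-JP sequences use 0xA1..0xFE for both bytes.
  return lead >= 0xa1 && lead <= 0xfe && trail >= 0xa1 && trail <= 0xfe;
}

const bytes = [0xfb, 0xfc];
const candidates = [];
if (isShiftJisPair(bytes[0], bytes[1])) candidates.push("Shift_JIS");
if (isEucJpPair(bytes[0], bytes[1])) candidates.push("EUC-JP");
console.log(candidates);
```

Both checks pass for this pair, so a detector has to fall back on heuristics, and with only two bytes of input either answer is plausible.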

QuocNguyen799 commented 2 years ago

Thanks for your reply, but the problem is not only that the detection is imprecise; the encoding is also incorrect. This character 髙 should be '3f' when converted to Shift_JIS.

ashtuchkin commented 2 years ago

3f in Shift_JIS is just question mark "?". I assume it means that the script you're referring to doesn't know how to encode this character.

Also not sure where you're getting 8de8. On my machine I get bytes 0xFB 0xFC:

> iconv.encode('髙', "Shift_JIS")
<Buffer fb fc>

Checking in a recent browser that supports https://encoding.spec.whatwg.org/ (the main standard that iconv-lite follows), I see that this is indeed the correct encoding:

let dec = new TextDecoder("Shift_JIS");
let buf = Uint8Array.from([0xfb, 0xfc]);
document.body.innerText = dec.decode(buf);  // shows "髙"

QuocNguyen799 commented 2 years ago

The question mark "?" (3f) is exactly what I need, because this character belongs to EUC_JP, not Shift_JIS. When I try it with PHP, it works:

$str = mb_convert_encoding('髙', "SJIS");
$str = mb_convert_encoding($str, "UTF-8", "SJIS");
var_dump($str);

I don't know much about encoding standards. Maybe iconv-lite and PHP follow different encoding standards. Do you have any suggestions for this? If not, I will close this issue. Thank you for your time.

ashtuchkin commented 2 years ago

As far as I know, recent versions of Shift_JIS such as Shift_JIS-2004 can encode the characters that were previously only encodable with EUC_JP (see https://en.wikipedia.org/wiki/Shift_JIS#Shift_JISx0213_and_Shift_JIS-2004). I assume PHP does not support it, or is somehow more strict about using the older version of Shift_JIS?

Iconv-lite only supports the extended version of Shift_JIS. I don't think there's an easy way to restrict encoding to a strict Shift_JIS. One hack I can think of could be to replace all "unsupported" characters before encoding with an explicit "?", but that requires knowledge of all these unsupported chars.
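
A minimal sketch of that hack, assuming the caller supplies the list of characters to treat as unsupported. The helper replaceUnsupported and the one-entry blocklist are hypothetical and only illustrative. iconv-lite does not ship such a list, so a real one would have to be built from the strict Shift_JIS table you want to match.

```javascript
// Hypothetical helper, not part of iconv-lite: replaces every character
// rejected by isAllowed with a fixed replacement before encoding.
function replaceUnsupported(text, isAllowed, replacement = "?") {
  let out = "";
  // for...of iterates by code point, so surrogate pairs stay intact.
  for (const ch of text) {
    out += isAllowed(ch) ? ch : replacement;
  }
  return out;
}

// Illustrative blocklist only; a complete list is an assumption left to the caller.
const blocked = new Set(["髙"]);
const cleaned = replaceUnsupported("髙橋", (ch) => !blocked.has(ch));
console.log(cleaned);

// The cleaned string can then be passed on as usual, e.g.:
// const encoded = iconv.encode(cleaned, "Shift_JIS");
```

Passing a predicate instead of a fixed table keeps the sketch independent of how the list of unsupported characters is obtained.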