aadsm / jschardet

Character encoding auto-detection in JavaScript (port of python's chardet)
GNU Lesser General Public License v2.1

GB18030 encoded file incorrectly detected as gb2312 #49

Open wesinator opened 5 years ago

wesinator commented 5 years ago

https://github.com/atom/encoding-selector/issues/65

Steps to Reproduce

https://github.com/malice-plugins/yara/blob/17a4fc946febe8b002e285f591bcb21b92a99e9e/rules/userdb_panda.yar

Expected behavior: The encoding of the file is detected as GB18030. Converting with iconv -f GB18030 -t UTF-8 userdb_panda.yar succeeds.

Actual behavior: Atom auto-detects the encoding as gb2312 and shows 'undefined encoding'.

iconv fails to convert from GB2312, but works with GB18030:

iconv -f GB2312 -t UTF-8 userdb_panda.yar
iconv: illegal input sequence at position 29230
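The same check can be reproduced without iconv. The following is a minimal sketch, assuming Python's standard gb2312 and gb18030 codecs are an acceptable stand-in for iconv's converters here. The helper name first_decode_error is hypothetical, and the sample bytes are made up for illustration; they are not taken from userdb_panda.yar.

```python
def first_decode_error(data: bytes, encoding: str):
    """Return the byte offset of the first undecodable sequence, or None."""
    try:
        data.decode(encoding)  # strict mode raises on the first bad sequence
    except UnicodeDecodeError as exc:
        return exc.start
    return None


# Illustrative sample only: five ASCII bytes followed by one character
# that exists in GBK and GB 18030 but not in GB 2312.
sample = "rule ".encode("gb18030") + "镕".encode("gb18030")

for name in ("gb2312", "gb18030"):
    print(name, first_decode_error(sample, name))
```

To run it against a real file, read the file in binary mode and pass the bytes to first_decode_error in place of sample.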

Reproduces how often: Always

byyxx128 commented 4 years ago

Glad to see you.

I'm just an ordinary user, not an official maintainer, so I'll only share some of my thoughts here.

GB 2312 ⊊ GBK ⊊ GB 18030

(By the way, the standard GB 2312-1980 was renamed GB/T 2312-1980 in 2017.)

For standard documents they are: GB/T 2312-1980 ⊊ GBK 1.0 ⊊ GB 18030-2000 ⊊ GB 18030-2005

The latest standard in effect is GB 18030-2005. All of the others have been superseded.
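The subset chain can be observed from the encoding side. This is a rough sketch, assuming Python's built-in gb2312, gbk and gb18030 codecs approximate the three standards closely enough for illustration; encodable_in is a hypothetical helper and the sample characters are my own choice.

```python
GB_CODECS = ("gb2312", "gbk", "gb18030")  # narrowest to widest


def encodable_in(text: str) -> list:
    """Return the GB-family codecs that can encode every character of text."""
    ok = []
    for name in GB_CODECS:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # this codec lacks at least one character
        ok.append(name)
    return ok


common = "中文"  # both characters are part of GB 2312

# Characters shared by all three sets map to the same bytes in each codec.
same_bytes = len({common.encode(name) for name in GB_CODECS}) == 1

print(encodable_in(common), same_bytes)
print(encodable_in("镕"))          # a character added in GBK
print(encodable_in("\U00020000"))  # first code point of CJK Extension B
```

Because shared characters produce identical bytes, a file that only uses GB 2312 characters looks the same on disk whichever of the three encodings was chosen when saving it.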

It may be hard to tell whether a file is encoded in GB 18030 unless it contains characters that are unique to GB 18030.

For example, if I create a file in GB 18030 and enter some characters from CJK Unified Ideographs Extension B, which is included in GB 18030-2005, encoding guessing does not decode the file correctly.
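One way to picture the problem is a strict-decode fallback that tries the narrowest encoding first. This is only a sketch under my own assumptions: narrowest_gb_codec is a hypothetical helper built on Python's standard codecs, and it is not how jschardet or chardet make their guess. A successful decode also does not prove the encoding; it only rules out the narrower ones that failed.

```python
def narrowest_gb_codec(data: bytes):
    """Return the first GB-family codec that decodes data strictly, or None."""
    for name in ("gb2312", "gbk", "gb18030"):  # narrowest to widest
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue  # invalid for this codec, try the next wider one
        return name
    return None


# U+20000 is the first code point of CJK Unified Ideographs Extension B.
# GB 18030 encodes it as a four-byte sequence.
ext_b = "\U00020000".encode("gb18030")

print(len(ext_b), narrowest_gb_codec(ext_b))
print(narrowest_gb_codec("中文".encode("gb18030")))
```

With this ordering, input that contains only GB 2312 characters is always reported as gb2312, even if it was written as GB 18030. Only bytes that the narrower codecs reject, such as the four-byte Extension B sequence, push the result to gb18030.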

https://github.com/microsoft/vscode/issues/33720