albfernandez / juniversalchardet

Originally exported from code.google.com/p/juniversalchardet

Chinese Internal Code Specification (GBK) not supported? #34

Closed liyujiang-gzu closed 3 years ago

liyujiang-gzu commented 4 years ago

After testing, GB2312 cannot be detected either.

liyujiang-gzu commented 4 years ago

By the way, jchardet (http://jchardet.sourceforge.net) can detect it.

siqiniao commented 4 years ago

I hope this gets supported; GBK is used often.

albfernandez commented 4 years ago

Hi, I don't know how to work with (or test) Chinese charsets. GB18030 was disabled in 2.0 to fix #9 and #11. Can you try version 1.0.3 and check whether it works for you? Then we can try to re-enable it.

liyujiang-gzu commented 4 years ago

Thank you for your reply.

amake commented 4 years ago

It's easy to create a file in GBK:

  1. Copy some Chinese text from e.g. https://zh.wikipedia.org/wiki/%E6%B1%89%E5%AD%97%E5%86%85%E7%A0%81%E6%89%A9%E5%B1%95%E8%A7%84%E8%8C%83
  2. Paste into an Emacs buffer visiting e.g. ~/gbk-sample.txt
  3. Do M-x set-buffer-file-coding-system gbk and save

The result is attached. gbk-sample.txt
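
For anyone without Emacs, the same kind of sample can probably be produced from Java directly. This is only a minimal sketch: the class name `GbkSampleWriter` and the default output file name are made up for illustration, and it assumes the JDK in use ships the `GBK` charset. The text is the title of the Wikipedia article linked above (汉字内码扩展规范), written with Unicode escapes so the encoding of the source file does not matter.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class GbkSampleWriter {
    // Assumption: this JDK provides the GBK charset (standard full JDK builds do).
    static final Charset GBK = Charset.forName("GBK");

    /** Encodes the given text as GBK bytes. */
    static byte[] encode(String text) {
        return text.getBytes(GBK);
    }

    public static void main(String[] args) throws IOException {
        // Eight Chinese characters, given as Unicode escapes.
        String text = "\u6C49\u5B57\u5185\u7801\u6269\u5C55\u89C4\u8303";
        // Output path: first argument if present, otherwise a hypothetical default.
        Path out = Paths.get(args.length > 0 ? args[0] : "gbk-sample.txt");
        Files.write(out, encode(text));
        System.out.println("Wrote " + Files.size(out) + " bytes to " + out);
    }
}
```

The resulting file can then be passed to the detector in the same way as the attached sample.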

liyujiang-gzu commented 4 years ago

gb2312-sample.txt

amake commented 4 years ago

I can confirm that juniversalchardet correctly detects the sample file after uncommenting the line here:

$ mvn test # fails because of false-positive unit tests
$ java -cp target/classes:target/test-classes org/mozilla/universalchardet/example/TestDetector gbk-sample.txt
Detected encoding = GB18030

(It also works for the file posted by @liyujiang-gzu)

I'm not sure what should be done to prevent the false positives. I checked and the texts in both of the unit tests don't seem to be valid GBK.
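
One way to check whether a byte sequence is valid GBK at all is to run it through a strict JDK decoder that reports errors instead of substituting replacement characters. The sketch below is not how juniversalchardet decides; `StrictDecodeCheck` and `isValid` are hypothetical names, and it relies on the JDK's own `GBK` charset tables. Passing this check does not prove the text is GBK (many byte sequences are valid in several encodings), but failing it is a strong hint that a GB18030/GBK verdict is a false positive.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class StrictDecodeCheck {
    /** Returns true if the bytes decode without errors in the named charset. */
    static boolean isValid(byte[] data, String charsetName) {
        try {
            Charset.forName(charsetName).newDecoder()
                // Report problems as exceptions rather than replacing bad input.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // GBK bytes for U+6C49 U+5B57.
        byte[] gbk = {(byte) 0xBA, (byte) 0xBA, (byte) 0xD7, (byte) 0xD6};
        // A lone lead byte: the two-byte sequence is cut off.
        byte[] truncated = {(byte) 0xBA};
        System.out.println("complete sequence valid as GBK:  " + isValid(gbk, "GBK"));
        System.out.println("truncated sequence valid as GBK: " + isValid(truncated, "GBK"));
    }
}
```

Running the unit-test texts through a check like this would show whether they can be rejected on validity alone.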

liyujiang-gzu commented 4 years ago

GB 18030 is compatible with GBK and GBK is compatible with GB2312.
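
A quick way to see this compatibility is to encode the same text with all three charsets and compare the bytes. The following is a small sketch using only the JDK; `GbCompatibility` is a made-up class name, and it assumes the JDK provides the `GB2312`, `GBK` and `GB18030` charsets. It only demonstrates the point for characters that exist in GB2312; the compatibility is upward, so GBK and GB18030 contain characters that GB2312 cannot represent.

```java
import java.nio.charset.Charset;

final class GbCompatibility {
    /** Encodes the text with the named charset. */
    static byte[] encode(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName));
    }

    /** Formats bytes as space-separated uppercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+6C49 U+5B57, two characters that are part of GB2312.
        String text = "\u6C49\u5B57";
        for (String name : new String[] {"GB2312", "GBK", "GB18030"}) {
            System.out.println(name + ": " + hex(encode(text, name)));
        }
    }
}
```

Because the bytes are identical for such text, a detector that reports GB18030 for a GBK or GB2312 file is still giving a usable answer.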

dwhmofly commented 3 years ago

GB18030 is the registered Internet name for the official character set of the People's Republic of China (PRC) superseding GB2312. As a Unicode Transformation Format (i.e. an encoding of all Unicode code points), GB18030 supports both simplified and traditional Chinese characters. It is also compatible with legacy encodings including GB2312, CP936, and GBK 1.0.

https://en.wikipedia.org/wiki/GB_18030

@albfernandez @amake For Chinese text, GB18030 support alone is enough.