Joungkyun / libchardet

libchardet - Mozilla's Universal Charset Detector C/C++ API

Japanese UTF-8 encoding detected as TIS-620 (Windows-874 (Thai)) #16

Open hpwamr opened 4 years ago

hpwamr commented 4 years ago

Hello, for the development of Notepad3 we use the UCHARDET charset detector.

In issue #1831 we ran into a detection problem: Japanese text encoded as UTF-8 is detected by UCHARDET as TIS-620 (Windows-874 (Thai)) with a reliability level of 99%. 😕

These text editors detect it as UTF-8 and display it correctly.

Here is the incorrect detection as "TIS-620":

{
  "manifest_version": 2,
  "name": "k view",
  "version": "0.5",
  "description": "ใƒ†ใ‚นใƒˆใ€‚",
  "browser_action": {
    "default_icon": { "19": "round-done-button.png" }
  },
}

Here is the correct detection as "UTF-8":

{
  "manifest_version": 2,
  "name": "k view",
  "version": "0.5",
  "description": "テスト。",
  "browser_action": {
    "default_icon": { "19": "round-done-button.png" }
  },
}
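
The garbled description in the first listing is consistent with the UTF-8 bytes of テスト。 being read one byte at a time through a Thai code page. Here is a minimal Python sketch of that. It assumes Python's built-in `cp874` codec as a stand-in for whatever the editor does internally, and the variable names are illustrative only:

```python
# UTF-8 encodes each of these Japanese characters as three bytes.
sample = "テスト。"
utf8_bytes = sample.encode("utf-8")
utf8_hex = utf8_bytes.hex(" ")
print(utf8_hex)

# Every character here starts with the lead byte 0xE3, which Windows-874
# maps to the Thai letter that repeats in the garbled output.
lead_as_thai = bytes([0xE3]).decode("cp874")
print(lead_as_thai)

# 0xB9, the last byte of ス, is also a valid Thai letter in Windows-874.
trail_as_thai = bytes([0xB9]).decode("cp874")
print(trail_as_thai)
```

With only four non-ASCII characters in the file, a detector has very few bytes from which to tell the two readings apart.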

Attached is the original sample: Error Detection encoding_utf-8 (issue #1831).zip

Thanks in advance for your attention. Have a nice day. hpwamr

Feel free to test the BETA version "Notepad3Portable_5.20.116.2708_BETA.paf.exe.7z" or higher. See "Notepad3 BETA-channel access #1129" or here Notepad3Portable_5.20.116.2708_BETA.paf.exe.7z.

Note: "Notepad3Portable BETA" can be used in "2 flavors" (with or without the extension ".7z").

Your comments and suggestions are always welcome... 😃

Joungkyun commented 3 years ago

Although this is a uchardet issue, it also affects libchardet, because libchardet uses the same algorithm as uchardet.

The string is too short for sampling. If the string that remains after removing the ASCII characters is shorter than 10, accurate sampling is unlikely. For example, ススト。 is recognized as TIS-620, but ススト。ススト。 is recognized as UTF-8.
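
One way a caller could act on this is to measure how much non-ASCII text a buffer holds before trusting the detector's answer. The following is a minimal Python sketch, not libchardet or uchardet code. The helper names `non_ascii_count` and `sample_is_short`, the constant `MIN_NON_ASCII`, and the reading of "10" as a character count are all assumptions made for illustration.

```python
from typing import Optional

# Threshold taken from the comment above; treating it as a number of
# characters (not bytes) is an assumption.
MIN_NON_ASCII = 10


def non_ascii_count(data: bytes) -> Optional[int]:
    """Return the number of non-ASCII characters if data is valid UTF-8, else None."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    return sum(1 for ch in text if ord(ch) > 0x7F)


def sample_is_short(data: bytes) -> bool:
    """True when data is valid UTF-8 and has fewer than MIN_NON_ASCII non-ASCII characters."""
    count = non_ascii_count(data)
    return count is not None and count < MIN_NON_ASCII


# Hypothetical buffer modelled on the reported manifest file.
manifest = '{ "description": "テスト。" }'.encode("utf-8")
print(non_ascii_count(manifest))
print(sample_is_short(manifest))
```

An editor might, for instance, prefer UTF-8 whenever the buffer decodes cleanly and `sample_is_short` returns true, instead of relying on a statistical guess made from so little data.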