hpwamr opened this issue 4 years ago (status: Open)
Although this is a uchardet issue, it also affects libchardet, because libchardet uses the same algorithm as uchardet.
The string is too short for sampling.
If the string that remains after removing the ASCII characters is shorter than 10, accurate sampling is unlikely.
For example, "ススト。" is recognized as TIS-620, but "ススト。ススト。" is recognized as UTF-8.
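The length rule of thumb above can be sketched in a few lines of Python. This is only an illustration, not code from uchardet or libchardet: the helper names and the `min_len` parameter are hypothetical, and the comment above does not say whether the length is counted in characters or bytes, so the sketch prints both.

```python
def non_ascii_remainder(text: str) -> str:
    """Return text with every ASCII character (code point < 128) removed."""
    return "".join(ch for ch in text if ord(ch) >= 128)


def is_probably_too_short(text: str, min_len: int = 10) -> bool:
    """Hypothetical check: True if the non-ASCII remainder is shorter than min_len.

    Assumption: the length is counted in characters. The thread does not
    specify the unit, so treat the threshold as a rough guideline only.
    """
    return len(non_ascii_remainder(text)) < min_len


if __name__ == "__main__":
    for sample in ("ススト。", "ススト。ススト。"):
        rest = non_ascii_remainder(sample)
        # Print both the character count and the UTF-8 byte count of the remainder.
        print(sample, len(rest), len(rest.encode("utf-8")))
```

A caller could use such a check to decide whether a detector's result deserves trust at all, for example by falling back to a default encoding when the remainder is very short.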
Hello, for the development of Notepad3 we use the uchardet charset detector.
In issue #1831 we ran into poor detection of Japanese UTF-8 text: uchardet reports it as TIS-620 (Windows-874, Thai) with a reliability level of 99%. 😕
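One possible way to cross-check such a result, not something proposed in this thread, is to test whether the bytes are strictly valid UTF-8 before trusting a single-byte guess. The sketch below uses only Python's built-in codec; `is_valid_utf8` is a hypothetical helper name, and the Thai byte sequence is an illustrative sample, not the attached file.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without any error."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    japanese = "ススト。".encode("utf-8")
    # Illustrative TIS-620 bytes for a short Thai word (assumed sample).
    thai_tis620 = b"\xca\xc7\xd1\xca\xb4\xd5"
    print(is_valid_utf8(japanese))
    print(is_valid_utf8(thai_tis620))
```

The check is asymmetric: text in a single-byte encoding with many high bytes rarely happens to form valid UTF-8 sequences, while valid UTF-8 bytes can often also be read as a single-byte encoding. A successful strict decode is therefore a reasonable hint, though not a proof, that the input is UTF-8.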
These text editors detect it as UTF-8 and display it correctly.
Here is the incorrect detection as "TIS-620".
Here is the correct detection as "UTF-8".
Attached is the original sample: Error Detection encoding_utf-8 (issue #1831).zip
Thanks in advance for your attention. Have a nice day. hpwamr
Feel free to test the BETA version "Notepad3Portable_5.20.116.2708_BETA.paf.exe.7z" or higher. See "Notepad3 BETA-channel access #1129" or here Notepad3Portable_5.20.116.2708_BETA.paf.exe.7z.
Note: "Notepad3Portable BETA" can be used in "2 flavors" (with or without the extension ".7z").
Your comments and suggestions are always welcome... 😃