Joungkyun / libchardet

libchardet - Mozilla's Universal Charset Detector C/C++ API

Single UTF-8 character detected as Windows-1258 #17

Open hpwamr opened 4 years ago

hpwamr commented 4 years ago

Hello, for the development of Notepad3 we use the UCHARDET charset detector.

In issue #1848 we ran into a problem where a text containing a single non-ASCII "UTF-8" character is detected by UCHARDET as Windows-1258, with a reliability level of 72%. 😕

Here it is the French "é" character (in "Précis:")!


In the following sample, it is the character "¶" that is wrongly detected and shown as "ΒΆ".

I would like to add to this issue a well-known text, build_np3portableapp.cmd, encoded in UTF-8 
with ONLY ONE non-ASCII character, "delims=¶", on line 33 of this shortened batch file.

- This text is opened incorrectly as "ISO-8859-7 (Greek)" by Notepad3: "delims=ΒΆ"
- This text is opened correctly as "UTF-8" by Notepad3 if I add an encoding tag ":: encoding: UTF-8"
- This text is opened correctly as "UTF-8" by Notepad++, EditPad Lite 7, EditPlus, Notepad2, 
  Notepad2e, Notepad2-mod, Notepad2-zufuliu and VS Code!
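
To make the two reported symptoms concrete, here is a minimal Python sketch (not part of Notepad3 or uchardet) that encodes the two characters as UTF-8 and then deliberately decodes the same bytes with the wrongly detected code pages. The variable names are illustrative only; the codec names are Python's standard ones.

```python
# UTF-8 byte sequences of the two characters from the report.
pilcrow_utf8 = "¶".encode("utf-8")   # b'\xc2\xb6'
e_acute_utf8 = "é".encode("utf-8")   # b'\xc3\xa9'

# Decoding the same bytes as a single-byte code page yields two
# unrelated characters per original character (classic mojibake).
as_greek = pilcrow_utf8.decode("iso8859_7")    # ISO-8859-7 (Greek)
as_vietnamese = e_acute_utf8.decode("cp1258")  # Windows-1258 (Vietnamese)

print(pilcrow_utf8.hex(), "->", as_greek)
print(e_acute_utf8.hex(), "->", as_vietnamese)
```

Each two-byte UTF-8 sequence is also a valid pair of characters in the single-byte code page, which is why a purely statistical detector has so little to go on with one such character.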

Attached are the 2 samples: Error Detection Single UTF-8 (issue #1848).zip

Thanks in advance for your attention. Have a nice day. hpwamr

Feel free to test the BETA version "Notepad3Portable_5.20.116.2708_BETA.paf.exe.7z" or higher. See "Notepad3 BETA-channel access #1129" or here Notepad3Portable_5.20.116.2708_BETA.paf.exe.7z.

Note: "Notepad3Portable BETA" can be used in "2 flavors" (with or without the extension ".7z").

Your comments and suggestions are always welcome... 😃

Joungkyun commented 3 years ago

As in #16, the input that has to be analyzed is too short for the encoding to be determined reliably.

Note that the Windows-1258 issue does not occur with libchardet. This is due to differences in the Vietnamese language tables between libchardet and uchardet.
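
For inputs this short, one possible application-side mitigation is to test whether the bytes are strictly valid UTF-8 before trusting a statistical guess. The sketch below is an assumption about how a caller could do this, not something libchardet or uchardet is stated to do in this thread; `looks_like_utf8` and `choose_encoding` are hypothetical helper names, and `detector_guess` stands in for whatever the detector returned.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if data is valid UTF-8 and contains non-ASCII bytes."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # Pure ASCII is valid UTF-8 too, but says nothing about the encoding.
    return any(b >= 0x80 for b in data)


def choose_encoding(data: bytes, detector_guess: str) -> str:
    """Hypothetical wrapper: prefer UTF-8 when the bytes validate, else keep the guess."""
    if looks_like_utf8(data):
        return "UTF-8"
    return detector_guess


# The fragment from the report, encoded as UTF-8.
sample = "delims=¶".encode("utf-8")
print(choose_encoding(sample, "ISO-8859-7"))
```

The trade-off is that a short text in a legacy code page can, by coincidence, also be valid UTF-8, so this check shifts the bias toward UTF-8 rather than removing the ambiguity.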