Invalid file charset detection under Japanese locale

ProgerXP commented 4 years ago

Very similar issue to #269, also tested on XP and Win10. Under a Japanese locale, Notepad2 thinks the attached file is Unicode-encoded. Under another locale, it correctly detects it as UTF-8. I found that about 10% of similar text files hit this problem.

BadCharsetDetection.zip

I tried to locate a symbol or line that causes the mis-detection, but could only determine that individual characters mostly do not matter and that it happens with some combinations of a character plus a trailing space. It's often enough to change just one symbol (e.g. replace a multi-byte symbol with a single-byte one) to "fix" the detection.

ProgerXP commented 4 years ago

Found this entry in Notepad2 FAQ: http://www.flos-freeware.ch/development-releases/notepad2-FAQs.html#unicode-detection

There's a link to MSDN too. Maybe it is related.

That said, is charset detection performed by the Win32 API, or can it be influenced? I noticed it doesn't detect Shift-JIS correctly (at all, under either locale), and this should be corrected if it's not too hard.

cshnik commented 3 years ago

Please consider the following:

The IsTextUnicode function is used to decide whether the provided text is Unicode.
It was found that the function behaves differently depending on the OS locale. For this file, the result is 0 under an English locale and 0x402 under a Japanese locale (that is, IS_TEXT_UNICODE_DBCS_LEADBYTE | IS_TEXT_UNICODE_STATISTICS).
The committed change addresses this improper detection of Unicode text under a Japanese locale and follows the Microsoft recommendation:

The IS_TEXT_UNICODE_STATISTICS and IS_TEXT_UNICODE_REVERSE_STATISTICS tests use statistical analysis. These tests are not foolproof. The statistical tests assume certain amounts of variation between low and high bytes in a string, and some ASCII strings can slip through. For example, if lpv indicates the ASCII string 0x41, 0x0A, 0x0D, 0x1D (A\n\r^Z), the string passes the IS_TEXT_UNICODE_STATISTICS test, although failure would be preferable.
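To make the 0x402 result easier to read, here is a small Python sketch that decodes an IsTextUnicode result mask into flag names. The bit values are the ones from the Windows SDK headers. The `trust_as_utf16` policy is only a hypothetical illustration of "do not rely on the statistical tests alone"; it is not taken from the actual Notepad2e commit.

```python
# Bit values of the IsTextUnicode result mask (Windows SDK headers).
FLAG_NAMES = {
    0x0001: "IS_TEXT_UNICODE_ASCII16",
    0x0002: "IS_TEXT_UNICODE_STATISTICS",
    0x0004: "IS_TEXT_UNICODE_CONTROLS",
    0x0008: "IS_TEXT_UNICODE_SIGNATURE",
    0x0010: "IS_TEXT_UNICODE_REVERSE_ASCII16",
    0x0020: "IS_TEXT_UNICODE_REVERSE_STATISTICS",
    0x0040: "IS_TEXT_UNICODE_REVERSE_CONTROLS",
    0x0080: "IS_TEXT_UNICODE_REVERSE_SIGNATURE",
    0x0100: "IS_TEXT_UNICODE_ILLEGAL_CHARS",
    0x0200: "IS_TEXT_UNICODE_ODD_LENGTH",
    0x0400: "IS_TEXT_UNICODE_DBCS_LEADBYTE",
    0x1000: "IS_TEXT_UNICODE_NULL_BYTES",
}

# Both statistical tests (normal and byte-reversed).
STATISTICAL = 0x0002 | 0x0020
# IS_TEXT_UNICODE_UNICODE_MASK | IS_TEXT_UNICODE_REVERSE_MASK
POSITIVE = 0x000F | 0x00F0


def decode_flags(result):
    """Return the names of all flags set in `result`, lowest bit first."""
    return [name for bit, name in sorted(FLAG_NAMES.items()) if result & bit]


def trust_as_utf16(result):
    """Hypothetical policy: ignore the statistical bits and require
    at least one other positive test before treating text as UTF-16."""
    return bool(result & POSITIVE & ~STATISTICAL)


if __name__ == "__main__":
    print(decode_flags(0x402))
    print(trust_as_utf16(0x402))
```

Under this policy the 0x402 result seen on the Japanese locale no longer counts as Unicode, because the only positive bit set is the statistical one.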
UTF-8 detection is implemented directly in the Notepad2e code.
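For readers unfamiliar with what a locale-independent UTF-8 check looks like, here is a minimal Python sketch. It is not the Notepad2e implementation (which is C code); the function name `looks_like_utf8` is hypothetical, and it simply relies on Python's strict UTF-8 decoder plus a check that at least one non-ASCII byte is present.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if `data` is valid UTF-8 and contains non-ASCII bytes.

    Pure ASCII is also valid UTF-8, so it is excluded here to avoid
    claiming a UTF-8 detection when there is no evidence either way.
    """
    if all(b < 0x80 for b in data):
        return False  # plain ASCII, nothing to detect
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    sample = "日本語 テキスト"
    print(looks_like_utf8(sample.encode("utf-8")))      # valid UTF-8
    print(looks_like_utf8(sample.encode("shift_jis")))  # not valid UTF-8
```

A check like this depends only on the bytes of the file, so it gives the same answer regardless of the system locale.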
Notepad2 ― Encoding Tutorial should help to resolve the issue with Shift-JIS encoding detection.
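As a rough illustration of how Shift-JIS could be fitted into a detection order, here is a Python sketch that checks for a BOM first and then tries a list of candidate encodings with strict decoding. This is an assumption about one possible approach, not how Notepad2 or Notepad2e works; `guess_encoding` and the candidate order are hypothetical, and `cp932` is Python's name for the Windows variant of Shift-JIS.

```python
import codecs


def guess_encoding(data: bytes, candidates=("utf-8", "cp932")):
    """Return the first encoding that accepts `data`, or None.

    A BOM wins outright; otherwise candidates are tried in order with
    strict decoding. UTF-8 goes first because its structure is strict
    enough that other encodings rarely pass it by accident.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    for encoding in candidates:
        try:
            data.decode(encoding, errors="strict")
        except UnicodeDecodeError:
            continue
        return encoding
    return None


if __name__ == "__main__":
    text = "日本語"
    print(guess_encoding(text.encode("utf-8")))
    print(guess_encoding(text.encode("cp932")))
```

Trial decoding like this can still misidentify short or ambiguous files, so it is best treated as a fallback after the stricter checks.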

ProgerXP / Notepad2e

Invalid file charset detection under Japanese locale #270