Open ProgerXP opened 4 years ago
Found this entry in Notepad2 FAQ: http://www.flos-freeware.ch/development-releases/notepad2-FAQs.html#unicode-detection
There's a link to MSDN too. Maybe it is related.
That said, is charset detection performed entirely by the Win32 API, or can it be influenced? I noticed that Shift-JIS is not detected correctly (not at all, under either locale), and this should be fixed if it is not too hard.
Please consider the following:
IS_TEXT_UNICODE_DBCS_LEADBYTE | IS_TEXT_UNICODE_STATISTICS
The IS_TEXT_UNICODE_STATISTICS and IS_TEXT_UNICODE_REVERSE_STATISTICS tests use statistical analysis. These tests are not foolproof. The statistical tests assume certain amounts of variation between low and high bytes in a string, and some ASCII strings can slip through. For example, if lpv indicates the ASCII string 0x41, 0x0A, 0x0D, 0x1D (A\n\r^Z), the string passes the IS_TEXT_UNICODE_STATISTICS test, although failure would be preferable.
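To make the MSDN note easier to picture, here is a minimal Python sketch (not Notepad2 code, and not a reimplementation of IsTextUnicode) that reinterprets the same four bytes as UTF-16LE code units and, for contrast, runs a deterministic strict UTF-8 check. The helper names `utf16le_units` and `is_strict_utf8` are made up for this illustration.

```python
import struct


def utf16le_units(data: bytes) -> list:
    """Reinterpret raw bytes as little-endian 16-bit code units.

    A trailing odd byte, if any, is ignored.
    """
    n = len(data) // 2
    return list(struct.unpack("<%dH" % n, data[: 2 * n]))


def is_strict_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 without any error."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# The byte sequence from the MSDN example quoted above.
sample = bytes([0x41, 0x0A, 0x0D, 0x1D])

# Read as UTF-16LE, the four ASCII bytes become two ordinary-looking
# 16-bit values, so nothing structural marks them as "not Unicode".
print([hex(u) for u in utf16le_units(sample)])  # ['0xa41', '0x1d0d']

# A strict UTF-8 decode either succeeds or fails; it does not guess.
print(is_strict_utf8(sample))
```

Whether a strict UTF-8 check before the statistical test would be acceptable for Notepad2 is an open question. The sketch only shows that such a check is deterministic, while the statistical flags are heuristics.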
This is very similar to #269, and I tested it on both XP and Win10. Under the Japanese locale, Notepad2 thinks the attached file is Unicode-encoded. Under another locale, it correctly detects it as UTF-8. I found that about 10% of similar text files have this problem.
BadCharsetDetection.zip
I tried to locate the symbol or line that causes the mis-detection, but could only determine that individual characters mostly do not matter and that it happens with certain combinations of a character followed by a trailing space. It is often enough to change just one symbol (e.g. replace a multi-byte symbol with a single-byte one) to "fix" the detection.
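In case it helps anyone else narrow down a failing file, the manual search described above can be automated roughly like this. It is only a sketch: `misdetects` is a hypothetical callback that has to be supplied by whoever runs it (for example, something that feeds the bytes to the detector under the Japanese locale and reports whether the result is wrong), and `reduce_lines` is a name invented here.

```python
from typing import Callable, List


def reduce_lines(data: bytes, misdetects: Callable[[bytes], bool]) -> bytes:
    """Greedily drop lines while the mis-detection still reproduces.

    Assumes misdetects(data) is True for the full input. Splitting on
    line endings keeps multi-byte UTF-8 sequences intact.
    """
    lines: List[bytes] = data.splitlines(keepends=True)
    i = 0
    while i < len(lines):
        candidate = lines[:i] + lines[i + 1:]
        if candidate and misdetects(b"".join(candidate)):
            lines = candidate  # line i was not needed; retry the same index
        else:
            i += 1             # line i is needed to reproduce; keep it
    return b"".join(lines)


# Stand-in predicate for demonstration only: pretend the problem is
# triggered by an ideographic full stop followed by a space.
def fake_misdetects(blob: bytes) -> bool:
    return "。 ".encode("utf-8") in blob


text = b"abc\n" + "。 x\n".encode("utf-8") + b"def\n"
print(reduce_lines(text, fake_misdetects))
```

The result is a smaller file that still triggers the problem, not necessarily the smallest possible one. The same idea could be applied per character within the remaining lines.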