BdR76 / CSVLint

CSV Lint plug-in for Notepad++ for syntax highlighting, CSV validation, automatic column and datatype detection, fixed-width datasets, changing datetime formats, decimal separators, sorting data, counting unique values, converting to XML, JSON, SQL, etc. A plugin for data cleaning and working with messy data files.
GNU General Public License v3.0

Highlight error when Windows is using an East Asian "ANSI" code page #50

Closed myonlylonely closed 1 year ago

myonlylonely commented 1 year ago
(screenshot of the broken syntax highlighting)

CSV data for testing:

测试123,测试2,测试,测试填充列
0,1,2,3
4,5,6,7
BdR76 commented 1 year ago

Thanks for the issue report, but I can't reproduce this display error; when I open the data, the syntax highlighting looks good. This issue seems similar to issue #47.

Are you using Windows 10 or Windows 11? And which version of the plug-in are you using?

myonlylonely commented 1 year ago

> Thanks for the issue report, but I can't reproduce this display error; when I open the data, the syntax highlighting looks good. This issue seems similar to issue #47.
>
> Are you using Windows 10 or Windows 11? And which version of the plug-in are you using?

I'm using Windows 10 21H2 (64-bit), Notepad++ 8.4.8 (64-bit), CSVLint 0.4.6.2 (installed from Notepad++ plugin management).

myonlylonely commented 1 year ago

I find that this issue only exists with UTF-8 encoding; ANSI encoding works fine.

ANSI encoding works fine:

(screenshot: ANSI file highlighted correctly)

UTF-8 encoding does not work:

(screenshot: UTF-8 file highlighted incorrectly)
myonlylonely commented 1 year ago

The files I used are attached.
UTF-8 encoding does not work: test-UTF8.csv
ANSI encoding works fine: test-ANSI.csv

Friedi commented 1 year ago

> The files I used are attached.
> UTF-8 encoding does not work: test-UTF8.csv
> ANSI encoding works fine: test-ANSI.csv

I'm using Windows 10 (64bit), Notepad++ 8.4.9 (64bit), CSVLint 0.4.6.3beta. For me highlighting with your files works fine. They look the same. You can try his new beta: https://github.com/BdR76/CSVLint/issues/46#issuecomment-1368141899

@BdR76 if there is no additional reporting for the 0.4.6.3beta, you can publish it and close my findings (or am I supposed to do that?)

myonlylonely commented 1 year ago

The new version still does not work.

(screenshot: highlighting still broken with the beta)

Friedi commented 1 year ago

Have you tried a clean Notepad++ (you can use the zipped one without installing) and without additional plugins? Maybe other plugins or settings interfere. I have no other explanation; it works for me with your samples:

(two screenshots: the sample files highlighted correctly)

myonlylonely commented 1 year ago

> Have you tried a clean Notepad++ (you can use the zipped one without installing) and without additional plugins? Maybe other plugins or settings interfere. I have no other explanation; it works for me with your samples.

Yes, I tried a clean portable (zip) version of Notepad++ with CSVLint 0.4.6.3beta; it still doesn't work as expected.

rdipardo commented 1 year ago

> I tried a clean portable (zip) version of Notepad++ with CSVLint 0.4.6.3beta; it still doesn't work as expected.

@myonlylonely, can you check if your Windows system is using an East Asian "ANSI" code page, i.e., one of these?

In Notepad++, go to ? on the toolbar, then "Debug Info" and look at "Current ANSI codepage".

Or open the Command Prompt and check the Registry. For example, if the system code page is "Simplified Chinese GBK", it will look something like this:

> reg query HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage /s /f "CP"

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage
    AllowDeprecatedCP    REG_DWORD    0x42414421
    ACP    REG_SZ    936
    OEMCP    REG_SZ    437
    MACCP    REG_SZ    936

End of search: 4 match(es) found.

I think the reason other people on this thread can't reproduce the issue is that their Windows systems are using a Western European code page (1252) or the new UTF-8 code page (65001).

@BdR76, if my guess is right, the problem comes from what I did here:

https://github.com/BdR76/CSVLint/blob/a0fd0fc84be4c98e48b51042936e05264cf7a41f/CSVLintNppPlugin/PluginInfrastructure/Lexer.cs#L1009

The lexer falls back to System.Text.Encoding.Default if the OS is not using UTF-8. It never looks at the document's encoding, i.e., never calls PluginBase.CurrentScintillaGateway.GetCodePage().
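To see why that fallback misbehaves on such systems, here is a quick check (a sketch, not plugin code; on .NET Framework, which the plugin targets, Encoding.Default follows the system ANSI code page):

// acp_check.csx (hypothetical scratch script): run with csi on a machine whose ACP is 936.
using System;
using System.Text;

// On .NET Framework, Encoding.Default is the system ANSI code page (GetACP()),
// so on a Chinese-locale PC the lexer ends up decoding a UTF-8 buffer as GBK.
Console.WriteLine(Encoding.Default.CodePage);       // 936 when the ACP is "Simplified Chinese GBK"
Console.WriteLine(Encoding.Default.IsSingleByte);   // False: a double-byte (DBCS) code page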

Apparently you can have a situation where the OS is using a (fixed-width) DBCS code page, but Notepad++ displays the text as multi-byte UTF-8, so the lexer will still skip over bytes as it did before 60709b3.

Hard to test this without a spare machine to experiment with the OS encoding . . . :thinking:

myonlylonely commented 1 year ago

Your guess is right. The debug info:

Current ANSI codepage : 936

It is the default encoding in this language version of Windows.

rdipardo commented 1 year ago

> Current ANSI codepage : 936

Thank you. I can reproduce the bad colorization on the English version of Windows 10 just by setting the "ACP" Registry key to "936". I would also guess that issue #47 has the same root cause.

rdipardo commented 1 year ago

Re. the description East Asian "ANSI" code page

My older post needs a slight correction.

"Simplified Chinese GBK" and friends are known as Double Byte Character Sets or DBCS code pages.

The distinction is crucial because the lexer parses files 1 byte at a time. See, for example, how the "next column" style starts between the CR and LF of a Windows-style EOL:

(screenshot: CSVLint-split-EOL)

Parsing a byte stream works fine with single-byte OS code pages (e.g. Windows-1252) and, crucially, variable-length multi-byte code pages like UTF-8, which was broken before 60709b3.

Fixing this issue would mean changing the lexer to sometimes parse the file in double-byte mode (when the OS uses a DBCS code page), while still using the current byte-by-byte parsing the rest of the time, i.e., 60709b3 was a good improvement; it just didn't anticipate DBCS code pages, which obviously need special treatment.

Scintilla has built-in support for DBCS code pages; it's not clear how this could benefit the plugin's lexer. Changing the document's properties really isn't a lexer's job, and Notepad++ would just override them as soon as the file was reloaded or saved. It's more likely the lexing algorithm needs to be more adaptable to fixed-width double-byte characters.

On the other hand, Windows introduced the UTF-8 code page in version 1903 to signal that DBCS are "legacy" code pages. The plugin's documentation could just say "DBCS code pages are not supported — use UTF-8 instead". A notice like that could go into a pinned "meta" issue to collect duplicate bug reports like #47.

myonlylonely commented 1 year ago

> Fixing this issue would mean changing the lexer to sometimes parse the file in double-byte mode (when the OS uses a DBCS code page), while still using the current byte-by-byte parsing the rest of the time [...]
>
> The plugin's documentation could just say "DBCS code pages are not supported — use UTF-8 instead". A notice like that could go into a pinned "meta" issue to collect duplicate bug reports like #47.

Does that mean this issue will never be fixed?

rdipardo commented 1 year ago

> Does that mean this issue will never be fixed?

It means that supporting DBCS code pages is more of a missing feature than a "bug". When someone has figured out how to do it, it will be fixed. That probably won't be anytime soon. Unfortunately this plugin targets a very old .NET Framework version that's poorly suited for interacting with low-level C++ libraries like Scintilla.

myonlylonely commented 1 year ago

Thank you for the detailed explanation. I guess I have to use VSCode or WebStorm with rainbow plugins, which provide similar highlighting features but have no problem dealing with DBCS.

rdipardo commented 1 year ago

@BdR76, for reference, this article explains what the lexer should be doing with DBCS-encoded text:

> To interpret a DBCS string, an application must start at the beginning of the string and scan forward. It keeps track when it encounters a lead byte in the string, and treats the next byte as the trailing part of the same character. [...] The application cannot just back up one byte to see if the preceding byte is a lead byte, as that byte value might be eligible to be used as both a lead byte and a trail byte. [...] In other words, substring searches are much more complicated with a DBCS than with either SBCSs [Single Byte Character Sets] or Unicode.
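A quick way to see the overlap the article describes (just a sketch; run it with csi like the script further down in this thread):

// gbk_lead_trail.csx (hypothetical scratch script). In code page 936 every byte of
// "测试" lands in 0x81-0xFE, a range that is valid both as a lead byte and as a
// trail byte, so a byte value in isolation cannot tell you which half of a
// character it belongs to; only a forward scan from the start of the line can.
using System;
using System.Linq;
using System.Text;

var gbk = Encoding.GetEncoding(936);
var bytes = gbk.GetBytes("测试");
Console.WriteLine(string.Join(" ", bytes.Select(b => $"{b:X2}")));   // B2 E2 CA D4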

BdR76 commented 1 year ago

@rdipardo has submitted a fix for this issue and I've rebuilt the DLL. @myonlylonely, can you verify that it works now?

You can download the latest development build of the DLL (either x86 or x64), place the DLL in your .\Program Files\Notepad++\plugins\CSVLint\ folder, then restart Notepad++ to test it.

myonlylonely commented 1 year ago

@BdR76 Unfortunately, it still doesn't work. The result remains the same. UTF-8 encoded files don't work.

(screenshot: UTF-8 file highlighted incorrectly)

ANSI encoded files work.

(screenshot: ANSI file highlighted correctly)
rdipardo commented 1 year ago

> Unfortunately, it still doesn't work.

That is to be expected. The OP in #52 most likely has a PC set to Windows-1252, the default ANSI code page for English and European locales.

To recap, 1252 is a single-byte encoding; that includes even the high ordinals where European vowels are mapped:

> python3 -c "print('é'.encode('cp1252'))"

b'\xe9'

East Asian ANSI code pages like 936 are double-byte:

> python3 -c "print('é'.encode('936'))"

b'\xa8\xa6'

This will continue to be an issue until the lexer knows how to properly segment double-byte characters, which will probably involve some usage of Scintilla's IsDBCSLeadByte API method, or a Win32 equivalent such as IsDBCSLeadByteEx.
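For illustration, here is a rough sketch of that kind of segmentation using the Win32 call named above (this is not the plugin's lexer, just the walking pattern it would need):

// DbcsWalk.cs (illustrative only): walk a code page 936 byte buffer one character
// at a time, consuming the trail byte whenever the current byte is a DBCS lead
// byte, so delimiters are only matched against whole single-byte characters.
using System;
using System.Runtime.InteropServices;
using System.Text;

class DbcsWalk
{
    [DllImport("kernel32.dll")]
    static extern bool IsDBCSLeadByteEx(uint codePage, byte testChar);

    static void Main()
    {
        const uint cp = 936;   // Simplified Chinese GBK
        byte[] line = Encoding.GetEncoding((int)cp).GetBytes("测试123,测试2,测试,测试填充列");

        int chars = 0, commas = 0;
        for (int i = 0; i < line.Length; i++)
        {
            if (IsDBCSLeadByteEx(cp, line[i]))
            {
                i++;                            // consume the trail byte of this character
            }
            else if (line[i] == (byte)',')
            {
                commas++;                       // the delimiter is a whole single-byte character
            }
            chars++;
        }
        Console.WriteLine($"{line.Length} bytes, {chars} characters, {commas} delimiters");
        // prints: 29 bytes, 18 characters, 3 delimiters
    }
}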

rdipardo commented 1 year ago

@myonlylonely, have you tried the Notepad2 editor? It's got its own CSV lexer built in:

(screenshot: csv-color)

myonlylonely commented 1 year ago

> @myonlylonely, have you tried the Notepad2 editor? It's got its own CSV lexer built in.

Yes, the same file works on Notepad2.

(screenshot: test-ANSI.csv highlighted in Notepad2)

rdipardo commented 1 year ago

> Yes, the same file works on Notepad2.

The screen capture shows test-ANSI.csv, but the problem is with the UTF-8 file. Does Notepad2 get the columns right when the data is saved in UTF-8 format?

myonlylonely commented 1 year ago

> The screen capture shows test-ANSI.csv, but the problem is with the UTF-8 file. Does Notepad2 get the columns right when the data is saved in UTF-8 format?

Yes, the UTF-8 encoded file also works in Notepad2.

(screenshot: test-UTF8.csv highlighted in Notepad2)

BdR76 commented 1 year ago

> Does that mean this issue will never be fixed?

I can reproduce the error and I've been trying to fix this, but it's not as easy as I thought. I've been delaying the next release of the plug-in, hoping it could include a fix for this issue. But in the meantime there have also been a lot of other updates and bugfixes, so maybe I'll make a new release anyway.

Just know that I want to fix this issue. I've also posted a question on the Notepad++ dev community; hopefully that will lead to some new insights.

BdR76 commented 1 year ago

@myonlylonely I think I found the fix to make the CSV Lint plug-in work correctly regardless of the OS language settings (code page 936 etc.); can you try the development DLL again?

You can download the latest development build of the DLL (either x86 or x64) which has version 4.6.3β6

rdipardo commented 1 year ago

> You can download the latest development build of the DLL (either x86 or x64) which has version 4.6.3β6

I've tested it, and it solves the issue with no regressions from before 60709b3 that I can find. 👍🏼

Here's what I missed.

Notepad++ encodes the UTF-8 file as UTF-8, regardless of the system's ANSI code page. The trick was to follow the editor's encoding, not the system's — i.e., use SCI_GETCODEPAGE, not GetACP() to determine the buffer's encoding.
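For reference, the idea reduces to something like this (a sketch, not the actual patch; GetCodePage() is the gateway wrapper for SCI_GETCODEPAGE mentioned earlier):

// Sketch only: pick the decoder from the buffer's code page instead of the OS ACP.
int sciCodePage = PluginBase.CurrentScintillaGateway.GetCodePage();   // wraps SCI_GETCODEPAGE
Encoding bufferEncoding;
if (sciCodePage == 65001)
    bufferEncoding = Encoding.UTF8;                       // UTF-8 buffer, whatever the system ACP is
else if (sciCodePage > 0)
    bufferEncoding = Encoding.GetEncoding(sciCodePage);   // ANSI buffer; mirrors GetACP(), e.g. 936 (see below)
else
    bufferEncoding = Encoding.Default;                    // defensive fallback for a plain 8-bit buffer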

You can see how starkly different the UTF-8 encoding is from DBCS by running this C# script:

// dbcs_encode.csx
using System;
using System.Linq;
using System.Text;
using static System.Console;

int cp = 936;
var s = $"{(char)0xef}{(char)0xbb}{(char)0xbf}测试123,测试2,测试,测试填充列\r\n";
var dbcs = Encoding.GetEncoding(cp);
var utf8Bytes = Encoding.UTF8.GetBytes(s);
var dbcsBytes = dbcs.GetBytes(s);

Func<byte, string> byteToString = b => {
    if (b == 0xd) return "CR";
    else if (b == 0xa) return "LF";
    else return $"{b:X2}";
  };

var asUTF8Bytes = String.Join(" ", utf8Bytes.Select(b => byteToString(b)));
var asDBCSBytes = String.Join(" ", dbcsBytes.Select(b => byteToString(b)));

WriteLine();
WriteLine("As UTF-8:");
WriteLine(asUTF8Bytes);
WriteLine();
WriteLine($"As {dbcs.EncodingName}:");
WriteLine(asDBCSBytes);

> csi dbcs_encode.csx

As UTF-8:
C3 AF C2 BB C2 BF E6 B5 8B E8 AF 95 31 32 33 2C E6 B5 8B E8 AF 95 32 2C E6 B5 8B E8 AF 95 2C E6 B5 8B E8 AF 95 E5 A1 AB E5 85 85 E5 88 97 CR LF

As Chinese Simplified (GB2312):
3F 3F 3F B2 E2 CA D4 31 32 33 2C B2 E2 CA D4 32 2C B2 E2 CA D4 2C B2 E2 CA D4 CC EE B3 E4 C1 D0 CR LF

When — and only when — the file is saved as ANSI, the editor's encoding will match the system's and use the DBCS encoding.

This is true even when we call SCI_GETCODEPAGE: for some reason, it falls back to the same value as GetACP() whenever the buffer is not UTF-8. Even if the status bar says "ANSI", SCI_GETCODEPAGE will say 936 if that's what the system is using.

By the same token, if the system is using the new UTF-8 code page, SCI_GETCODEPAGE returns 65001 even when the file is really saved as ANSI.

(screenshot: ansi-buffer-as-utf8)

This was always the case, though; it may be a separate issue, but it's not a regression.

myonlylonely commented 1 year ago

> @myonlylonely I think I found the fix to make the CSV Lint plug-in work correctly regardless of the OS language settings (code page 936 etc.); can you try the development DLL again?
>
> You can download the latest development build of the DLL (either x86 or x64) which has version 4.6.3β6

That's great! 👍🏼 I have confirmed that the development build works great!

BdR76 commented 1 year ago

@rdipardo Thanks for clarifying, text encoding can be a tricky subject. I think the plug-in still has an issue with converting ANSI files in some cases (when sorting, reformatting etc.), but I'm glad the syntax highlighting and code page handling is fixed now.

@myonlylonely Thanks for confirming, I'll prepare the new release of the plug-in.