Closed myonlylonely closed 1 year ago
Thanks for the issue report but I can't reproduce this display error, when I open the data the syntax highlighting looks good. This issue seems similar to issue #47
Are you using Windows 10 or Windows 11? And which version of the plug-in are you using?
I'm using Windows 10 21H2 (64bit), Notepad++ 8.4.8 (64bit), CSVLint 0.4.6.2 (installed from Notepad++ plugin management).
I find that this issue only occurs with UTF-8 encoding.
ANSI encoding works fine:
UTF8 encoding does not work:
The files I used are attached. UTF-8 encoding does not work: test-UTF8.csv. ANSI encoding works fine: test-ANSI.csv.
I'm using Windows 10 (64bit), Notepad++ 8.4.9 (64bit), CSVLint 0.4.6.3beta. For me, highlighting with your files works fine; they look the same. You can try this new beta: https://github.com/BdR76/CSVLint/issues/46#issuecomment-1368141899
@BdR76 if there is no additional reporting for the 0.4.6.3beta, you can publish it and close my findings (or am I supposed to do that?)
The new version still does not work.
Have you tried a clean Notepad++ (you can use the zipped one without installing) without additional plugins? Maybe other plugins or settings interfere. I have no other explanation; it works for me with your samples.
Yes, I tried a clean portable (zip) version of Notepad++ with the CSVLint 0.4.6.3beta; it still doesn't work as expected.
@myonlylonely, can you check if your Windows system is using an East Asian "ANSI" code page, i.e., one of these?
In Notepad++, open the "?" menu, then "Debug Info", and look at "Current ANSI codepage".
Or open the Command Prompt and check the Registry. For example, if the system code page is "Simplified Chinese GBK", it will look something like this:
```
> reg query HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage /s /f "CP"

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage
    AllowDeprecatedCP    REG_DWORD    0x42414421
    ACP                  REG_SZ       936
    OEMCP                REG_SZ       437
    MACCP                REG_SZ       936

End of search: 4 match(es) found.
```
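For reference, the `ACP` number maps onto a named encoding; Python's codec registry can resolve it (a quick cross-check, not part of the original report):

```python
import codecs

# cp936 is registered under the canonical name "gbk",
# i.e. "Simplified Chinese GBK" from the Registry dump above
print(codecs.lookup('cp936').name)   # gbk
print(codecs.lookup('cp1252').name)  # cp1252
```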
I think the reason other people on this thread can't reproduce the issue is that their Windows systems are using a Western European code page (1252) or the new UTF-8 code page (65001).
@BdR76, if my guess is right, the problem comes from what I did here:
The lexer falls back to `System.Text.Encoding.Default` if the OS is not using UTF-8. It never looks at the document's encoding, i.e., never calls `PluginBase.CurrentScintillaGateway.GetCodePage()`.
Apparently you can have a situation where the OS is using a (fixed-width) DBCS code page, but Notepad++ displays the text as multi-byte UTF-8, so the lexer will still skip over bytes as it did before 60709b3.
Hard to test this without a spare machine to experiment with the OS encoding . . . :thinking:
Your guess is right. The debug info:

```
Current ANSI codepage : 936
```

It is the default encoding in this language version of Windows.
Thank you. I can reproduce the bad colorization on the English version of Windows 10 just by setting the "ACP" Registry key to "936". I would also guess that issue #47 has the same root cause.
Re. the description East Asian "ANSI" code page
My older post needs a slight correction.
"Simplified Chinese GBK" and friends are known as Double Byte Character Sets or DBCS code pages.
The distinction is crucial because the lexer parses files 1 byte at a time. See, for example, how the "next column" style starts between the `CR` and `LF` of a Windows-style EOL:
Parsing a byte stream works fine with single-byte OS code pages (e.g. Windows-1252) and, crucially, variable-length multi-byte code pages like UTF-8, which was broken before 60709b3.
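The mismatch is easy to quantify: the same text occupies a different number of bytes per character in each encoding, so byte-based style offsets computed with the wrong encoding drift off target. A quick Python illustration, using text from the attached sample:

```python
# Byte lengths of the same two characters under different encodings
s = '测试'
n_chars = len(s)                  # 2 characters
n_utf8 = len(s.encode('utf-8'))   # 3 bytes per character in UTF-8
n_gbk = len(s.encode('gbk'))      # 2 bytes per character in GBK (DBCS)
print(n_chars, n_utf8, n_gbk)     # 2 6 4
```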
Fixing this issue would mean changing the lexer to parse the file in double-byte mode when the OS uses a DBCS code page, while keeping the current byte-by-byte parsing the rest of the time. In other words, 60709b3 was a good improvement; it just didn't anticipate DBCS code pages, which obviously need special treatment.
Scintilla has built-in support for DBCS code pages; it's not clear how this could benefit the plugin's lexer. Changing the document's properties really isn't a lexer's job, and Notepad++ would just override them as soon as the file was reloaded or saved. It's more likely the lexing algorithm needs to be more adaptable to fixed-width double-byte characters.
On the other hand, Windows introduced the UTF-8 code page in version 1903 to signal that DBCS are "legacy" code pages. The plugin's documentation could just say "DBCS code pages are not supported — use UTF-8 instead". A notice like that could go into a pinned "meta" issue to collect duplicate bug reports like #47.
Does that mean this issue will never be fixed?
It means that supporting DBCS code pages is more of a missing feature than a "bug". When someone has figured out how to do it, then it will be fixed. That probably won't be anytime soon. Unfortunately, this plugin targets a very old .NET Framework version that's poorly suited for interacting with low-level C++ libraries like Scintilla.
Thank you for the detailed explanation. I guess I have to use VSCode or WebStorm with rainbow plugins, which provide similar highlighting features but have no problem dealing with DBCS.
@BdR76, for reference, this article explains what the lexer should be doing with DBCS-encoded text:
To interpret a DBCS string, an application must start at the beginning of the string and scan forward. It keeps track when it encounters a lead byte in the string, and treats the next byte as the trailing part of the same character. [...] The application cannot just back up one byte to see if the preceding byte is a lead byte, as that byte value might be eligible to be used as both a lead byte and a trail byte. [...] In other words, substring searches are much more complicated with a DBCS than with either SBCSs [Single Byte Character Sets] or Unicode.
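The forward scan described above can be sketched in a few lines. This is only an illustrative Python model (the plugin itself is C#), with GBK's lead-byte range of 0x81-0xFE hard-coded as an assumption:

```python
def split_dbcs_gbk(data: bytes) -> list:
    """Split GBK-encoded bytes into characters by scanning forward,
    treating the byte after each lead byte (0x81-0xFE) as its trail byte."""
    chars = []
    i = 0
    while i < len(data):
        if 0x81 <= data[i] <= 0xFE and i + 1 < len(data):
            chars.append(data[i:i + 2])  # lead + trail = one DBCS character
            i += 2
        else:
            chars.append(data[i:i + 1])  # single-byte (ASCII) character
            i += 1
    return chars

cells = split_dbcs_gbk('测试,123'.encode('gbk'))
print(cells)  # [b'\xb2\xe2', b'\xca\xd4', b',', b'1', b'2', b'3']
```

The comma delimiter is only found reliably because the scan never backs up; a lexer that inspects bytes in isolation could land inside a double-byte character.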
@rdipardo has submitted a fix for this issue, and I've rebuilt the DLL, so @myonlylonely can you verify that it works now?
You can download the latest development build of the DLL (either x86 or x64), place the DLL in your `.\Program Files\Notepad++\plugins\CSVLint\` folder, then restart Notepad++ to test it.
@BdR76 Unfortunately, it still doesn't work. The result remains the same: UTF-8 encoded files don't work; ANSI encoded files work.
> Unfortunately, it still doesn't work.
That is to be expected. The OP in #52 most likely has a PC set to Windows 1252, the default ANSI code page for PCs in English and European locales.
To recap, `1252` is a single-byte encoding; that includes even the high ordinals where European vowels are mapped:

```
> python3 -c "print('é'.encode('cp1252'))"
b'\xe9'
```

East Asian ANSI code pages like `936` are double-byte:

```
> python3 -c "print('é'.encode('936'))"
b'\xa8\xa6'
```

This will continue to be an issue until the lexer knows how to properly segment double-byte characters, which will probably involve some usage of Scintilla's `IsDBCSLeadByte` API method, or a Win32 equivalent such as `IsDBCSLeadByteEx`.
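The same point in Python: both bytes of the GBK-encoded character fall inside the lead-byte range, so a byte inspected in isolation is ambiguous, and only a forward scan (what `IsDBCSLeadByte` enables) can classify it. The helper below is a hypothetical stand-in, with GBK's 0x81-0xFE lead-byte range assumed:

```python
single = 'é'.encode('cp1252')  # one byte in a single-byte code page
double = 'é'.encode('cp936')   # two bytes in a DBCS code page

def looks_like_gbk_lead(b: int) -> bool:
    """Rough stand-in for Win32 IsDBCSLeadByteEx(936, b)."""
    return 0x81 <= b <= 0xFE

print(single, double)                            # b'\xe9' b'\xa8\xa6'
print([looks_like_gbk_lead(b) for b in double])  # [True, True]
```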
@myonlylonely, have you tried the Notepad2 editor? It's got its own CSV lexer built in:
Yes, the same file works on Notepad2.
The screen capture shows `test-ANSI.csv`, but the problem is with the UTF-8 file. Does Notepad2 get the columns right when the data is saved in UTF-8 format?
Yes, the UTF-8 encoded file also works on Notepad2.
> Does that mean this issue will never be fixed?
I can reproduce the error and I've been trying to fix this, but it's not as easy as I thought. I've been delaying the next release of the plug-in, hoping it could include a fix for this issue. But in the meantime there have also been a lot of other updates and bugfixes, so maybe I'll make a new release anyway.
Just know that I want to fix this issue. I've also posted a question on the Notepad++ dev community; hopefully that will lead to some new insights.
@myonlylonely I think I found the fix to make the CSV Lint plug-in work correctly regardless of the OS language settings (code page 936 etc), can you try the development DLL again?
You can download the latest development build of the DLL (either x86 or x64) which has version 4.6.3β6
I've tested and it solves the issue, with no regressions from before 60709b3 that I can find. 👍🏼
Here's what I missed.
Notepad++ encodes the UTF-8 file as UTF-8, regardless of the system's ANSI code page.
The trick was to follow the editor's encoding, not the system's — i.e., use `SCI_GETCODEPAGE`, not `GetACP()`, to determine the buffer's encoding.
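In pseudo-Python, the decision the fix makes might look like this (a sketch with a hypothetical `pick_encoding` helper, not the plugin's actual code):

```python
SC_CP_UTF8 = 65001  # Scintilla's code page value for UTF-8 buffers

def pick_encoding(editor_code_page: int) -> str:
    """Choose the decoder from the editor's code page (SCI_GETCODEPAGE),
    never from the OS default (GetACP)."""
    if editor_code_page == SC_CP_UTF8:
        return 'utf-8'
    return 'cp%d' % editor_code_page  # e.g. cp936, cp1252

print(pick_encoding(65001))  # utf-8
print(pick_encoding(936))    # cp936
```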
You can see how starkly different the UTF-8 encoding is from DBCS by running this C# script:

```csharp
// dbcs_encode.csx
using System;
using System.Linq;
using System.Text;
using static System.Console;

int cp = 936;
var s = $"{(char)0xef}{(char)0xbb}{(char)0xbf}测试123,测试2,测试,测试填充列\r\n";
var dbcs = Encoding.GetEncoding(cp);
var utf8Bytes = Encoding.UTF8.GetBytes(s);
var dbcsBytes = dbcs.GetBytes(s);

Func<byte, string> byteToString = b => {
    if (b == 0xd) return "CR";
    else if (b == 0xa) return "LF";
    else return $"{b:X2}";
};

var asUTF8Bytes = String.Join(" ", utf8Bytes.Select(b => byteToString(b)));
var asDBCSBytes = String.Join(" ", dbcsBytes.Select(b => byteToString(b)));

WriteLine();
WriteLine("As UTF-8:");
WriteLine(asUTF8Bytes);
WriteLine();
WriteLine($"As {dbcs.EncodingName}:");
WriteLine(asDBCSBytes);
```

```
> csi dbcs_encode.csx

As UTF-8:
C3 AF C2 BB C2 BF E6 B5 8B E8 AF 95 31 32 33 2C E6 B5 8B E8 AF 95 32 2C E6 B5 8B E8 AF 95 2C E6 B5 8B E8 AF 95 E5 A1 AB E5 85 85 E5 88 97 CR LF

As Chinese Simplified (GB2312):
3F 3F 3F B2 E2 CA D4 31 32 33 2C B2 E2 CA D4 32 2C B2 E2 CA D4 2C B2 E2 CA D4 CC EE B3 E4 C1 D0 CR LF
```
When — and only when — the file is saved as ANSI, the editor's encoding will match the system's and use the DBCS encoding. This is true even if we call `SCI_GETCODEPAGE`; for some reason, it falls back to the same value as `GetACP()` whenever the buffer is not UTF-8. Even if the status bar says "ANSI", `SCI_GETCODEPAGE` will report `936`, if that's what the system is using.

By the same token, if the system is using the new UTF-8 code page, `SCI_GETCODEPAGE` returns `65001` even when the file is really saved as ANSI. This was always the case, though; it may be a separate issue, but it's not a regression.
> @myonlylonely I think I found the fix to make the CSV Lint plug-in work correctly regardless of the OS language settings (code page 936 etc), can you try the development DLL again?
>
> You can download the latest development build of the DLL (either x86 or x64) which has version 4.6.3β6
That's great! 👍🏼 I have confirmed that the development build works great!
@rdipardo Thanks for clarifying, text encoding can be a tricky subject. I think the plug-in still has an issue with converting ANSI files in some cases (when sorting, reformatting etc.), but I'm glad the syntax highlighting and code page handling are fixed now.
@myonlylonely Thanks for confirming, I'll prepare the new release of the plug-in.
CSV data for testing: