ScintillaOrg / lexilla

A library of language lexers for use with Scintilla
https://www.scintilla.org/Lexilla.html

[Ruby] Correct DBCS character handling #233

Closed zufuliu closed 3 months ago

zufuliu commented 3 months ago

The following is highlighted differently in UTF-8 and DBCS code pages:

# encoding: utf-8
class A
    def 中? = false
    def 中! = false
    def 中=(value)
    end
end
# encoding: gbk
class A
    def 中? = false
    def 中! = false
    def 中=(value)
    end
end

https://docs.ruby-lang.org/en/master/syntax/methods_rdoc.html#label-Method+Names

Method names may be one of the operators or must start a letter or a character with the eighth bit set. It may contain letters, numbers, an _ (underscore or low line) or a character with the eighth bit set. The convention is to use underscores to separate words in a multiword method name:
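To make the quoted rule concrete, here is a minimal byte-level sketch. The helper names (HasEighthBit, IsNameStart, IsNameChar, IsMethodName, CheckMethodNames) are hypothetical and are not Lexilla functions, and accepting a leading underscore is an assumption beyond the quoted sentence. It only shows that 中? passes the same test whether the bytes are UTF-8 or GBK, which is why both code pages would be expected to highlight alike.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helpers for illustration; these are not Lexilla functions.
constexpr bool HasEighthBit(unsigned char ch) noexcept {
    return ch >= 0x80;
}

constexpr bool IsNameStart(unsigned char ch) noexcept {
    // Letter or high-bit byte per the quoted rule; '_' is added here as an
    // assumption, since Ruby also accepts names such as _helper.
    return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') ||
           ch == '_' || HasEighthBit(ch);
}

constexpr bool IsNameChar(unsigned char ch) noexcept {
    return IsNameStart(ch) || (ch >= '0' && ch <= '9');
}

// Byte-wise check of a non-operator method name, allowing one trailing
// '?', '!' or '=' as in the samples above.
bool IsMethodName(const std::string &name) {
    if (name.empty() || !IsNameStart(static_cast<unsigned char>(name[0]))) {
        return false;
    }
    std::size_t end = name.size();
    const char last = name[end - 1];
    if (end > 1 && (last == '?' || last == '!' || last == '=')) {
        end--; // the suffix belongs to the name but is not a name character
    }
    for (std::size_t i = 1; i < end; i++) {
        if (!IsNameChar(static_cast<unsigned char>(name[i]))) {
            return false;
        }
    }
    return true;
}

void CheckMethodNames() {
    assert(IsMethodName("\xE4\xB8\xAD?")); // 中? encoded as UTF-8
    assert(IsMethodName("\xD6\xD0?"));     // 中? encoded as GBK
    assert(IsMethodName("\xD6\xD0="));     // 中= setter form, GBK
    assert(!IsMethodName("1st"));          // a digit cannot start a name
}
```

Calling CheckMethodNames() exercises both encodings of the sample method name.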

Patch to fix the bug: runy-dbcs-0402.zip

diff --git a/lexers/LexRuby.cxx b/lexers/LexRuby.cxx
index d4bf314c..5ecf745f 100644
--- a/lexers/LexRuby.cxx
+++ b/lexers/LexRuby.cxx
@@ -835,8 +835,12 @@ void ColouriseRbDoc(Sci_PositionU startPos, Sci_Position length, int initStyle,
         char chNext2 = styler.SafeGetCharAt(i + 2);

         if (styler.IsLeadByte(ch)) {
+            if (state == SCE_RB_DEFAULT) {
+                styler.ColourTo(i - 1, state);
+                state = SCE_RB_WORD;
+            }
             chNext = chNext2;
-            chPrev = ' ';
+            chPrev = ch;
             i += 1;
             continue;
         }

chPrev = ch; is required for the later isSafeWordcharOrHigh(chPrev) test. It is also the more reasonable value, since it records that the previous character is non-ASCII rather than a space.
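The effect of that one-line change can be sketched with a toy scanner. This is not the ColouriseRbDoc loop: IsGbkLeadByte, IsWordCharOrHigh and SuffixJoinsName are hypothetical stand-ins, and the lead-byte range 0x81 to 0xFE is assumed for GBK. The sketch only shows how the value stored in chPrev after a double-byte character decides whether a following ? or ! is treated as part of the name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-ins for illustration; not Lexilla functions.
constexpr bool IsGbkLeadByte(unsigned char ch) noexcept {
    return ch >= 0x81 && ch <= 0xFE; // assumed GBK lead-byte range
}

constexpr bool IsWordCharOrHigh(unsigned char ch) noexcept {
    return (ch >= 'a' && ch <= 'z') || (ch >= 'A' && ch <= 'Z') ||
           (ch >= '0' && ch <= '9') || ch == '_' || ch >= 0x80;
}

// Returns true when the first '?' or '!' in text directly follows a word
// character, i.e. when it would be joined to the preceding name.
// keepLeadByte selects the patched behaviour (chPrev = ch) over the
// original one (chPrev = ' ').
bool SuffixJoinsName(const std::string &text, bool keepLeadByte) {
    unsigned char chPrev = ' ';
    for (std::size_t i = 0; i < text.size(); i++) {
        const unsigned char ch = static_cast<unsigned char>(text[i]);
        if (IsGbkLeadByte(ch)) {
            chPrev = keepLeadByte ? ch : ' ';
            i += 1; // skip the trail byte, as the lexer loop does
            continue;
        }
        if (ch == '?' || ch == '!') {
            return IsWordCharOrHigh(chPrev);
        }
        chPrev = ch;
    }
    return false;
}

void CheckPrevCharHandling() {
    const std::string gbkName = "\xD6\xD0?";      // 中? encoded as GBK
    assert(!SuffixJoinsName(gbkName, false));     // chPrev = ' ' drops the '?'
    assert(SuffixJoinsName(gbkName, true));       // chPrev = ch keeps it
    assert(SuffixJoinsName("ok?", false));        // ASCII names are unaffected
}
```

With chPrev = ' ', the scanner forgets that a name character was just consumed, so the suffix is split off; with chPrev = ch, the GBK input behaves like the ASCII one.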

It might not be worth the complexity to fix this bug (no one has reported it). https://docs.ruby-lang.org/en/2.7.0/Encoding.html#class-Encoding-label-Script+encoding says:

The default script encoding is Encoding::UTF-8 after v2.0.

The latest doc at https://docs.ruby-lang.org/en/master/encodings_rdoc.html#label-Script+Encoding doesn't mention in which version the default changed to UTF-8.

A similar DBCS character handling pattern (chPrev = ' ') was copied (from LexHTML.cxx?) into other lexers, so they may have similar bugs. For example, PHP also treats non-ASCII bytes as identifier characters, so the uses of chPrev and chPrev2 inside LexHTML.cxx may need an extra check. https://www.php.net/manual/en/language.variables.basics.php

Note: For our purposes here, a letter is a-z, A-Z, and the bytes from 128 through 255 (0x80-0xff).

zufuliu commented 3 months ago

I think the fix can be delayed until a bug shows up in real code; it also does not fix DBCS heredoc delimiters. Code that uses DBCS characters (instead of ASCII or UTF-8) as identifiers is non-portable.

# encoding: gbk
puts <<中
#{1+2}
中

zufuliu commented 3 months ago

Closing this as won't fix. Moving the styler.IsLeadByte() check to the end of the for loop (before chPrev = ch;) seems like it would fix both problems, but it's hard to test because advance_char(), redo_char() and InterpolateVariable() change ch, chNext and chNext2.