ScintillaOrg / lexilla

A library of language lexers for use with Scintilla
https://www.scintilla.org/Lexilla.html

Protect against incorrect case in keyword lists #259

Closed nyamatongwe closed 3 months ago

nyamatongwe commented 3 months ago

Some languages are case-insensitive, treating keywords like 'if' and 'IF' as equivalent. This is implemented by lower-casing words found in the document and checking if they are in a keyword list. This is faster than performing a case-insensitive search in a keyword list.

Keyword lists are often defined or modified by a user and it is easy to incorrectly add an upper-case keyword. These will never match a lower-cased word so will not highlight correctly.
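
A minimal sketch of the lookup described above, and of how it fails when a list contains an upper-case entry. `LowerCaseASCII` and `IsKeyword` are hypothetical names, and a `std::set` stands in for the keyword list; this is not Lexilla's `WordList`.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <string_view>

// Hypothetical helper: ASCII-only lower-casing of a word.
std::string LowerCaseASCII(std::string_view word) {
    std::string result(word);
    std::transform(result.begin(), result.end(), result.begin(), [](unsigned char ch) {
        return (ch >= 'A' && ch <= 'Z') ? static_cast<char>(ch - 'A' + 'a')
                                        : static_cast<char>(ch);
    });
    return result;
}

// Hypothetical stand-in for a keyword list lookup: the word from the
// document is lower-cased once, then matched exactly (case-sensitively).
bool IsKeyword(const std::set<std::string> &keywords, std::string_view wordInDocument) {
    return keywords.count(LowerCaseASCII(wordInDocument)) > 0;
}
```

With the list `{"if", "then"}`, both `if` and `IF` in the document match. With a list that was mistakenly entered as `{"IF"}`, neither matches, because the lower-cased document word `if` is compared against the upper-case entry.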

This problem could be avoided by lower-casing case-insensitive keyword lists. This would be implemented in variants of WordList::Set and SubStyles::SetIdentifiers or with an optional argument to the existing methods.
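
As an illustration of the "optional argument" variant, here is a simplified sketch. `SimpleWordList` is a hypothetical class; the real `WordList::Set` and `SubStyles::SetIdentifiers` have their own signatures, which are not reproduced here.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical simplified keyword list, for illustration only.
class SimpleWordList {
    std::vector<std::string> words;
public:
    // Split the text on whitespace, optionally lower-casing each word
    // (ASCII only), then sort so lookups can use binary search.
    void Set(const std::string &text, bool lowerCase = false) {
        words.clear();
        std::istringstream stream(text);
        std::string word;
        while (stream >> word) {
            if (lowerCase) {
                for (char &ch : word) {
                    if (ch >= 'A' && ch <= 'Z') {
                        ch = static_cast<char>(ch - 'A' + 'a');
                    }
                }
            }
            words.push_back(word);
        }
        std::sort(words.begin(), words.end());
    }

    // Exact, case-sensitive lookup.
    bool InList(const std::string &word) const {
        return std::binary_search(words.begin(), words.end(), word);
    }
};
```

The default of `false` keeps existing callers unchanged, while a case-insensitive lexer can pass `true` so that a user-supplied `IF` is stored as `if`.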

zufuliu commented 3 months ago

I'm using https://github.com/zufuliu/notepad4/blob/main/scintilla/lexlib/WordList.h#L17

    enum KeywordAttr {
        KeywordAttr_Default = 0,
        KeywordAttr_MakeLower = 1,
        KeywordAttr_PreSorted = 2,
    };

implemented at https://github.com/zufuliu/notepad4/blob/main/scintilla/lexlib/WordList.cxx#L127 and https://github.com/zufuliu/notepad4/blob/main/scintilla/lexlib/WordList.cxx#L139

zufuliu commented 3 months ago

Message::SetKeyWords can be changed to use the lower 8 bits (or 16 bits) of wParam to store the keyword list index and the remaining bits to store attributes. https://github.com/zufuliu/notepad4/blob/main/scintilla/src/ScintillaBase.cxx#L1120

    case Message::SetKeyWords:
        // Low 8 bits of wParam: keyword list index; remaining bits: KeywordAttr flags.
        DocumentLexState()->SetWordList(wParam & 0xff,
            static_cast<int>(wParam >> 8), ConstCharPtrFromSPtr(lParam));
        break;
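
From the application side, the packing could look roughly like this. The helper names are hypothetical and are not part of Scintilla or notepad4; only the 8-bit split mirrors the snippet above.

```cpp
#include <cstdint>

// Hypothetical helper: put the keyword list index in the low 8 bits
// and the attribute flags in the remaining bits.
constexpr std::uintptr_t PackKeywordParam(unsigned index, unsigned attributes) {
    return (static_cast<std::uintptr_t>(attributes) << 8) | (index & 0xffu);
}

// Hypothetical helpers mirroring the unpacking done in the case above.
constexpr unsigned KeywordIndex(std::uintptr_t wParam) {
    return static_cast<unsigned>(wParam & 0xff);
}

constexpr int KeywordAttributes(std::uintptr_t wParam) {
    return static_cast<int>(wParam >> 8);
}
```

A caller that passes a plain index below 256 leaves the attribute bits at 0, which corresponds to KeywordAttr_Default in the enum quoted earlier.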

nyamatongwe commented 3 months ago

By specifying case correction in the API, applications need to know which lexers and keyword lists are case-insensitive. It's better to have lexers fix the case since they know when that is needed.

zufuliu commented 3 months ago

It's better to have lexers fix the case since they know when that is needed.

This only works for newer lexers (those directly inherited from DefaultLexer or ILexer5).

nyamatongwe commented 3 months ago

@zufuliu This only works for newer lexers

Older lexers could be provided with a WordList::ConvertToLowerCase to convert an existing WordList to lower case (using a member variable to avoid repeating the conversion on each lex).
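
One possible reading of that idea, as a sketch only: `SketchWordList` is a hypothetical class, and placing the flag inside the list (rather than in the lexer) is an assumption, not the committed design.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical word list that can be lower-cased in place after construction.
class SketchWordList {
    std::vector<std::string> words;
    bool lowered = false;  // remembers that the conversion has already run
public:
    explicit SketchWordList(std::vector<std::string> initial) : words(std::move(initial)) {
        std::sort(words.begin(), words.end());
    }

    // Lower-case every word (ASCII only) and re-sort.
    // Returns true if work was done, false if the list was already converted.
    bool ConvertToLowerCase() {
        if (lowered) {
            return false;
        }
        for (std::string &word : words) {
            for (char &ch : word) {
                if (ch >= 'A' && ch <= 'Z') {
                    ch = static_cast<char>(ch - 'A' + 'a');
                }
            }
        }
        std::sort(words.begin(), words.end());
        lowered = true;
        return true;
    }

    // Exact, case-sensitive lookup.
    bool InList(const std::string &word) const {
        return std::binary_search(words.begin(), words.end(), word);
    }
};
```

An older function-style lexer could call `ConvertToLowerCase()` at the start of every lex; only the first call pays for the conversion. A real implementation would also need to clear the flag whenever the list is set again.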

However, the most commonly used lexers are object lexers. Case standardization can be implemented incrementally in each lexer over time.

zufuliu commented 3 months ago

OK, most old-style lexers have not been maintained or updated in recent years.

nyamatongwe commented 3 months ago

Library support for this has been committed, along with its use in the HTML lexer. The commit also includes test cases for HTML.