ScintillaOrg / lexilla

A library of language lexers for use with Scintilla
https://www.scintilla.org/Lexilla.html
Other
179 stars 64 forks source link

Invalid (X/HT)ML entity at line end causes "different styles between \r and \n" #192

Closed rdipardo closed 1 year ago

rdipardo commented 1 year ago

SCE_H_TAGUNKNOWN bleeds into the first non-printing character adjacent to a malformed entity:

https://github.com/ScintillaOrg/lexilla/blob/5ffca5619467ae8beae9210b964eb2acdd431b29/lexers/LexHTML.cxx#L1876-L1883

In CRLF mode, this will split the EOL into different styles:

Text (SC_EOL_CRLF)
&
Computed Styles
{2}&
{0}
Visual Styles

SCE_H_TAGUNKNOWN-eol-split

Test Output
~\lexilla\test\examples\hypertext\SCE_H_TAGUNKNOWN.html:1: different styles between \r and \n at 2: 2, 0
~\lexilla\test\examples\hypertext\SCE_H_TAGUNKNOWN.html:1: has different styles with \n versus \r\n line ends

This patch introduces back-tracking to only style the visible characters of the malformed entity:

zufuliu commented 1 year ago

what about following fix (added else before if, the condition could be extracted as a function):

else if (chNext != ';' && chNext != '#' && !(IsASCII(chNext) && isalnum(chNext))    // Should check that '#' follows '&', but it is unlikely anyway...
    && chNext != '.' && chNext != '-' && chNext != '_' && chNext != ':') { // valid in XML
    styler.ColourTo(i, SCE_H_TAGUNKNOWN);
    state = SCE_H_DEFAULT;
}

SCE_H_TAGUNKNOWN can be changed to SCE_H_DEFAULT as this is not tag and using & with invalid entity name is valid.

zufuliu commented 1 year ago

what about following fix

needs extra check for chNext: https://github.com/ScintillaOrg/lexilla/blob/5ffca5619467ae8beae9210b964eb2acdd431b29/lexers/LexHTML.cxx#L1713-L1716

nyamatongwe commented 1 year ago

To fix just the initial problem, a minimum change appears to be.

@@ -1875,7 +1875,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
            }
            if (ch != '#' && !(IsASCII(ch) && isalnum(ch))  // Should check that '#' follows '&', but it is unlikely anyway...
                && ch != '.' && ch != '-' && ch != '_' && ch != ':') { // valid in XML
-               if (!IsASCII(ch))   // Possibly start of a multibyte character so don't allow this byte to be in entity style
+               if (!IsASCII(ch) || isLineEnd(ch))  // Possibly start of a multibyte character so don't allow this byte to be in entity style
                    styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
                else
                    styler.ColourTo(i, SCE_H_TAGUNKNOWN);

There may be a case for additional checks.

nyamatongwe commented 1 year ago

Looked up the definition of entities at https://www.w3.org/TR/2008/REC-xml-20081126/#sec-references.

Simplifying a bit it is:

  '&#' [0-9]+ ';'
| '&#x' [0-9a-fA-F]+ ';'
| '&' [A-Za-z:_] [A-Za-z0-9:_.-]* ';'

It also appears possible in XML to define non-ASCII entity names but that isn't done for HTML.

zufuliu commented 1 year ago

Third rule (named character reference) could be simplified to only allow alpha numeric + dot for all predefined entities https://html.spec.whatwg.org/multipage/syntax.html#character-references https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references https://www.w3.org/TR/xml-entity-names/ https://www.w3.org/2003/entities/2007/w3centities-f.ent

nyamatongwe commented 1 year ago
@@ -1162,6 +1162,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
    const CharacterSet setAttributeContinue(CharacterSet::setAlphaNum, ".-_:!#/", true);
    // TODO: also handle + and - (except if they're part of ++ or --) and return keywords
    const CharacterSet setOKBeforeJSRE(CharacterSet::setNone, "([{=,:;!%^&*|?~");
+   const CharacterSet setEntity(CharacterSet::setAlphaNum, ".#-_:");

    int levelPrev = styler.LevelAt(lineCurrent) & SC_FOLDLEVELNUMBERMASK;
    int levelCurrent = levelPrev;
@@ -1872,13 +1873,9 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
            if (ch == ';') {
                styler.ColourTo(i, StateToPrint);
                state = SCE_H_DEFAULT;
-           }
-           if (ch != '#' && !(IsASCII(ch) && isalnum(ch))  // Should check that '#' follows '&', but it is unlikely anyway...
-               && ch != '.' && ch != '-' && ch != '_' && ch != ':') { // valid in XML
-               if (!IsASCII(ch))   // Possibly start of a multibyte character so don't allow this byte to be in entity style
-                   styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
-               else
-                   styler.ColourTo(i, SCE_H_TAGUNKNOWN);
+           } else if (!setEntity.Contains(ch)) {
+               // Only allow [A-Za-z0-9.#-_:] in entities
+               styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
                state = SCE_H_DEFAULT;
            }
            break;
nyamatongwe commented 1 year ago

Scintilla bug 810 affected this code but the &— case seems OK with this change. https://sourceforge.net/p/scintilla/bugs/810/

zufuliu commented 1 year ago

how about adding continue after SCE_H_TAGUNKNOWN:

state = SCE_H_DEFAULT;
--i;
continue;

test case:

&amp<p>
&&amp;
rdipardo commented 1 year ago

how about adding continue after SCE_H_TAGUNKNOWN:

state = SCE_H_DEFAULT;
--i;
continue;

That works:

{2}&amp<p>{0}
{2}&{10}&amp;{0}
nyamatongwe commented 1 year ago

After experimenting a bit, I prefer the current styling that shows the first invalid character in 'red' so it can be seen as the problem. If it is the start of a valid tag like <p> then showing that all in blue makes it harder to see where the problem is.

Tests:

&
&amp<p>
&&amp;
&lt;
&lt;<b>
&b.epsi;
&b.epsi!
&カタナ;
&—;

Treating line ends and non-ASCII as early ends (due to character/CRLF slicing):

@@ -1162,6 +1162,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
    const CharacterSet setAttributeContinue(CharacterSet::setAlphaNum, ".-_:!#/", true);
    // TODO: also handle + and - (except if they're part of ++ or --) and return keywords
    const CharacterSet setOKBeforeJSRE(CharacterSet::setNone, "([{=,:;!%^&*|?~");
+   const CharacterSet setEntity(CharacterSet::setAlphaNum, ".#-_:");

    int levelPrev = styler.LevelAt(lineCurrent) & SC_FOLDLEVELNUMBERMASK;
    int levelCurrent = levelPrev;
@@ -1872,13 +1873,12 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
            if (ch == ';') {
                styler.ColourTo(i, StateToPrint);
                state = SCE_H_DEFAULT;
-           }
-           if (ch != '#' && !(IsASCII(ch) && isalnum(ch))  // Should check that '#' follows '&', but it is unlikely anyway...
-               && ch != '.' && ch != '-' && ch != '_' && ch != ':') { // valid in XML
-               if (!IsASCII(ch))   // Possibly start of a multibyte character so don't allow this byte to be in entity style
-                   styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
-               else
-                   styler.ColourTo(i, SCE_H_TAGUNKNOWN);
+           } else if (isLineEnd(ch) || !IsASCII(ch)) {
+               styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
+               state = SCE_H_DEFAULT;
+           } else if (!setEntity.Contains(ch)) {
+               // Only allow [A-Za-z0-9.#-_:] in entities
+               styler.ColourTo(i, SCE_H_TAGUNKNOWN);
                state = SCE_H_DEFAULT;
            }
            break;
{2}&{0}
{2}&amp<{0}p>
{2}&&{0}amp;
{10}&lt;{0}
{10}&lt;{1}<b>{0}
{10}&b.epsi;{0}
{2}&b.epsi!{0}
{2}&{0}カタナ;
{2}&{0}—;
zufuliu commented 1 year ago

with

} else if (!setEntity.Contains(ch)) {
    styler.ColourTo(i - 1, SCE_H_TAGUNKNOWN);
    state = SCE_H_DEFAULT;
    --i;
    continue;
}
<html>
&
&1
&A
&中
&amp<p>
&1<p>
&A<p>
&中<p>
&&amp;
&#1;
&#1;中
&A;<p>
&#1;<p>
&#1;中<p>
&amp
</html>

image

PS. I think it's might better to treat <p> as void tag, https://html.spec.whatwg.org/multipage/grouping-content.html#the-p-element. also from https://www.w3.org/TR/xml-entity-names/#htmlmathml

the HTML (but not XML) parser allows the trailing semicolon to be omitted. So &amp may be used as well as &amp; when using HTML.

nyamatongwe commented 1 year ago

From https://html.spec.whatwg.org/multipage/syntax.html#character-references:

Named character references The name must be one that is terminated by a U+003B SEMICOLON character (;)

I think it is better to be strict here.

rdipardo commented 1 year ago

"Strict" really means "easier to implement" in this case. It does appear that some entities in the HTML 5 specification have no semicolon in their canonical form.

$ curl -sL 'https://www.w3.org/TR/html5/entities.json' | jq -r '. | keys[] | select(index(";") | not)'
&AElig
&AMP
&Aacute
&Acirc
&Agrave
&Aring
&Atilde
&Auml
&COPY
&Ccedil
&ETH
&Eacute
&Ecirc
&Egrave
&Euml
&GT
&Iacute
&Icirc
&Igrave
&Iuml
&LT
&Ntilde
&Oacute
&Ocirc
&Ograve
&Oslash
&Otilde
&Ouml
&QUOT
&REG
&THORN
&Uacute
&Ucirc
&Ugrave
&Uuml
&Yacute
&aacute
&acirc
&acute
&aelig
&agrave
&amp
&aring
&atilde
&auml
&brvbar
&ccedil
&cedil
&cent
&copy
&curren
&deg
&divide
&eacute
&ecirc
&egrave
&eth
&euml
&frac12
&frac14
&frac34
&gt
&iacute
&icirc
&iexcl
&igrave
&iquest
&iuml
&laquo
&lt
&macr
&micro
&middot
&nbsp
&not
&ntilde
&oacute
&ocirc
&ograve
&ordf
&ordm
&oslash
&otilde
&ouml
&para
&plusmn
&pound
&quot
&raquo
&reg
&sect
&shy
&sup1
&sup2
&sup3
&szlig
&thorn
&times
&uacute
&ucirc
&ugrave
&uml
&uuml
&yacute
&yen
&yuml

But then, each of these also has a fully-delimited variant, so the most the implementation would lack is an optionally briefer syntax.

zufuliu commented 1 year ago

strict is fine. I prefer highlight only &amp (instead of &amp<) as SCE_H_TAGUNKNOWN: less code than checking eol and multi-byte, better error recovery (with --i).

nyamatongwe commented 1 year ago

@rdipardo It does appear that some entities in the HTML 5 specification have no semicolon in their canonical form.

The HTML 5 specification is the same as WHATWG: https://dev.w3.org/html5/spec-LC/syntax.html#character-references

While an entity may be defined without a semicolon, the syntax doesn't appear to allow that to be used within HTML. If it was allowed then the syntax should be defined somewhere so lexing code knows where to stop and thus what to discard and where to restart.

nyamatongwe commented 1 year ago

Folding is more stable by retreating for an invalid entity character with --i; continue; when adding an entity before a tag. Without this, the tag is temporarily invalid which means its fold level changes which may then lead to automatic unfolding. Because of this, the retreat code is included in 55c00616e7938b24a49aabfeb4755653e389051e.