Closed rdipardo closed 1 year ago
what about following fix (added else
before if
, the condition could be extracted as a function):
else if (chNext != ';' && chNext != '#' && !(IsASCII(chNext) && isalnum(chNext)) // Should check that '#' follows '&', but it is unlikely anyway...
&& chNext != '.' && chNext != '-' && chNext != '_' && chNext != ':') { // valid in XML
styler.ColourTo(i, SCE_H_TAGUNKNOWN);
state = SCE_H_DEFAULT;
}
SCE_H_TAGUNKNOWN
can be changed to SCE_H_DEFAULT
as this is not tag and using &
with invalid entity name is valid.
what about following fix
needs extra check for chNext
:
https://github.com/ScintillaOrg/lexilla/blob/5ffca5619467ae8beae9210b964eb2acdd431b29/lexers/LexHTML.cxx#L1713-L1716
To fix just the initial problem, a minimum change appears to be.
@@ -1875,7 +1875,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
}
if (ch != '#' && !(IsASCII(ch) && isalnum(ch)) // Should check that '#' follows '&', but it is unlikely anyway...
&& ch != '.' && ch != '-' && ch != '_' && ch != ':') { // valid in XML
- if (!IsASCII(ch)) // Possibly start of a multibyte character so don't allow this byte to be in entity style
+ if (!IsASCII(ch) || isLineEnd(ch)) // Possibly start of a multibyte character so don't allow this byte to be in entity style
styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
else
styler.ColourTo(i, SCE_H_TAGUNKNOWN);
There may be a case for additional checks.
Looked up the definition of entities at https://www.w3.org/TR/2008/REC-xml-20081126/#sec-references.
Simplifying a bit it is:
'&#' [0-9]+ ';'
| '&#x' [0-9a-fA-F]+ ';'
| '&' [A-Za-z:_] [A-Za-z0-9:_.-]* ';'
It also appears possible in XML to define non-ASCII entity names but that isn't done for HTML.
Third rule (named character reference) could be simplified to only allow alpha numeric + dot for all predefined entities https://html.spec.whatwg.org/multipage/syntax.html#character-references https://html.spec.whatwg.org/multipage/named-characters.html#named-character-references https://www.w3.org/TR/xml-entity-names/ https://www.w3.org/2003/entities/2007/w3centities-f.ent
@@ -1162,6 +1162,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
const CharacterSet setAttributeContinue(CharacterSet::setAlphaNum, ".-_:!#/", true);
// TODO: also handle + and - (except if they're part of ++ or --) and return keywords
const CharacterSet setOKBeforeJSRE(CharacterSet::setNone, "([{=,:;!%^&*|?~");
+ const CharacterSet setEntity(CharacterSet::setAlphaNum, ".#-_:");
int levelPrev = styler.LevelAt(lineCurrent) & SC_FOLDLEVELNUMBERMASK;
int levelCurrent = levelPrev;
@@ -1872,13 +1873,9 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
if (ch == ';') {
styler.ColourTo(i, StateToPrint);
state = SCE_H_DEFAULT;
- }
- if (ch != '#' && !(IsASCII(ch) && isalnum(ch)) // Should check that '#' follows '&', but it is unlikely anyway...
- && ch != '.' && ch != '-' && ch != '_' && ch != ':') { // valid in XML
- if (!IsASCII(ch)) // Possibly start of a multibyte character so don't allow this byte to be in entity style
- styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
- else
- styler.ColourTo(i, SCE_H_TAGUNKNOWN);
+ } else if (!setEntity.Contains(ch)) {
+ // Only allow [A-Za-z0-9.#-_:] in entities
+ styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
state = SCE_H_DEFAULT;
}
break;
Scintilla bug 810 affected this code but the &—
case seems OK with this change.
https://sourceforge.net/p/scintilla/bugs/810/
how about adding continue
after SCE_H_TAGUNKNOWN
:
state = SCE_H_DEFAULT;
--i;
continue;
test case:
&<p>
&&
how about adding
continue
afterSCE_H_TAGUNKNOWN
:state = SCE_H_DEFAULT; --i; continue;
That works:
{2}&<p>{0}
{2}&{10}&{0}
After experimenting a bit, I prefer the current styling that shows the first invalid character in 'red' so it can be seen as the problem. If it is the start of a valid tag like <p>
then showing that all in blue makes it harder to see where the problem is.
Tests:
&
&<p>
&&
<
<<b>
&b.epsi;
&b.epsi!
&カタナ;
&—;
Treating line ends and non-ASCII as early ends (due to character/CRLF slicing):
@@ -1162,6 +1162,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
const CharacterSet setAttributeContinue(CharacterSet::setAlphaNum, ".-_:!#/", true);
// TODO: also handle + and - (except if they're part of ++ or --) and return keywords
const CharacterSet setOKBeforeJSRE(CharacterSet::setNone, "([{=,:;!%^&*|?~");
+ const CharacterSet setEntity(CharacterSet::setAlphaNum, ".#-_:");
int levelPrev = styler.LevelAt(lineCurrent) & SC_FOLDLEVELNUMBERMASK;
int levelCurrent = levelPrev;
@@ -1872,13 +1873,12 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
if (ch == ';') {
styler.ColourTo(i, StateToPrint);
state = SCE_H_DEFAULT;
- }
- if (ch != '#' && !(IsASCII(ch) && isalnum(ch)) // Should check that '#' follows '&', but it is unlikely anyway...
- && ch != '.' && ch != '-' && ch != '_' && ch != ':') { // valid in XML
- if (!IsASCII(ch)) // Possibly start of a multibyte character so don't allow this byte to be in entity style
- styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
- else
- styler.ColourTo(i, SCE_H_TAGUNKNOWN);
+ } else if (isLineEnd(ch) || !IsASCII(ch)) {
+ styler.ColourTo(i-1, SCE_H_TAGUNKNOWN);
+ state = SCE_H_DEFAULT;
+ } else if (!setEntity.Contains(ch)) {
+ // Only allow [A-Za-z0-9.#-_:] in entities
+ styler.ColourTo(i, SCE_H_TAGUNKNOWN);
state = SCE_H_DEFAULT;
}
break;
{2}&{0}
{2}&<{0}p>
{2}&&{0}amp;
{10}<{0}
{10}<{1}<b>{0}
{10}&b.epsi;{0}
{2}&b.epsi!{0}
{2}&{0}カタナ;
{2}&{0}—;
with
} else if (!setEntity.Contains(ch)) {
styler.ColourTo(i - 1, SCE_H_TAGUNKNOWN);
state = SCE_H_DEFAULT;
--i;
continue;
}
<html>
&
&1
&A
&中
&<p>
&1<p>
&A<p>
&中<p>
&&

中
&A;<p>
<p>
中<p>
&
</html>
PS. I think it's might better to treat <p>
as void tag, https://html.spec.whatwg.org/multipage/grouping-content.html#the-p-element. also from https://www.w3.org/TR/xml-entity-names/#htmlmathml
the HTML (but not XML) parser allows the trailing semicolon to be omitted. So
&
may be used as well as&
when using HTML.
From https://html.spec.whatwg.org/multipage/syntax.html#character-references:
Named character references The name must be one that is terminated by a U+003B SEMICOLON character (;)
I think it is better to be strict here.
"Strict" really means "easier to implement" in this case. It does appear that some entities in the HTML 5 specification have no semicolon in their canonical form.
$ curl -sL 'https://www.w3.org/TR/html5/entities.json' | jq -r '. | keys[] | select(index(";") | not)'
Æ
&
Á
Â
À
Å
Ã
Ä
©
Ç
Ð
É
Ê
È
Ë
>
Í
Î
Ì
Ï
<
Ñ
Ó
Ô
Ò
Ø
Õ
Ö
"
®
Þ
Ú
Û
Ù
Ü
Ý
á
â
´
æ
à
&
å
ã
ä
¦
ç
¸
¢
©
¤
°
÷
é
ê
è
ð
ë
½
¼
¾
>
í
î
¡
ì
¿
ï
«
<
¯
µ
·
 
¬
ñ
ó
ô
ò
ª
º
ø
õ
ö
¶
±
£
"
»
®
§
­
¹
²
³
ß
þ
×
ú
û
ù
¨
ü
ý
¥
ÿ
But then, each of these also has a fully-delimited variant, so the most the implementation would lack is an optionally briefer syntax.
strict is fine. I prefer highlight only &
(instead of &<
) as SCE_H_TAGUNKNOWN: less code than checking eol and multi-byte, better error recovery (with --i
).
@rdipardo It does appear that some entities in the HTML 5 specification have no semicolon in their canonical form.
The HTML 5 specification is the same as WHATWG: https://dev.w3.org/html5/spec-LC/syntax.html#character-references
While an entity may be defined without a semicolon, the syntax doesn't appear to allow that to be used within HTML. If it was allowed then the syntax should be defined somewhere so lexing code knows where to stop and thus what to discard and where to restart.
Folding is more stable by retreating for an invalid entity character with --i; continue;
when adding an entity before a tag. Without this, the tag is temporarily invalid which means its fold level changes which may then lead to automatic unfolding. Because of this, the retreat code is included in 55c00616e7938b24a49aabfeb4755653e389051e.
SCE_H_TAGUNKNOWN
bleeds into the first non-printing character adjacent to a malformed entity:https://github.com/ScintillaOrg/lexilla/blob/5ffca5619467ae8beae9210b964eb2acdd431b29/lexers/LexHTML.cxx#L1876-L1883
In CRLF mode, this will split the EOL into different styles:
Text (
SC_EOL_CRLF
)Computed Styles
Visual Styles
Test Output
This patch introduces back-tracking to only style the visible characters of the malformed entity:
0001-192-Prevent-styling-EOLs-as-SCE_H_TAGUNKNOWN.patch.txt
0001-Test-Prevent-styling-EOLs-as-SCE_H_TAGUNKNOWN.diff.txt