Closed zufuliu closed 1 month ago
- Don't terminate SGML when
>
is inside quoted string (as shown in above XML).
@@ -1757,7 +1757,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
else if ((!isMako && !isDjango && ((inScriptType == eNonHtmlPreProc) || (inScriptType == eNonHtmlScriptPreProc)) &&
(((scriptLanguage != eScriptNone) && stateAllowsTermination(state))) &&
((chNext == '>') && isPreProcessorEndTag(state, ch))) ||
- ((scriptLanguage == eScriptSGML) && (ch == '>') && (state != SCE_H_SGML_COMMENT))) {
+ ((scriptLanguage == eScriptSGML) && (ch == '>') && !AnyOf(state, SCE_H_SGML_COMMENT, SCE_H_SGML_DOUBLESTRING, SCE_H_SGML_SIMPLESTRING))) {
if (state == SCE_H_ASPAT) {
aspScript = segIsScriptingIndicator(styler,
styler.GetStartSegment(), i - 1, aspScript);
There appears to be unstable lexing of the example, depending on where a lex starts leading to TestLexers reporting Issue272SGML.xml:4: has different per-line styles
. Sometimes the SCE_H_SGML_DOUBLESTRING
terminates at the >
and sometimes it doesn't.
There appears to be unstable lexing of the example
Seems unrelated to the change, press Enter before <!ENTITY example
already breaks styling (however Notepad4 is OK).
Find another test file (for missing underscore in SCE_H_SGML_ENTITY
):
xmlts20130923\xmlconf\ibm\valid\P28\ibm28v02.xml
<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE animal SYSTEM "ibm28v02.dtd" [
<!NOTATION animal_class SYSTEM "ibm28v02.txt">
<!ENTITY forcat "This is a small cat">
<!ELEMENT tiger (#PCDATA)>
<!ENTITY % make_small "<!ELEMENT small EMPTY>">
<!ENTITY % make_leopard_element "<!ELEMENT leopard ANY>">
<!ENTITY % make_attlist "<!ATTLIST tiger color CDATA #REQUIRED>">
%make_leopard_element;
<!ELEMENT cat ANY>
%make_small;
<!ENTITY % make_big "<!ELEMENT big EMPTY>">
%make_big;
%make_attlist;
<?sound "This is a PI" ?>
<!-- This is a valid test file for p28 -->
]>
<animal>
<cat>&forcat;</cat>
<tiger color="white">This is a white tiger in Mirage!!</tiger>
<cat/>
<leopard>
<small/>
<big/>
</leopard>
</animal>
A combined patch that fixed all bug (except nested section <![xxx[ ... ]]>
, but include unstable lexing) , Note <!-- comment -->
inside eScriptSGMLblock [ ]
is changed to SCE_H_COMMENT
instead of <! SCE_H_SGML_DEFAULT !>
+ -- SCE_H_SGML_COMMENT --
, e.g. mostly treat SCE_H_SGML_BLOCK_DEFAULT
as SCE_H_DEFAULT
, treat content inside [ ]
as included from external DTD file.
It may be possible to preserve the current approach more with this patch that rewinds when a lex starts inside an SGML string. InTagState
may be renamed to something like InRewindState
.
It makes more SGML styles revert to SCE_H_SGML_BLOCK_DEFAULT
for eScriptSGMLblock
. This would actually call defaultStateForSGML
but that would be introduced in a preparatory commit.
There's also a change for empty word checking in isWordHSGML
as that causes assertion failures.
diff --git a/lexers/LexHTML.cxx b/lexers/LexHTML.cxx
index 9bfb63e4..a15b6e96 100644
--- a/lexers/LexHTML.cxx
+++ b/lexers/LexHTML.cxx
@@ -461,6 +461,8 @@ void classifyWordHTPHP(Sci_PositionU start, Sci_PositionU end, const WordList &k
}
bool isWordHSGML(Sci_PositionU start, Sci_PositionU end, const WordList &keywords, const Accessor &styler) {
+ if (end < start)
+ return false;
const std::string s = styler.GetRange(start, end + 1);
return keywords.InList(s);
}
@@ -515,7 +517,8 @@ constexpr bool InTagState(int state) noexcept {
return AnyOf(state, SCE_H_TAG, SCE_H_TAGUNKNOWN, SCE_H_SCRIPT,
SCE_H_ATTRIBUTE, SCE_H_ATTRIBUTEUNKNOWN,
SCE_H_NUMBER, SCE_H_OTHER,
- SCE_H_DOUBLESTRING, SCE_H_SINGLESTRING);
+ SCE_H_DOUBLESTRING, SCE_H_SINGLESTRING,
+ SCE_H_SGML_DOUBLESTRING, SCE_H_SGML_SIMPLESTRING);
}
constexpr bool IsCommentState(const int state) noexcept {
@@ -1212,7 +1215,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
length += startPos - backLineStart;
startPos = backLineStart;
}
- state = SCE_H_DEFAULT;
+ state = (startPos == 0) ? SCE_H_DEFAULT : styler.StyleIndexAt(startPos - 1);
}
// String can be heredoc, must find a delimiter first. Reread from beginning of line containing the string, to get the correct lineState
if (isPHPStringState(state)) {
@@ -1941,7 +1944,11 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
case SCE_H_SGML_DOUBLESTRING:
if (ch == '\"') {
styler.ColourTo(i, StateToPrint);
- state = SCE_H_SGML_DEFAULT;
+ if (scriptLanguage == eScriptSGMLblock) {
+ state = SCE_H_SGML_BLOCK_DEFAULT;
+ } else {
+ state = SCE_H_SGML_DEFAULT;
+ }
}
break;
case SCE_H_SGML_SIMPLESTRING:
@@ -1953,7 +1960,11 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
case SCE_H_SGML_COMMENT:
if ((ch == '-') && (chPrev == '-')) {
styler.ColourTo(i, StateToPrint);
- state = SCE_H_SGML_DEFAULT;
+ if (scriptLanguage == eScriptSGMLblock) {
+ state = SCE_H_SGML_BLOCK_DEFAULT;
+ } else {
+ state = SCE_H_SGML_DEFAULT;
+ }
}
break;
case SCE_H_CDATA:
@@ -1987,7 +1998,11 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
if (IsAlphaNumeric(ch)) {
state = SCE_H_SGML_ERROR;
} else {
- state = SCE_H_SGML_DEFAULT;
+ if (scriptLanguage == eScriptSGMLblock) {
+ state = SCE_H_SGML_BLOCK_DEFAULT;
+ } else {
+ state = SCE_H_SGML_DEFAULT;
+ }
}
}
break;
Rewind.patch does fix the unstable lexing, my approach makes things clean:
<! >
block is eScriptSGML
.[ ]
inside eScriptSGML
starts eScriptSGMLblock
.<! >
inside eScriptSGMLblock
starts inner eScriptSGML
.HTML-SGML-DTD-0912.patch Whole patch with nested section, backtracking for SGML is not required.
In the SGML and DTD documentation there doesn't appear to be a special type of comment other than standard XML comments so the intention of SCE_H_SGML_COMMENT
is to style comments within SGML blocks and that was its previous behaviour. This may make an initial DTD preamble appear more like a separate piece from the main document. The patch uses SCE_H_COMMENT
for comments inside SGML blocks instead.
The previous styling of these comments can be reverted with something like adding this around line 1688:
state = (scriptLanguage == eScriptSGMLblock) ? SCE_H_SGML_COMMENT : SCE_H_COMMENT;
and then returning with state = beforePreProc;
when terminating SCE_H_SGML_COMMENT
.
Without this SCE_H_SGML_COMMENT
is no longer useful although it can still be reached with text like this invalid example:
<!DOCTYPE root [
<!ELEMENT root -- see $372 -- (p)>
]>
Without this
SCE_H_SGML_COMMENT
is no longer useful although it can still be reached with text like this invalid example:
this kind of code is used in HTML4's strict.dtd and loose.dtd.
Patch reverted to use SCE_H_SGML_COMMENT
.
HTML-SGML-DTD-0913.patch
Trying to move the changes inside handle the start of SGML language (DTD)
into else if (scriptLanguage == eScriptSGMLblock)
, the code does not work, so I removed this unreachable block.
https://github.com/ScintillaOrg/lexilla/blob/75ae9070e93c6476e17bf461855753b6559ca544/lexers/LexHTML.cxx#L1880-L1890
Personally, for SVG DTD, it looks better to highlight inner comment <!-- SVG 1.1 Scripting Module .............................................. -->
with same color as outer comment <!-- Scripting Module ............................................ -->
, as in screenshot in previous post.
There's also a change for empty word checking in
isWordHSGML
as that causes assertion failures.
even with the assertion fix, here still has bug when not highlighting inner <!-- -->
as SCE_H_COMMENT
:
here still has bug when not highlighting inner
<!-- -->
asSCE_H_COMMENT
:
Following change fixed the assertion failure and comment highlighting bug.
- state = SCE_H_SGML_COMMAND; // wait for a pending command
+ if ((chNext == '-') && (chNext2 == '-')) {
+ i += 2;
+ state = SCE_H_SGML_COMMENT;
+ } else {
+ state = (chNext == '[') ? SCE_H_SGML_DEFAULT : SCE_H_SGML_COMMAND; // wait for a pending command
+ }
The condition to enter SGML can be simplified:
@@ -1678,10 +1674,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
else if (AnyOf(scriptLanguage, eScriptNone, eScriptXML, eScriptSGMLblock) &&
(chPrev == '<') &&
(ch == '!') &&
- (StateToPrint != SCE_H_CDATA) &&
- (!isStringState(StateToPrint)) &&
- (!IsCommentState(StateToPrint)) &&
- (!IsScriptCommentState(StateToPrint))) {
+ AnyOf(state, SCE_H_DEFAULT, SCE_H_SGML_BLOCK_DEFAULT)) {
beforePreProc = state;
styler.ColourTo(i - 2, StateToPrint);
if ((state != SCE_H_SGML_BLOCK_DEFAULT) && (chNext == '-') && (chNext2 == '-')) {
Also reverted unintended change to case SCE_H_SGML_1ST_PARAM:
@@ -1910,7 +1903,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
break;
case SCE_H_SGML_1ST_PARAM:
// wait for the beginning of the word
- if ((ch == '-') && (chNext == '-')) {
+ if ((ch == '-') && (chPrev == '-')) {
styler.ColourTo(i - 2, SCE_H_SGML_DEFAULT);
state = SCE_H_SGML_1ST_PARAM_COMMENT;
} else if (issgmlwordchar(ch)) {
Created for https://sourceforge.net/p/scintilla/feature-requests/1341/
DTD test files: https://www.w3.org/TR/xml11/#prod-notes contains https://www.w3.org/2002/xmlspec/dtd/2.10/xmlspec.dtd
https://www.w3.org/TR/html4/sgml/intro.html contains https://www.w3.org/TR/html4/strict.dtd and https://www.w3.org/TR/html4/loose.dtd
https://www.w3.org/TR/SVG11/svgdtd.html#Introduction contains https://www.w3.org/Graphics/SVG/1.1/DTD/svg11-flat-20110816.dtd
XML test files from https://www.w3.org/XML/Test/ https://www.w3.org/XML/Test/xmlts20130923.zip
xmlts20130923\xmlconf\sun\valid\pe03.xml
Here are three issues:
>
is inside quoted string (as shown in above XML).[
and]
as same style to make brace matching work.<![xxx[ ... ]]>
.