ScintillaOrg / lexilla

A library of language lexers for use with Scintilla
https://www.scintilla.org/Lexilla.html
Other
186 stars 67 forks source link

[HTML/XML] Improve SGML/DTD highlighting #272

Closed zufuliu closed 1 month ago

zufuliu commented 2 months ago

Created for https://sourceforge.net/p/scintilla/feature-requests/1341/

DTD test files: https://www.w3.org/TR/xml11/#prod-notes contains https://www.w3.org/2002/xmlspec/dtd/2.10/xmlspec.dtd

https://www.w3.org/TR/html4/sgml/intro.html contains https://www.w3.org/TR/html4/strict.dtd and https://www.w3.org/TR/html4/loose.dtd

https://www.w3.org/TR/SVG11/svgdtd.html#Introduction contains https://www.w3.org/Graphics/SVG/1.1/DTD/svg11-flat-20110816.dtd

XML test files from https://www.w3.org/XML/Test/ https://www.w3.org/XML/Test/xmlts20130923.zip

xmlts20130923\xmlconf\sun\valid\pe03.xml

<!DOCTYPE root [
<!ELEMENT root (p)>
<!ELEMENT p (#PCDATA)>
<!-- Example 1 from XML spec 1.0 Appendix D -->
<!ENTITY example "<p>An ampersand (&#38;#38;) may be escaped
numerically (&#38;#38;#38) or with a general entity (&amp;amp;).</p>" >
]>
<root>&example;</root>

Here are three issues:

  1. Don't terminate SGML when > is inside quoted string (as shown in above XML).
  2. Color [ and ] as same style to make brace matching work.
  3. Nested conational section or sub-section <![xxx[ ... ]]>.
zufuliu commented 2 months ago
  1. Don't terminate SGML when > is inside quoted string (as shown in above XML).

SGML--terminate-0906.patch

https://github.com/ScintillaOrg/lexilla/blob/81a84bd4169d3c27947d2af15d2c126fca85be75/lexers/LexHTML.cxx#L1759-L1761

@@ -1757,7 +1757,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
        else if ((!isMako && !isDjango && ((inScriptType == eNonHtmlPreProc) || (inScriptType == eNonHtmlScriptPreProc)) &&
                  (((scriptLanguage != eScriptNone) && stateAllowsTermination(state))) &&
                  ((chNext == '>') && isPreProcessorEndTag(state, ch))) ||
-                ((scriptLanguage == eScriptSGML) && (ch == '>') && (state != SCE_H_SGML_COMMENT))) {
+                ((scriptLanguage == eScriptSGML) && (ch == '>') && !AnyOf(state, SCE_H_SGML_COMMENT, SCE_H_SGML_DOUBLESTRING, SCE_H_SGML_SIMPLESTRING))) {
            if (state == SCE_H_ASPAT) {
                aspScript = segIsScriptingIndicator(styler,
                                                    styler.GetStartSegment(), i - 1, aspScript);
nyamatongwe commented 2 months ago

There appears to be unstable lexing of the example, depending on where a lex starts leading to TestLexers reporting Issue272SGML.xml:4: has different per-line styles. Sometimes the SCE_H_SGML_DOUBLESTRING terminates at the > and sometimes it doesn't.

zufuliu commented 2 months ago

There appears to be unstable lexing of the example

Seems unrelated to the change, press Enter before <!ENTITY example already breaks styling (however Notepad4 is OK). image

Find another test file (for missing underscore in SCE_H_SGML_ENTITY): xmlts20130923\xmlconf\ibm\valid\P28\ibm28v02.xml

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE animal SYSTEM "ibm28v02.dtd" [
   <!NOTATION animal_class SYSTEM "ibm28v02.txt">
   <!ENTITY forcat "This is a small cat">
   <!ELEMENT tiger (#PCDATA)>
   <!ENTITY % make_small "<!ELEMENT small EMPTY>">
   <!ENTITY % make_leopard_element "<!ELEMENT leopard ANY>">
   <!ENTITY % make_attlist "<!ATTLIST tiger color CDATA #REQUIRED>">
   %make_leopard_element; 
   <!ELEMENT cat ANY>
   %make_small;
   <!ENTITY % make_big "<!ELEMENT big EMPTY>">
   %make_big;
   %make_attlist;
   <?sound "This is a PI" ?>
   <!-- This is a valid test file for p28 -->
]>
<animal>
   <cat>&forcat;</cat>
   <tiger color="white">This is a white tiger in Mirage!!</tiger>
   <cat/>
   <leopard>
      <small/>
      <big/>
   </leopard>
</animal>
zufuliu commented 2 months ago

HTML-SGML-DTD-0911.patch

A combined patch that fixed all bug (except nested section <![xxx[ ... ]]>, but include unstable lexing) , Note <!-- comment --> inside eScriptSGMLblock [ ] is changed to SCE_H_COMMENT instead of <! SCE_H_SGML_DEFAULT !> + -- SCE_H_SGML_COMMENT --, e.g. mostly treat SCE_H_SGML_BLOCK_DEFAULT as SCE_H_DEFAULT, treat content inside [ ] as included from external DTD file.

nyamatongwe commented 2 months ago

It may be possible to preserve the current approach more with this patch that rewinds when a lex starts inside an SGML string. InTagState may be renamed to something like InRewindState.

Rewind.patch

It makes more SGML styles revert to SCE_H_SGML_BLOCK_DEFAULT for eScriptSGMLblock. This would actually call defaultStateForSGML but that would be introduced in a preparatory commit.

There's also a change for empty word checking in isWordHSGML as that causes assertion failures.

diff --git a/lexers/LexHTML.cxx b/lexers/LexHTML.cxx
index 9bfb63e4..a15b6e96 100644
--- a/lexers/LexHTML.cxx
+++ b/lexers/LexHTML.cxx
@@ -461,6 +461,8 @@ void classifyWordHTPHP(Sci_PositionU start, Sci_PositionU end, const WordList &k
 }

 bool isWordHSGML(Sci_PositionU start, Sci_PositionU end, const WordList &keywords, const Accessor &styler) {
+   if (end < start)
+       return false;
    const std::string s = styler.GetRange(start, end + 1);
    return keywords.InList(s);
 }
@@ -515,7 +517,8 @@ constexpr bool InTagState(int state) noexcept {
    return AnyOf(state, SCE_H_TAG, SCE_H_TAGUNKNOWN, SCE_H_SCRIPT,
           SCE_H_ATTRIBUTE, SCE_H_ATTRIBUTEUNKNOWN,
           SCE_H_NUMBER, SCE_H_OTHER,
-          SCE_H_DOUBLESTRING, SCE_H_SINGLESTRING);
+          SCE_H_DOUBLESTRING, SCE_H_SINGLESTRING,
+          SCE_H_SGML_DOUBLESTRING, SCE_H_SGML_SIMPLESTRING);
 }

 constexpr bool IsCommentState(const int state) noexcept {
@@ -1212,7 +1215,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
            length += startPos - backLineStart;
            startPos = backLineStart;
        }
-       state = SCE_H_DEFAULT;
+       state = (startPos == 0) ? SCE_H_DEFAULT : styler.StyleIndexAt(startPos - 1);
    }
    // String can be heredoc, must find a delimiter first. Reread from beginning of line containing the string, to get the correct lineState
    if (isPHPStringState(state)) {
@@ -1941,7 +1944,11 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
        case SCE_H_SGML_DOUBLESTRING:
            if (ch == '\"') {
                styler.ColourTo(i, StateToPrint);
-               state = SCE_H_SGML_DEFAULT;
+               if (scriptLanguage == eScriptSGMLblock) {
+                   state = SCE_H_SGML_BLOCK_DEFAULT;
+               } else {
+                   state = SCE_H_SGML_DEFAULT;
+               }
            }
            break;
        case SCE_H_SGML_SIMPLESTRING:
@@ -1953,7 +1960,11 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
        case SCE_H_SGML_COMMENT:
            if ((ch == '-') && (chPrev == '-')) {
                styler.ColourTo(i, StateToPrint);
-               state = SCE_H_SGML_DEFAULT;
+               if (scriptLanguage == eScriptSGMLblock) {
+                   state = SCE_H_SGML_BLOCK_DEFAULT;
+               } else {
+                   state = SCE_H_SGML_DEFAULT;
+               }
            }
            break;
        case SCE_H_CDATA:
@@ -1987,7 +1998,11 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
                if (IsAlphaNumeric(ch)) {
                    state = SCE_H_SGML_ERROR;
                } else {
-                   state = SCE_H_SGML_DEFAULT;
+                   if (scriptLanguage == eScriptSGMLblock) {
+                       state = SCE_H_SGML_BLOCK_DEFAULT;
+                   } else {
+                       state = SCE_H_SGML_DEFAULT;
+                   }
                }
            }
            break;
zufuliu commented 2 months ago

Rewind.patch does fix the unstable lexing, my approach makes things clean:

  1. <! > block is eScriptSGML.
  2. [ ] inside eScriptSGML starts eScriptSGMLblock.
  3. <! > inside eScriptSGMLblock starts inner eScriptSGML.
zufuliu commented 2 months ago

HTML-SGML-DTD-0912.patch Whole patch with nested section, backtracking for SGML is not required. image

nyamatongwe commented 2 months ago

In the SGML and DTD documentation there doesn't appear to be a special type of comment other than standard XML comments so the intention of SCE_H_SGML_COMMENT is to style comments within SGML blocks and that was its previous behaviour. This may make an initial DTD preamble appear more like a separate piece from the main document. The patch uses SCE_H_COMMENT for comments inside SGML blocks instead.

The previous styling of these comments can be reverted with something like adding this around line 1688:

state = (scriptLanguage == eScriptSGMLblock) ? SCE_H_SGML_COMMENT : SCE_H_COMMENT;

and then returning with state = beforePreProc; when terminating SCE_H_SGML_COMMENT.

Without this SCE_H_SGML_COMMENT is no longer useful although it can still be reached with text like this invalid example:

<!DOCTYPE root [
<!ELEMENT root -- see $372 -- (p)>
]>
zufuliu commented 2 months ago

Without this SCE_H_SGML_COMMENT is no longer useful although it can still be reached with text like this invalid example:

this kind of code is used in HTML4's strict.dtd and loose.dtd. image

Patch reverted to use SCE_H_SGML_COMMENT. HTML-SGML-DTD-0913.patch

image

Trying to move the changes inside handle the start of SGML language (DTD) into else if (scriptLanguage == eScriptSGMLblock), the code does not work, so I removed this unreachable block. https://github.com/ScintillaOrg/lexilla/blob/75ae9070e93c6476e17bf461855753b6559ca544/lexers/LexHTML.cxx#L1880-L1890

Personally, for SVG DTD, it looks better to highlight inner comment <!-- SVG 1.1 Scripting Module .............................................. --> with same color as outer comment <!-- Scripting Module ............................................ -->, as in screenshot in previous post.

zufuliu commented 2 months ago

There's also a change for empty word checking in isWordHSGML as that causes assertion failures.

even with the assertion fix, here still has bug when not highlighting inner <!-- --> as SCE_H_COMMENT: image

zufuliu commented 2 months ago

HTML-SGML-DTD-0915.patch

here still has bug when not highlighting inner <!-- --> as SCE_H_COMMENT:

Following change fixed the assertion failure and comment highlighting bug.

-               state = SCE_H_SGML_COMMAND; // wait for a pending command
+               if ((chNext == '-') && (chNext2 == '-')) {
+                   i += 2;
+                   state = SCE_H_SGML_COMMENT;
+               } else {
+                   state = (chNext == '[') ? SCE_H_SGML_DEFAULT : SCE_H_SGML_COMMAND; // wait for a pending command
+               }
zufuliu commented 1 month ago

HTML-SGML-0923.patch

The condition to enter SGML can be simplified:

@@ -1678,10 +1674,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
        else if (AnyOf(scriptLanguage, eScriptNone, eScriptXML, eScriptSGMLblock) &&
                 (chPrev == '<') &&
                 (ch == '!') &&
-                (StateToPrint != SCE_H_CDATA) &&
-                (!isStringState(StateToPrint)) &&
-                (!IsCommentState(StateToPrint)) &&
-                (!IsScriptCommentState(StateToPrint))) {
+                AnyOf(state, SCE_H_DEFAULT, SCE_H_SGML_BLOCK_DEFAULT)) {
            beforePreProc = state;
            styler.ColourTo(i - 2, StateToPrint);
            if ((state != SCE_H_SGML_BLOCK_DEFAULT) && (chNext == '-') && (chNext2 == '-')) {

Also reverted unintended change to case SCE_H_SGML_1ST_PARAM:

@@ -1910,7 +1903,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
            break;
        case SCE_H_SGML_1ST_PARAM:
            // wait for the beginning of the word
-           if ((ch == '-') && (chNext == '-')) {
+           if ((ch == '-') && (chPrev == '-')) {
                styler.ColourTo(i - 2, SCE_H_SGML_DEFAULT);
                state = SCE_H_SGML_1ST_PARAM_COMMENT;
            } else if (issgmlwordchar(ch)) {