ScintillaOrg / lexilla

A library of language lexers for use with Scintilla
https://www.scintilla.org/Lexilla.html
Other
175 stars 64 forks source link

[HTML] Fix XML processing instruction handling #269

Open zufuliu opened 2 weeks ago

zufuliu commented 2 weeks ago

PrintScriptingIndicatorOffset-0828.patch

First place is PrintScriptingIndicatorOffset(styler, styler.GetStartSegment() + 2, i + 6);, it can be fixed by change PrintScriptingIndicatorOffset() to following:

int PrintScriptingIndicatorOffset(Accessor &styler, Sci_PositionU start) {
    int iResult = 0;
    if (styler.MatchIgnoreCase(start, "php")) {
        iResult = 3;
    }
    return iResult;
}

segIsScriptingIndicator-0828.patch

Second place is scriptLanguage = segIsScriptingIndicator(styler, i + 2, i + 6, isXml ? eScriptXML : eScriptPHP);, as it only make sense to handle <?php and <?xml at this position (rest of segIsScriptingIndicator() is used to detect script language from language or type attribute value), it seems better to move Contains(s, "php") and Contains(s, "xml") cases to a new function, e.g.:

script_type segIsScriptingIndicator(Accessor &styler, Sci_PositionU start, bool isXml) {
    if (styler.MatchIgnoreCase(start, "php")) {
        return eScriptPHP;
    }
    if (styler.MatchIgnoreCase(start, "xml")) {
        return eScriptXML;
    }
    return isXml ? eScriptXML : eScriptPHP;
}

I don't know the purpose of checking space after xml, or any XML processing instruction with xml prefix, so removed IsASpace(). <script language="php"></script> was removed in PHP 7 (see https://wiki.php.net/rfc/remove_alternative_php_tags), so Contains(s, "php") can be removed from old segIsScriptingIndicator().

nyamatongwe commented 2 weeks ago

It's probably worth improving out-of-range behaviour to be more reasonable. That makes it easier to treat end of document the same as other positions. For the string-returning GetRange and GetRangeLowered, restricting the end to the end of the document should be OK.

std::string LexAccessor::GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_) {
    const Sci_PositionU endRange = std::min(endPos_, static_cast<Sci_PositionU>(lenDoc));
    assert(startPos_ < endRange);
    const Sci_PositionU len = endRange - startPos_;
    std::string s(len, '\0');
    GetRangeLowered(startPos_, endRange, s.data(), len + 1);
    return s;
}

For the char-buffer writing versions, filling the array with NUL then retrieving as much as possible may be OK.

zufuliu commented 2 weeks ago

or any XML processing instruction with xml prefix, so removed IsASpace().

from https://www.w3.org/TR/xml11/#sec-pi, xml prefixed processing instructions are reserved.

The target names "XML", "xml", and so on are reserved for standardization in this or future versions of this specification.

Currently here is <?xml-stylesheet ?>, https://www.w3.org/TR/xml-stylesheet/

<?xml-stylesheet href="common.css"?>
<?xml-stylesheet href="default.css" title="Default style"?>
<?xml-stylesheet alternate="yes" href="alt.css" title="Alternative style"?>
<?xml-stylesheet href="single-col.css" media="all and (max-width: 30em)"?>
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <title>Example with xml-stylesheet processing instructions</title>
 </head>
 <body>
  ...
 </body>
</html>

I think it (and other xml prefixed instructions) should be handled same as <?xml version="1.0" encoding="utf-8"?>, so IsASpace() can be removed.


For the char-buffer writing versions, filling the array with NUL then retrieving as much as possible may be OK.

changes like following? it will do some cheap redundant works already done for string-returning versions.

@@ -32,7 +32,9 @@ bool LexAccessor::MatchIgnoreCase(Sci_Position pos, const char *s) {
 void LexAccessor::GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len) {
    assert(s);
    assert(startPos_ <= endPos_ && len != 0);
+   memset(s, '\0', len);
    endPos_ = std::min(endPos_, startPos_ + len - 1);
+   endPos_ = std::min(endPos_, static_cast<Sci_PositionU>(lenDoc));
    len = endPos_ - startPos_;
    if (startPos_ >= static_cast<Sci_PositionU>(startPos) && endPos_ <= static_cast<Sci_PositionU>(endPos)) {
        const char * const p = buf + (startPos_ - startPos);
@@ -40,7 +42,6 @@ void LexAccessor::GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_, char
    } else {
        pAccess->GetCharRange(s, startPos_, len);
    }
-   s[len] = '\0';
 }

 void LexAccessor::GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len) {
nyamatongwe commented 2 weeks ago

The original 'xml' commits are 1 and 2.

zufuliu commented 2 weeks ago

LexAccessor-0829.patch

Changed all the four function to const and truncate endPos_ to lenDoc.

```diff diff --git a/lexlib/LexAccessor.cxx b/lexlib/LexAccessor.cxx index f0d6bdf8..f8d1ccac 100644 --- a/lexlib/LexAccessor.cxx +++ b/lexlib/LexAccessor.cxx @@ -29,10 +29,12 @@ bool LexAccessor::MatchIgnoreCase(Sci_Position pos, const char *s) { return true; } -void LexAccessor::GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len) { +void LexAccessor::GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len) const { assert(s); assert(startPos_ <= endPos_ && len != 0); + memset(s, '\0', len); endPos_ = std::min(endPos_, startPos_ + len - 1); + endPos_ = std::min(endPos_, static_cast(lenDoc)); len = endPos_ - startPos_; if (startPos_ >= static_cast(startPos) && endPos_ <= static_cast(endPos)) { const char * const p = buf + (startPos_ - startPos); @@ -40,10 +42,9 @@ void LexAccessor::GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_, char } else { pAccess->GetCharRange(s, startPos_, len); } - s[len] = '\0'; } -void LexAccessor::GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len) { +void LexAccessor::GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len) const { assert(s); GetRange(startPos_, endPos_, s, len); while (*s) { @@ -54,16 +55,18 @@ void LexAccessor::GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_ } } -std::string LexAccessor::GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_) { +std::string LexAccessor::GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_) const { assert(startPos_ < endPos_); + endPos_ = std::min(endPos_, static_cast(lenDoc)); const Sci_PositionU len = endPos_ - startPos_; std::string s(len, '\0'); GetRange(startPos_, endPos_, s.data(), len + 1); return s; } -std::string LexAccessor::GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_) { +std::string LexAccessor::GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_) const { assert(startPos_ < endPos_); + endPos_ = std::min(endPos_, static_cast(lenDoc)); const Sci_PositionU len = endPos_ - startPos_; std::string s(len, '\0'); GetRangeLowered(startPos_, endPos_, s.data(), len + 1); diff --git a/lexlib/LexAccessor.h b/lexlib/LexAccessor.h index 80df9a66..73fed173 100644 --- a/lexlib/LexAccessor.h +++ b/lexlib/LexAccessor.h @@ -116,11 +116,11 @@ public: bool MatchIgnoreCase(Sci_Position pos, const char *s); // Get first len - 1 characters in range [startPos_, endPos_). - void GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len); - void GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len); + void GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len) const; + void GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len) const; // Get all characters in range [startPos_, endPos_). - std::string GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_); - std::string GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_); + std::string GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_) const; + std::string GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_) const; char StyleAt(Sci_Position position) const { return pAccess->StyleAt(position); ```

The original 'xml' commits are 1 and 2.

Not going to change segIsScriptingIndicator(), as here is no test for space before xml (if (!IsASpace(s[t])) block is not reachable in all existing tests). though (Contains(s, "xml") block can be optimized to avoid second find(). PrintScriptingIndicatorOffset-0828.patch changes for PrintScriptingIndicatorOffset() is safe and simpler than origin code.

zufuliu commented 2 weeks ago

StyleContext-0830.patch

StyleContext::GetCurrent() can also be marked as const.


(Contains(s, "xml") block can be optimized to avoid second find()

    if (Contains(s, "php"))
        return eScriptPHP;
    {
        const size_t xml = s.find("xml");
        if (xml != std::string::npos) {
            for (size_t t = 0; t < xml; t++) {
                if (!IsASpace(s[t])) {
                    return prevValue;
                }
            }
            return eScriptXML;
        }
    }
@@ -103,7 +103,7 @@ script_type segIsScriptingIndicator(const Accessor &styler, Sci_PositionU start,
        return eScriptJS;
    if (Contains(s, "php"))
        return eScriptPHP;
-   if (Contains(s, "xml")) {
+   {
        const size_t xml = s.find("xml");
        if (xml != std::string::npos) {
            for (size_t t = 0; t < xml; t++) {
@@ -111,8 +111,8 @@ script_type segIsScriptingIndicator(const Accessor &styler, Sci_PositionU start,
                    return prevValue;
                }
            }
+           return eScriptXML;
        }
-       return eScriptXML;
    }

    return prevValue;
zufuliu commented 1 week ago

it seems better to move Contains(s, "php") and Contains(s, "xml") cases to a new function, e.g.:

Something like following, not sure whether worth the duplication (keep segIsScriptingIndicator() unchanged).

ScriptInstruction.patch

@@ -117,6 +117,16 @@ script_type segIsScriptingIndicator(const Accessor &styler, Sci_PositionU start,
    return prevValue;
 }

+script_type segIsScriptInstruction(Accessor &styler, Sci_PositionU start, bool isXml) {
+   if (styler.MatchIgnoreCase(start, "php")) {
+       return eScriptPHP;
+   }
+   if (isXml || styler.MatchIgnoreCase(start, "xml")) {
+       return eScriptXML;
+   }
+   return eScriptPHP;
+}
+
 int PrintScriptingIndicatorOffset(Accessor &styler, Sci_PositionU start) {
    return styler.MatchIgnoreCase(start, "php") ? 3 : 0;
 }
@@ -1492,7 +1502,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
        // handle the start of PHP pre-processor = Non-HTML
        else if ((ch == '<') && (chNext == '?') && IsPHPEntryState(state) && IsPHPStart(allowPHP, styler, i)) {
            beforeLanguage = scriptLanguage;
-           scriptLanguage = segIsScriptingIndicator(styler, i + 2, i + 6, isXml ? eScriptXML : eScriptPHP);
+           scriptLanguage = segIsScriptInstruction(styler, i + 2, isXml);
            if ((scriptLanguage != eScriptPHP) && (isStringState(state) || (state==SCE_H_COMMENT))) continue;
            styler.ColourTo(i - 1, StateToPrint);
            beforePreProc = state;
zufuliu commented 1 week ago

if ((scriptLanguage != eScriptPHP) && (isStringState(state) || (state==SCE_H_COMMENT))) continue; needs extra fix for <?xml or non-preprocessor inside string or comment.