ScintillaOrg / lexilla

A library of language lexers for use with Scintilla
https://www.scintilla.org/Lexilla.html
Other
186 stars 67 forks source link

[HTML] Fix XML processing instruction handling #269

Closed zufuliu closed 1 month ago

zufuliu commented 2 months ago

PrintScriptingIndicatorOffset-0828.patch

First place is PrintScriptingIndicatorOffset(styler, styler.GetStartSegment() + 2, i + 6);, it can be fixed by change PrintScriptingIndicatorOffset() to following:

int PrintScriptingIndicatorOffset(Accessor &styler, Sci_PositionU start) {
    int iResult = 0;
    if (styler.MatchIgnoreCase(start, "php")) {
        iResult = 3;
    }
    return iResult;
}

segIsScriptingIndicator-0828.patch

Second place is scriptLanguage = segIsScriptingIndicator(styler, i + 2, i + 6, isXml ? eScriptXML : eScriptPHP);, as it only make sense to handle <?php and <?xml at this position (rest of segIsScriptingIndicator() is used to detect script language from language or type attribute value), it seems better to move Contains(s, "php") and Contains(s, "xml") cases to a new function, e.g.:

script_type segIsScriptingIndicator(Accessor &styler, Sci_PositionU start, bool isXml) {
    if (styler.MatchIgnoreCase(start, "php")) {
        return eScriptPHP;
    }
    if (styler.MatchIgnoreCase(start, "xml")) {
        return eScriptXML;
    }
    return isXml ? eScriptXML : eScriptPHP;
}

I don't know the purpose of checking space after xml, or any XML processing instruction with xml prefix, so removed IsASpace(). <script language="php"></script> was removed in PHP 7 (see https://wiki.php.net/rfc/remove_alternative_php_tags), so Contains(s, "php") can be removed from old segIsScriptingIndicator().

nyamatongwe commented 2 months ago

It's probably worth improving out-of-range behaviour to be more reasonable. That makes it easier to treat end of document the same as other positions. For the string-returning GetRange and GetRangeLowered, restricting the end to the end of the document should be OK.

std::string LexAccessor::GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_) {
    const Sci_PositionU endRange = std::min(endPos_, static_cast<Sci_PositionU>(lenDoc));
    assert(startPos_ < endRange);
    const Sci_PositionU len = endRange - startPos_;
    std::string s(len, '\0');
    GetRangeLowered(startPos_, endRange, s.data(), len + 1);
    return s;
}

For the char-buffer writing versions, filling the array with NUL then retrieving as much as possible may be OK.

zufuliu commented 2 months ago

or any XML processing instruction with xml prefix, so removed IsASpace().

from https://www.w3.org/TR/xml11/#sec-pi, xml prefixed processing instructions are reserved.

The target names "XML", "xml", and so on are reserved for standardization in this or future versions of this specification.

Currently here is <?xml-stylesheet ?>, https://www.w3.org/TR/xml-stylesheet/

<?xml-stylesheet href="common.css"?>
<?xml-stylesheet href="default.css" title="Default style"?>
<?xml-stylesheet alternate="yes" href="alt.css" title="Alternative style"?>
<?xml-stylesheet href="single-col.css" media="all and (max-width: 30em)"?>
<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
  <title>Example with xml-stylesheet processing instructions</title>
 </head>
 <body>
  ...
 </body>
</html>

I think it (and other xml prefixed instructions) should be handled same as <?xml version="1.0" encoding="utf-8"?>, so IsASpace() can be removed.


For the char-buffer writing versions, filling the array with NUL then retrieving as much as possible may be OK.

changes like following? it will do some cheap redundant works already done for string-returning versions.

@@ -32,7 +32,9 @@ bool LexAccessor::MatchIgnoreCase(Sci_Position pos, const char *s) {
 void LexAccessor::GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len) {
    assert(s);
    assert(startPos_ <= endPos_ && len != 0);
+   memset(s, '\0', len);
    endPos_ = std::min(endPos_, startPos_ + len - 1);
+   endPos_ = std::min(endPos_, static_cast<Sci_PositionU>(lenDoc));
    len = endPos_ - startPos_;
    if (startPos_ >= static_cast<Sci_PositionU>(startPos) && endPos_ <= static_cast<Sci_PositionU>(endPos)) {
        const char * const p = buf + (startPos_ - startPos);
@@ -40,7 +42,6 @@ void LexAccessor::GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_, char
    } else {
        pAccess->GetCharRange(s, startPos_, len);
    }
-   s[len] = '\0';
 }

 void LexAccessor::GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len) {
nyamatongwe commented 2 months ago

The original 'xml' commits are 1 and 2.

zufuliu commented 2 months ago

LexAccessor-0829.patch

Changed all the four function to const and truncate endPos_ to lenDoc.

```diff diff --git a/lexlib/LexAccessor.cxx b/lexlib/LexAccessor.cxx index f0d6bdf8..f8d1ccac 100644 --- a/lexlib/LexAccessor.cxx +++ b/lexlib/LexAccessor.cxx @@ -29,10 +29,12 @@ bool LexAccessor::MatchIgnoreCase(Sci_Position pos, const char *s) { return true; } -void LexAccessor::GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len) { +void LexAccessor::GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len) const { assert(s); assert(startPos_ <= endPos_ && len != 0); + memset(s, '\0', len); endPos_ = std::min(endPos_, startPos_ + len - 1); + endPos_ = std::min(endPos_, static_cast(lenDoc)); len = endPos_ - startPos_; if (startPos_ >= static_cast(startPos) && endPos_ <= static_cast(endPos)) { const char * const p = buf + (startPos_ - startPos); @@ -40,10 +42,9 @@ void LexAccessor::GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_, char } else { pAccess->GetCharRange(s, startPos_, len); } - s[len] = '\0'; } -void LexAccessor::GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len) { +void LexAccessor::GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len) const { assert(s); GetRange(startPos_, endPos_, s, len); while (*s) { @@ -54,16 +55,18 @@ void LexAccessor::GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_ } } -std::string LexAccessor::GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_) { +std::string LexAccessor::GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_) const { assert(startPos_ < endPos_); + endPos_ = std::min(endPos_, static_cast(lenDoc)); const Sci_PositionU len = endPos_ - startPos_; std::string s(len, '\0'); GetRange(startPos_, endPos_, s.data(), len + 1); return s; } -std::string LexAccessor::GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_) { +std::string LexAccessor::GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_) const { assert(startPos_ < endPos_); + endPos_ = std::min(endPos_, static_cast(lenDoc)); const Sci_PositionU len = endPos_ - startPos_; std::string s(len, '\0'); GetRangeLowered(startPos_, endPos_, s.data(), len + 1); diff --git a/lexlib/LexAccessor.h b/lexlib/LexAccessor.h index 80df9a66..73fed173 100644 --- a/lexlib/LexAccessor.h +++ b/lexlib/LexAccessor.h @@ -116,11 +116,11 @@ public: bool MatchIgnoreCase(Sci_Position pos, const char *s); // Get first len - 1 characters in range [startPos_, endPos_). - void GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len); - void GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len); + void GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len) const; + void GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_, char *s, Sci_PositionU len) const; // Get all characters in range [startPos_, endPos_). - std::string GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_); - std::string GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_); + std::string GetRange(Sci_PositionU startPos_, Sci_PositionU endPos_) const; + std::string GetRangeLowered(Sci_PositionU startPos_, Sci_PositionU endPos_) const; char StyleAt(Sci_Position position) const { return pAccess->StyleAt(position); ```

The original 'xml' commits are 1 and 2.

Not going to change segIsScriptingIndicator(), as here is no test for space before xml (if (!IsASpace(s[t])) block is not reachable in all existing tests). though (Contains(s, "xml") block can be optimized to avoid second find(). PrintScriptingIndicatorOffset-0828.patch changes for PrintScriptingIndicatorOffset() is safe and simpler than origin code.

zufuliu commented 2 months ago

StyleContext-0830.patch

StyleContext::GetCurrent() can also be marked as const.


(Contains(s, "xml") block can be optimized to avoid second find()

    if (Contains(s, "php"))
        return eScriptPHP;
    {
        const size_t xml = s.find("xml");
        if (xml != std::string::npos) {
            for (size_t t = 0; t < xml; t++) {
                if (!IsASpace(s[t])) {
                    return prevValue;
                }
            }
            return eScriptXML;
        }
    }
@@ -103,7 +103,7 @@ script_type segIsScriptingIndicator(const Accessor &styler, Sci_PositionU start,
        return eScriptJS;
    if (Contains(s, "php"))
        return eScriptPHP;
-   if (Contains(s, "xml")) {
+   {
        const size_t xml = s.find("xml");
        if (xml != std::string::npos) {
            for (size_t t = 0; t < xml; t++) {
@@ -111,8 +111,8 @@ script_type segIsScriptingIndicator(const Accessor &styler, Sci_PositionU start,
                    return prevValue;
                }
            }
+           return eScriptXML;
        }
-       return eScriptXML;
    }

    return prevValue;
zufuliu commented 2 months ago

it seems better to move Contains(s, "php") and Contains(s, "xml") cases to a new function, e.g.:

Something like following, not sure whether worth the duplication (keep segIsScriptingIndicator() unchanged).

ScriptInstruction.patch

@@ -117,6 +117,16 @@ script_type segIsScriptingIndicator(const Accessor &styler, Sci_PositionU start,
    return prevValue;
 }

+script_type segIsScriptInstruction(Accessor &styler, Sci_PositionU start, bool isXml) {
+   if (styler.MatchIgnoreCase(start, "php")) {
+       return eScriptPHP;
+   }
+   if (isXml || styler.MatchIgnoreCase(start, "xml")) {
+       return eScriptXML;
+   }
+   return eScriptPHP;
+}
+
 int PrintScriptingIndicatorOffset(Accessor &styler, Sci_PositionU start) {
    return styler.MatchIgnoreCase(start, "php") ? 3 : 0;
 }
@@ -1492,7 +1502,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int
        // handle the start of PHP pre-processor = Non-HTML
        else if ((ch == '<') && (chNext == '?') && IsPHPEntryState(state) && IsPHPStart(allowPHP, styler, i)) {
            beforeLanguage = scriptLanguage;
-           scriptLanguage = segIsScriptingIndicator(styler, i + 2, i + 6, isXml ? eScriptXML : eScriptPHP);
+           scriptLanguage = segIsScriptInstruction(styler, i + 2, isXml);
            if ((scriptLanguage != eScriptPHP) && (isStringState(state) || (state==SCE_H_COMMENT))) continue;
            styler.ColourTo(i - 1, StateToPrint);
            beforePreProc = state;
zufuliu commented 2 months ago

if ((scriptLanguage != eScriptPHP) && (isStringState(state) || (state==SCE_H_COMMENT))) continue; needs extra fix for <?xml or non-preprocessor inside string or comment.

zufuliu commented 1 month ago

HTML-Instruction-0924.patch

Patch updated "handle the start of PHP pre-processor = Non-HTML":

  1. Merged IsPHPStart(), segIsScriptInstruction() and PrintScriptingIndicatorOffset() into single function to avoid repeated GetRangeLowered() or MatchIgnoreCase().
  2. Highlight PHP short echo tag <?= as whole instead of <? + operator (similar to ASP <%=).
  3. Restrict eScriptXML to AnyOf(state, SCE_H_DEFAULT, SCE_H_SGML_BLOCK_DEFAULT), to avoid change scriptLanguage and beforeLanguage when <?xml not starts XML.
zufuliu commented 1 month ago

HTML-Instruction-0925.patch

Renamed enumeration ScriptInstruction to InstructionTag.

zufuliu commented 1 month ago

https://www.php.net/manual/en/language.basic-syntax.phptags.php

When PHP parses a file, it looks for opening and closing tags, which are <?php and ?> which tell PHP to start and stop interpreting the code between them. Parsing in this manner allows PHP to be embedded in all sorts of different documents, as everything outside of a pair of opening and closing tags is ignored by the PHP parser.

PHP includes a short echo tag <?= which is a short-hand to the more verbose <?php echo.

HTML-PHP-0927.patch

As PHP code is only valid inside tags, the code set initStyle = SCE_HPHP_DEFAULT; can be removed. IsPHPEntryState() can be simplified to exclude all PHP styles, the following nonsense code will be interpreted by PHP:

<!--<?=1?>-->
<script>
/*<?=2?>*/
</script>
nyamatongwe commented 1 month ago

The phpscript mode was introduced in March-April 2005 with contributions from Jan Martin Pettersen and Iago Rubio. Jan described it as

It depends on how you look at it, it's quite new in the pure scripting manner, as it came in one of the newest 4.xx versions of PHP (on the windows platform, not sure when it came on linux), but the language is exactly the same as the normal PHP, except it uses expects pure PHP, instead of html/php combined, and as such, doesn't use the <?php ?> tag either..

Trying to search for "phpscript" or quickly read PHP 4.x release notes doesn't find much of interest so I suspect it disappeared or has another name.

zufuliu commented 1 month ago

HTML-PHP-0930.patch Reverted changes for initStyle = SCE_HPHP_DEFAULT;.

zufuliu commented 1 month ago

Trying to search for "phpscript" or quickly read PHP 4.x release notes doesn't find much of interest so I suspect it disappeared or has another name.

I find one file that contains PHP code without tags: the interactive shell history file ~/.php_history, see https://www.php.net/manual/en/features.commandline.interactive.php