ScintillaOrg / lexilla

A library of language lexers for use with Scintilla
https://www.scintilla.org/Lexilla.html
Other
179 stars 64 forks source link

[HTML, XML] Don't interpret `<?` and `<%` inside CDATA section #252

Closed zufuliu closed 2 months ago

zufuliu commented 4 months ago

Created for https://sourceforge.net/p/scintilla/bugs/1078/.

Per https://html.spec.whatwg.org/multipage/syntax.html#cdata-sections and https://www.w3.org/TR/xml11/#sec-cdata-sect, CDATA section is a literal block and only the closing ]]> is recognized. I think here are two fixes:

  1. Only interpret <? inside CDATA section for SCLEX_PHPSCRIPT.
  2. Don't interpret <? inside XML CDATA section (original bug), let SCLEX_HTML as is.
mpheath commented 4 months ago

There is also <% in XML CDATA causing issues.

```diff diff --git a/lexers/LexHTML.cxx b/lexers/LexHTML.cxx index 4a5e092d..8a97d90e 100644 --- a/lexers/LexHTML.cxx +++ b/lexers/LexHTML.cxx @@ -1390,6 +1390,10 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int else if (isMako && makoComment) { } + // Ignore everything in XML CDATA until ]]> + else if (isXml && state == SCE_H_CDATA) { + } + // generic end of script processing else if ((inScriptType == eNonHtmlScript) && (ch == '<') && (chNext == '/')) { // Check if it's the end of the script tag (or any other HTML tag) ``` Check for XML CDATA before processing the other conditions seems to solve both `
zufuliu commented 4 months ago

Above change only fix the bug for XML (SCLEX_XML), not client HTML (SCLEX_HTML and SCLEX_PHPSCRIPT). I think we can add two properties to fix the bug for HTML:

  1. lexer.html.php_tag (default on, only used for SCLEX_HTML), when enabled will interpret <?.
  2. lexer.html.asp_tag (default on, not used for SCLEX_XML), when enabled will interpret <%. ASP tag was removed from PHP 7, see https://wiki.php.net/rfc/remove_alternative_php_tags and https://wiki.php.net/rfc/deprecate_php_short_tags
nyamatongwe commented 4 months ago

@zufuliu Only interpret <? inside CDATA section for SCLEX_PHPSCRIPT

A quick search for PHP examples shows PHP embedded in web pages, not stand-alone PHP (SCLEX_PHPSCRIPT) which is the unusual case. For web pages, it is a text preprocessor over HTML so takes precedence for SCLEX_HTML. It is less commonly used for XML but may be used in the same way to produce data so is active for SCLEX_XML.

@zufuliu Don't interpret <? inside XML CDATA section (original bug), let SCLEX_HTML as is.

PHP generated XML is also useful.

There could be options to turn off pre-processors.

zufuliu commented 4 months ago

There could be options to turn off pre-processors.

I still think these options should default off for XML, as using PHP (or server-side technology) to preprocess XML is really rare (MIME configurations), otherwise most if not all use of SCLEX_XML will need to turn them off.

nyamatongwe commented 4 months ago

I still think these options should default off for XML

It is not reasonable to break current applications.

mpheath commented 4 months ago

Please try again.

Issue252abc.zip

Both options default to `1` so is not a breaking change. If set to `0` then CDATA content is a single style until `]]>` is reached. ```diff diff --git a/lexers/LexHTML.cxx b/lexers/LexHTML.cxx index 4a5e092d..2089bfe7 100644 --- a/lexers/LexHTML.cxx +++ b/lexers/LexHTML.cxx @@ -724,6 +724,8 @@ struct OptionsHTML { bool allowScripts = true; bool isMako = false; bool isDjango = false; + bool styleHTMLCDATA = true; + bool styleXMLCDATA = true; bool fold = false; bool foldHTML = false; bool foldHTMLPreprocessor = true; @@ -767,6 +769,12 @@ struct OptionSetHTML : public OptionSet { DefineProperty("lexer.xml.allow.scripts", &OptionsHTML::allowScripts, "Set to 0 to disable scripts in XML."); + DefineProperty("lexer.html.cdata.tag", &OptionsHTML::styleHTMLCDATA, + "Set to 0 to disable styling within CDATA tags."); + + DefineProperty("lexer.xml.cdata.tag", &OptionsHTML::styleXMLCDATA, + "Set to 0 to disable styling within CDATA tags."); + DefineProperty("lexer.html.mako", &OptionsHTML::isMako, "Set to 1 to enable the mako template language."); @@ -1221,6 +1229,7 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int const bool allowScripts = options.allowScripts; const bool isMako = options.isMako; const bool isDjango = options.isDjango; + const bool styleCDATA = (isXml ? options.styleXMLCDATA : options.styleHTMLCDATA); const CharacterSet setHTMLWord(CharacterSet::setAlphaNum, ".-_:!#", true); const CharacterSet setTagContinue(CharacterSet::setAlphaNum, ".-_:!#[", true); const CharacterSet setAttributeContinue(CharacterSet::setAlphaNum, ".-_:!#/", true); @@ -1390,6 +1399,9 @@ void SCI_METHOD LexerHTML::Lex(Sci_PositionU startPos, Sci_Position length, int else if (isMako && makoComment) { } + else if (state == SCE_H_CDATA && !styleCDATA) { + } + // generic end of script processing else if ((inScriptType == eNonHtmlScript) && (ch == '<') && (chNext == '/')) { // Check if it's the end of the script tag (or any other HTML tag) ``` Here is a lua on double click function that can be added to SciTEStartup.lua to help see the difference when double clicking on the edit pane. ```lua function OnDoubleClick() if editor:GetPropertyInt('lexer.html.cdata.tag', 1) == 1 then editor.Property['lexer.html.cdata.tag'] = '0' editor.Property['lexer.xml.cdata.tag'] = '0' else editor.Property['lexer.html.cdata.tag'] = '1' editor.Property['lexer.xml.cdata.tag'] = '1' end end ``` Note that the `div` tags in the html file display as red. If `div` is later added to the keyword list in SciTE.properties then that may cause TestLexers to create new styled files.
zufuliu commented 4 months ago
  1. lexer.html.php_tag (default on, only used for SCLEX_HTML), when enabled will interpret <?.

This is complex than I thought, <? in HTML is treated as comment by browser, see https://html.spec.whatwg.org/multipage/parsing.html#parse-error-unexpected-question-mark-instead-of-tag-name

There could be options to turn off pre-processors.

Above patch only fixed CDATA (original bug), I think we can add options to turn off pre-processors globally:

  1. For XML, <? can be PHP start tag, XML instruction or error. <% is ASP (or template) start tag or error.
  2. For HTML, <? can be PHP start tag or comment start. <% is ASP (or template) start tag or element text.
nyamatongwe commented 4 months ago

lexer.html.cdata.tag / lexer.xml.cdata.tag is confusing to me: its treating the file as ASP/PHP apart from a small exception of CDATA. ASP and PHP may have some understanding of the HTML/XML elements and so may special-case CDATA but that should be checked in the ASP and PHP documentation or with examples.

However, the motivating example in notepad-plus-plus/notepad-plus-plus#14576 does not appear to be intended for ASP or PHP processing and would be better treated as basic unprocessed XML.

Perhaps the desire here is to avoid splitting file types into with/without preprocessing so the user doesn't have to deal with this choice.

zufuliu commented 4 months ago

PHP does not understand HTML/XML elements: https://www.php.net/manual/en/language.basic-syntax.phptags.php

When PHP parses a file, it looks for opening and closing tags, which are <?php and ?> which tell PHP to start and stop interpreting the code between them. Parsing in this manner allows PHP to be embedded in all sorts of different documents, as everything outside of a pair of opening and closing tags is ignored by the PHP parser.

ASP may (not test) understand HTML elements as it need to parse asp: prefixed tag and runas attribute on HTML element.

nyamatongwe commented 4 months ago

PHP control could have 3 states since <?php is much less likely to mean something different than <? although it does in this case.

  1. disabled
  2. <?php enabled
  3. <?php and <? enabled

For ASP, the character just after <% could be checked as that decides between different ASP modes. From https://learn.microsoft.com/en-us/troubleshoot/developer/webapps/aspnet/development/inline-expressions it is unclear whether whitespace is required for embedded code blocks. There are other languages like JSP (and derivatives) that also use <% and currently work because of that similarity so could affect whether this is reasonable.

nyamatongwe commented 3 months ago

Here is a patch that implements lexer.xml.allow.php and lexer.html.allow.php with three states:

Set to 0 to disable PHP in HTML, 1 to accept <?php, 2 to also accept <?. The default is 2.

allowPHP.patch.txt

zufuliu commented 3 months ago

if (allowPHP == AllowPHP::PHP) should also allow short echo tag: <?= ?>. https://www.php.net/manual/en/language.basic-syntax.phptags.php

As short tags can be disabled it is recommended to only use the normal tags (<?php ?> and <?= ?>) to maximise compatibility.

nyamatongwe commented 3 months ago

@nyamatongwe For ASP, the character just after <% could be checked ... it is unclear whether whitespace is required for embedded code blocks.

Whitespace is not required and ASP examples with '<%` followed immediately by a call is common.

zufuliu commented 3 months ago

Needs update segIsScriptingIndicator() for following code: https://github.com/ScintillaOrg/lexilla/blob/a6f1998ac753fcb6faf7d66773fde1f3b513934e/lexers/LexHTML.cxx#L104-L105

PHP script tag was removed in PHP 7, https://wiki.php.net/rfc/remove_alternative_php_tags

This RFC proposes the removal of ASP tags (<%) and script tags (<script language=php>) as a means of entering or leaving PHP mode.