ezyang / htmlpurifier

Standards compliant HTML filter written in PHP
http://htmlpurifier.org
GNU Lesser General Public License v2.1
3.07k stars 327 forks source link

Support for conditional HTML commenting e.g. `<!--[if mso]>` #379

Open mitchelljy opened 1 year ago

mitchelljy commented 1 year ago

The syntax specified here for conditional Outlook CSS does not seem to be compatible with HTMLPurifier, everything inside the conditional comment will be removed.

The simplified version of the syntax I'm trying to use is:

<!--[if mso]>
   <a href="https://swifthalf.com/" target="_blank">
      conditional link
   </a>
<![endif]-->
<!--[if !mso]>
  <!-- -->
    <a href="https://swifthalf.com"  target="_blank">
        default link
    </a>
  <!--
<![endif]-->

As these are HTML comments, I initially thought I might be able to use the HTML.AllowedComments config option, however it does not have an option to allow ALL comments. The exact comment body has to be specified, which will not work for this use case as the content is not always known.

I next tried HTML.AllowedCommentsRegexp with a regexp that should allow everything:

$config->set('HTML.AllowedCommentsRegexp', '/^.*?$/');

However this also does not work. It seems that in the <!--[if mso]> comment marker, the non-standard [if mso] seems to cause the comment to be removed regardless.

Is there a way to support this syntax currently or is it not possible?

Thank you for any help!

bytestream commented 1 year ago

This is because of https://github.com/ezyang/htmlpurifier/blob/6eb6123036835e4d60f4afcace5ccd9fdbab5cd7/library/HTMLPurifier/Lexer.php#L277

Conditional comments are only supported by IE 5 - 9, so lack of support only really affects a tiny minority of people?

mitchelljy commented 1 year ago

This is because of

https://github.com/ezyang/htmlpurifier/blob/6eb6123036835e4d60f4afcace5ccd9fdbab5cd7/library/HTMLPurifier/Lexer.php#L277

Conditional comments are only supported by IE 5 - 9, so lack of support only really affects a tiny minority of people?

These conditionals are automatically inserted by a service that some of our users use to build email HTML documents. I'm guessing the service is including these conditionals in order to support older email clients.

It would be really nice if these conditional comments didn't unavoidably break a HTML document that's been passed through HTMLPurifier, as it seems they are still in use sometimes in modern browsers and email clients for maximum backwards compatibility.

If it's easy enough to implement, could that be made optional?

fkoyer commented 7 months ago

Conditional comments are still used extensively in email to target Outlook clients.

Why do we even need to remove these? If you simply remove all comments, then conditional comments will get removed also. The problem with removeIEConditional is that it removes too much. Consider this example:

<!--[if mso]>
    This is inside an HTML comment. It will be displayed by Outlook only
<!-- <![endif]-->
<!--[if !mso]><!---->
    This is not inside an HTML comment. It will be displayed by all non-Outlook email clients
<!-- <![endif]-->

Unfortunately, HTMLPurifier removes both of these so neither one is displayed. I suggest getting rid of removeIEConditional altogether but if it's absolutely necessary then at least modify the regex like so

diff --git a/library/HTMLPurifier/Lexer.php b/library/HTMLPurifier/Lexer.php
index 1f552a17..bbe47769 100644
--- a/library/HTMLPurifier/Lexer.php
+++ b/library/HTMLPurifier/Lexer.php
@@ -277,7 +277,7 @@ class HTMLPurifier_Lexer
     protected static function removeIEConditional($string)
     {
         return preg_replace(
-            '#<!--\[if [^>]+\]>.*?<!\[endif\]-->#si', // probably should generalize for all strings
+            '#<!--\[if (?!!)[^>]+\]>.*?<!\[endif\]-->#si', // probably should generalize for all strings
             '',
             $string
         );

Currently I'm working around the issue by removing all HTML comments before passing to HTMLPurifier.

ezyang commented 6 months ago

I'm unlikely to ever merge PRs that add support for this, because the conditional comment syntax is not well specified and I don't have time to exactly reverse engineer how it is actually implemented to ensure it can't be used as the basis for an XSS bypass.

fkoyer commented 6 months ago

I agree with you. That's why I think the best course of action is to eliminate removeIEConditional altogether. If you really want to remove conditional comments then you'd have to follow the spec to parse and remove them properly. The current method of removing conditional comments is flawed and actually does more harm than good.

It's not necessary to remove conditional comments because HTMLPurifier removes all comments by default. Conditional comments are just specially formatted HTML comments. I submitted PR #401 that includes a unit test to demonstrate that conditional comments are converted to objects of HTMLPurifier_Token_Comment.

By doing it this way:

The title of this issue could be changed from Support for conditional HTML commenting to Remove special handling of conditional HTML commenting

fkoyer commented 6 months ago

Thanks for merging the PR. I think that fixes the issue. Although conditional comments are still not supported by HTMLPurifier, users can opt to allow them with this regexp:

$config->set('HTML.AllowedCommentsRegexp', '/\[(if|endif)\b/');