ezyang / htmlpurifier

Standards compliant HTML filter written in PHP
http://htmlpurifier.org
GNU Lesser General Public License v2.1
3.07k stars 327 forks source link

Minor whitespace differences in PHP7.3 #237

Open PatrickRose opened 4 years ago

PatrickRose commented 4 years ago

We're currently updating our app to PHP 7.3, and we're finding some unit test failures that don't occur on PHP 7.1. It's reproducible just using this script

<?php

require_once 'vendor/autoload.php';

$sHTML = '<p>Paragraph 1</p><textarea></textarea> <p>Paragraph 2</p>';

$oPurifier = new HTMLPurifier(); // version 4.11.0

echo $oPurifier->purify($sHTML);

On PHP 7.1, we get <p>Paragraph 1</p><p>Paragraph 2</p>. On PHP 7.3 we get <p>Paragraph 1</p> <p>Paragraph 2</p>.

There seems to be various cases where whitespace is added where it wasn't before:

'">><marquee><img src=x onerror=confirm(1)></marquee>"></plaintext\></|\><plaintext/onmouseover=prompt(1)>
<script>prompt(1)</script>@gmail.com<isindex formaction=javascript:alert(/XSS/) type=submit>'-->"></script>
<script>alert(document.cookie)</script>">
<img/id="confirm&lpar;1)"/alt="/"src="/"onerror=eval(id)>'">
<img src="http://www.shellypalmer.com/wp-content/images/2015/07/hacked-compressor.jpg">

On PHP7.3 that adds a newline before the @gmail which isn't there in PHP 7.1

<HEAD><META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=UTF-7"> </HEAD>+ADw-SCRIPT+AD4-alert(\'XSS\');+ADw-/SCRIPT+AD4- adds a space before the beginning of the output (though the HEAD etc is stripped)

I don't think this is an issue because whitespace doesn't normally mean anything, but I don't know whether it would signify some other weirdness that people need to be aware of elsewhere?

ezyang commented 4 years ago

This is probably related to the version of libxml that is being bundled with PHP. If someone wants to read the libxml changelog that would be very helpful.

mbrodala commented 4 years ago

I can confirm this issue.

Before <br /> <br /> became <br><br>, now it becomes <br> <br>, thus with an additional space after the tag. (Thus actually correct.)

Unfortunately libxml2 is quite a mess, the official site is not updated at all and there is no updated changelog so one would need to check the difference between the latest versions.

For us the previous version was 2.9.4, now we are on 2.9.9. There are quite a few changes here ...

ezyang commented 4 years ago

Well, I would certainly agree the new behavior is "more correct". So hopefully there is nothing else to worry about; just update your unit tests and carry on...

mbrodala commented 4 years ago

Sure it's not a dealbreaker but still unexpected. But as it seems nothing this package can do about.