ezyang / htmlpurifier

Standards compliant HTML filter written in PHP
http://htmlpurifier.org
GNU Lesser General Public License v2.1

MakeWellFormed strategy when attempting to fix invalid markup messes it up even more #258

Open xemlock opened 4 years ago

xemlock commented 4 years ago

When the MakeWellFormed strategy attempts to fix invalid markup, it mangles it even further. Consider the following test script:

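<?php
// Assumes the HTML Purifier library directory is on the include path.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);
// Invalid nesting: a <ul> inside an inline <i> inside a <p>.
echo $purifier->purify('<p><i><ul><li>text</li></ul></i></p>');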

One would expect the output:

<p><i></i></p><ul><li>text</li></ul>

or ideally:

<p><i></i></p><ul><li><i>text</i></li></ul>

Instead we get:

<p><i></i></p><i>text</i>

Tested against HTMLPurifier 4.12.0.

After some digging, I found that setting the $formatting property to false on the <i> element definition in the Presentation module helps a little: the <ul> structure is retained. The drawback is that carrying the <i> element over into the following content no longer works.

This suggests that the tree-fixing algorithm in HTMLPurifier_Strategy_MakeWellFormed::execute() requires some tuning.

ezyang commented 4 years ago

A better fix will probably be to use the PH5P parser, which will get you more HTML5-compliant parsing. I don't intend to fix up MakeWellFormed to bring it closer to HTML5 behavior; it's an evolutionary dead end.
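
For anyone who wants to try this suggestion, here is a minimal sketch of switching the lexer through the Core.LexerImpl configuration directive. The Composer autoload path is an assumption about the local setup, and the output for the markup in this report has not been verified here, so treat it as a starting point rather than a confirmed fix. PH5P is documented as experimental and needs PHP's DOM extension.

```php
<?php
// Assumption: HTML Purifier was installed via Composer; adjust the path
// (or require HTMLPurifier.auto.php instead) to match your setup.
require_once __DIR__ . '/vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();

// Select the experimental PH5P (HTML5-style) lexer instead of the
// auto-detected default.
$config->set('Core.LexerImpl', 'PH5P');

$purifier = new HTMLPurifier($config);

// Same input as in the original report, so the results can be compared.
echo $purifier->purify('<p><i><ul><li>text</li></ul></i></p>');
```

Only the lexer is swapped; the rest of the configuration is left at its defaults, so any difference in the output comes from parsing alone.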