PH5P lexer can't handle HTML5 sectioning elements

xemlock commented 5 years ago

Even if HTML5 sectioning elements (section, nav, article, aside, header, footer) are added to the HTML definition, they are silently removed from the output when the PH5P lexer is used.

It looks like this is because the code responsible for parsing them is still in a WIP state: https://github.com/ezyang/htmlpurifier/blob/master/library/HTMLPurifier/Lexer/PH5P.php#L2788

Minimal snippet to reproduce this:

<?php

require './vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.DefinitionID', 'html5');
$config->set('Core.LexerImpl', 'PH5P');

if ($def = $config->maybeGetRawHTMLDefinition()) {
    $def->addElement('nav', 'Block', 'Flow', 'Common');
}

$purifier = new HTMLPurifier($config);

echo $purifier->purify('<div><nav>Foo</nav></div>');

The result is:

<div>Foo</div>

If you comment out the line that sets the lexer (or use any other built-in lexer), the result is correct:

<div><nav>Foo</nav></div>
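
To compare the lexers side by side, the snippet above can be wrapped in a small helper. This is only a sketch: `purifyWithLexer` is a hypothetical name, not part of HTML Purifier, and the `HTML.DefinitionRev` and `Cache.DefinitionImpl` settings are my additions so that the custom definition is rebuilt on every run instead of being read from the cache.

```php
<?php

require './vendor/autoload.php';

// Hypothetical helper (not part of HTML Purifier):
// purify $html with the named lexer implementation.
function purifyWithLexer(string $lexer, string $html): string
{
    $config = HTMLPurifier_Config::createDefault();
    $config->set('HTML.DefinitionID', 'html5');
    $config->set('HTML.DefinitionRev', 1);
    // Assumption: disable the definition cache so the raw definition
    // below is always available and never stale.
    $config->set('Cache.DefinitionImpl', null);
    $config->set('Core.LexerImpl', $lexer);

    if ($def = $config->maybeGetRawHTMLDefinition()) {
        $def->addElement('nav', 'Block', 'Flow', 'Common');
    }

    $purifier = new HTMLPurifier($config);

    return $purifier->purify($html);
}

// Print one line per lexer so the difference is easy to spot.
foreach (['DOMLex', 'DirectLex', 'PH5P'] as $lexer) {
    echo $lexer, ': ', purifyWithLexer($lexer, '<div><nav>Foo</nav></div>'), PHP_EOL;
}
```

Going by the results above, the PH5P line should be the only one where the nav element is missing.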

ezyang commented 5 years ago

I'm extremely unlikely to fix bugs in PH5P. If you can find another HTML5-compliant HTML parser that we can swap in instead of PH5P, that would be the best way to go.

bytestream commented 5 years ago

https://github.com/Masterminds/html5-php

https://github.com/ivopetkov/html5-dom-document-php

https://www.reddit.com/r/PHP/comments/9dhp9z/a_better_html5_parser_for_php/e5hxb40/

xemlock commented 5 years ago

Hi @ezyang, yes, that's perfectly understandable, especially since PH5P is marked as experimental in the source. I just wanted to raise the fact that it cannot be used as a replacement for the other lexers (especially when dealing with HTML5 tags), and needs to be used with caution.

Anyway, I think this issue can be closed as Won't Do, as the other lexer implementations (DOMLex and DirectLex) are good enough.

ezyang / htmlpurifier

PH5P lexer can't handle HTML5 sectioning elements #226