Masterminds / html5-php

An HTML5 parser and serializer for PHP.
http://masterminds.github.io/html5-php/

Parsing document with a lot of HTML tags is slow #181

Open alecpl opened 4 years ago

alecpl commented 4 years ago

I have a script that generates an HTML sample that is ~1.5MB in size. It emulates a real-world example. Then I parse it.

$html = '<HTML><BODY>';
$lines = 20000;
while ($lines--) {
    $html .= '<P DIR=LTR><SPAN LANG="en-gb"><FONT FACE="Consolas">&gt;&gt; </FONT></SPAN></P>';
}

$html5 = new Masterminds\HTML5();
$node  = $html5->loadHTML($html);

and here's the result:

PHP Fatal error:  Maximum execution time of 120 seconds exceeded in vendor/masterminds/html5/src/HTML5/Parser/DOMTreeBuilder.php on line 433
PHP Stack trace:
PHP   1. {main}() test.php:0
PHP   2. Masterminds\HTML5->loadHTML() test.php:23
PHP   3. Masterminds\HTML5->parse() vendor/masterminds/html5/src/HTML5.php:98
PHP   4. Masterminds\HTML5\Parser\Tokenizer->parse() vendor/masterminds/html5/src/HTML5.php:174
PHP   5. Masterminds\HTML5\Parser\Tokenizer->consumeData() vendor/masterminds/html5/src/HTML5/Parser/Tokenizer.php:89
PHP   6. Masterminds\HTML5\Parser\Tokenizer->tagOpen() vendor/masterminds/html5/src/HTML5/Parser/Tokenizer.php:132
PHP   7. Masterminds\HTML5\Parser\Tokenizer->tagName() vendor/masterminds/html5/src/HTML5/Parser/Tokenizer.php:284
PHP   8. Masterminds\HTML5\Parser\DOMTreeBuilder->startTag() vendor/masterminds/html5/src/HTML5/Parser/Tokenizer.php:388

I tested this with 2.7.0 and some older versions with no success. A sample half that size does finish, but it takes 27 seconds, so the running time is not linear in the input size.

Cross-ref: https://github.com/roundcube/roundcubemail/issues/7331

goetas commented 4 years ago

Have you tried to debug it with Blackfire or some other profiler?

alecpl commented 4 years ago

I haven't yet, but I can add that the specific content is not that important; the number of tags is. So it looks like this library has a problem with parsing big HTML pages. FYI, DOMDocument parses the sample in less than a second.

alecpl commented 4 years ago

I'm not sure how useful this is, but here's an Xdebug profile of a smaller sample. Sorry for the Polish, but forcing English in KCacheGrind didn't work. [image: xdebug]

alecpl commented 4 years ago

So, it looks like DOMElement::appendChild() is the main bottleneck. Here are some performance stats showing how the number of tags makes a difference. Measured on PHP 7.4.

Tags  |  Time
---------------
10k   |   1.3s
20k   |   3.3s
30k   |   7.9s
40k   |  16.4s
50k   |  28.3s
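
As a rough check on how steep that growth is, the timings can be fitted on a log-log scale. This is a minimal sketch and not part of the original report: the function name growthExponent is hypothetical, and the data points are copied from the table above.

```php
<?php
// Timings copied from the table above: tag count => seconds (PHP 7.4).
$timings = [
    10000 => 1.3,
    20000 => 3.3,
    30000 => 7.9,
    40000 => 16.4,
    50000 => 28.3,
];

/**
 * Hypothetical helper: least-squares slope of log(seconds) against log(tags).
 * A slope near 1 suggests linear growth, a slope near 2 suggests quadratic growth.
 */
function growthExponent(array $timings): float
{
    $n = count($timings);
    $sumX = 0.0;
    $sumY = 0.0;
    $sumXY = 0.0;
    $sumXX = 0.0;

    foreach ($timings as $tags => $seconds) {
        $x = log($tags);
        $y = log($seconds);
        $sumX += $x;
        $sumY += $y;
        $sumXY += $x * $y;
        $sumXX += $x * $x;
    }

    // Standard closed-form slope of a simple linear regression.
    return ($n * $sumXY - $sumX * $sumY) / ($n * $sumXX - $sumX ** 2);
}

printf("estimated growth exponent: %.2f\n", growthExponent($timings));
```

A slope near 1 would indicate linear scaling. For these five points the fit lands a little below 2, which is consistent with roughly quadratic behaviour.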

goetas commented 4 years ago

Can you try to benchmark appendChild alone and see if it slows down after a certain number of tags?

alecpl commented 4 years ago

Nope, it's the other way round: the more tags, the better the time per tag. What's more, the following script is blazingly fast (under 1 second).

$doc = new DOMDocument;
$body = $doc->createElement("body");
$doc->appendChild($body);
$lines = 100000;
while ($lines--) {
    $p = $doc->createElement("p");
    $body->appendChild($p);
    $span = $doc->createElement("span");
    $p->appendChild($span);
    $font = $doc->createElement("font");
    $span->appendChild($font);
}

goetas commented 4 years ago

[image]

goetas commented 4 years ago

Hmm, weird...

goetas commented 4 years ago

~The bottleneck seems to be autoclose(); with that removed, the script completes in 3s~ Never mind.

goetas commented 4 years ago

This turned out to be a PHP issue that can be worked around by doing:

$html5 = new Masterminds\HTML5([
    'disable_html_ns' => true
]);
$node  = $html5->loadHTML($html);

The perf issue was introduced by https://github.com/php/php-src/blob/35e0a91db717fe441a89ca9554d8843d8ee63112/ext/dom/php_dom.c and https://github.com/php/php-src/commit/84b90f639d09f002ed50c87877b62615e928b88b
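
To check the namespace connection outside the library, one could time plain DOM construction with and without a namespace. The following is a minimal sketch under assumptions: buildTree is a hypothetical helper, the XHTML namespace URI is the standard one, and the idea that the library's default mode corresponds to createElementNS() is inferred from the disable_html_ns option name and is not confirmed in this thread. No timings are claimed here; results will depend on the PHP version.

```php
<?php
/**
 * Hypothetical helper: builds $count <p><span><font> chains under <body>
 * and returns the elapsed time in seconds.
 * With $ns === null the elements are created without a namespace.
 */
function buildTree(int $count, ?string $ns): float
{
    $doc = new DOMDocument();

    // Create an element either plainly or in the given namespace.
    $make = static function (string $name) use ($doc, $ns): DOMElement {
        return $ns === null
            ? $doc->createElement($name)
            : $doc->createElementNS($ns, $name);
    };

    $start = microtime(true);

    $body = $make('body');
    $doc->appendChild($body);

    for ($i = 0; $i < $count; $i++) {
        $p = $make('p');
        $body->appendChild($p);
        $span = $make('span');
        $p->appendChild($span);
        $font = $make('font');
        $span->appendChild($font);
    }

    return microtime(true) - $start;
}

$xhtml = 'http://www.w3.org/1999/xhtml';

// Small counts keep the run short even if one variant scales badly.
foreach ([1000, 2000, 4000] as $count) {
    printf(
        "%6d  plain: %.3fs  namespaced: %.3fs\n",
        $count,
        buildTree($count, null),
        buildTree($count, $xhtml)
    );
}
```

If the namespaced column grows much faster than the plain one as the count doubles, that would point at namespace handling in ext/dom and not at the parser itself.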

alecpl commented 4 years ago

Thanks for the workaround. With it, my initial test script takes 8 seconds, which is not that bad. DOMDocument needs 0.3 seconds.

Did you already create a ticket in PHP's bug tracker?

steinmb commented 8 months ago

This was listed by xhprof with PHP 8.3.2-1. Is this a thing, or should I look elsewhere?

[image: Screenshot 2024-02-23 at 12 29 38]