Open alecpl opened 4 years ago
Have you tried to debug it with Blackfire or some other profiler?
Not yet, but I can add that the specific content is not that important; the number of tags is. So it looks like this library has a problem parsing big HTML pages. FYI, DOMDocument parses the sample in less than a second.
I'm not sure how useful this is, but here's an xdebug profile of a smaller sample. Sorry for the Polish labels, forcing English in KCacheGrind didn't work.
So it looks like DOMElement::appendChild() is the main bottleneck. Here are some performance stats showing how the number of tags makes a difference (PHP 7.4).
| Tags | Time  |
|------|-------|
| 10k  | 1.3s  |
| 20k  | 3.3s  |
| 30k  | 7.9s  |
| 40k  | 16.4s |
| 50k  | 28.3s |
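To put a number on how non-linear this is, here is a rough sketch that takes the timings from the table above and estimates the exponent k in time ~ tags^k from the first and last rows. The helper name `growthExponent` is made up for this illustration, and two data points only give a crude estimate.

```php
<?php
// Timings copied from the table above: tag count => seconds.
$timings = [10000 => 1.3, 20000 => 3.3, 30000 => 7.9, 40000 => 16.4, 50000 => 28.3];

// Hypothetical helper: estimate k in time ~ tags^k from two measurements
// (the slope between the two points on a log-log scale).
function growthExponent(int $n1, float $t1, int $n2, float $t2): float
{
    return log($t2 / $t1) / log($n2 / $n1);
}

// Cost per tag; this would stay flat if parsing were linear.
foreach ($timings as $tags => $seconds) {
    printf("%6d tags: %.3f ms per tag\n", $tags, $seconds / $tags * 1000);
}

$exponent = growthExponent(10000, 1.3, 50000, 28.3);
printf("estimated exponent: %.2f\n", $exponent);
```

The per-tag cost rises with the tag count and the estimated exponent comes out close to 2, so the behaviour looks roughly quadratic rather than linear.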
Can you try to benchmark appendChild alone and see if it slows down after a certain number of tags?
Nope, it's the other way round (more tags, better time per tag). What's more, the following script is blazingly fast (under 1 second):
$doc = new DOMDocument;
$body = $doc->createElement("body");
$doc->appendChild($body);
$lines = 100000;
while ($lines--) {
    $p = $doc->createElement("p");
    $body->appendChild($p);
    $span = $doc->createElement("span");
    $p->appendChild($span);
    $font = $doc->createElement("font");
    $span->appendChild($font);
}
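To check whether plain appendChild degrades as the document grows, the same loop can be timed at several sizes. This is only a sketch: `timeAppendChild` is a name invented here, the sizes are arbitrary, and it measures DOMDocument on its own, without the HTML5 parser involved.

```php
<?php
// Hypothetical helper: build $count <p><span><font> chains under <body>
// and return the elapsed wall-clock seconds.
function timeAppendChild(int $count): float
{
    $doc = new DOMDocument();
    $body = $doc->createElement('body');
    $doc->appendChild($body);

    $start = microtime(true);
    for ($i = 0; $i < $count; $i++) {
        $p = $doc->createElement('p');
        $body->appendChild($p);
        $span = $doc->createElement('span');
        $p->appendChild($span);
        $font = $doc->createElement('font');
        $span->appendChild($font);
    }
    return microtime(true) - $start;
}

foreach ([10000, 50000, 100000] as $count) {
    $seconds = timeAppendChild($count);
    // Three elements are appended per iteration.
    printf("%6d iterations: %.3fs total, %.2f us per append\n",
        $count, $seconds, $seconds / ($count * 3) * 1e6);
}
```

If the time per append stays roughly constant across sizes, appendChild by itself is not the source of the slowdown.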
Hmm, weird...
~The bottleneck seems to be autoclose()... by removing that, the script completes in 3s~
NVM
This turned out to be a PHP issue that can be worked around by doing:
$html5 = new Masterminds\HTML5([
    'disable_html_ns' => true
]);
$node = $html5->loadHTML($html);
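A quick way to see what the workaround buys you is to time the same input with and without the option. The sketch below assumes masterminds/html5 is installed through Composer and already autoloadable; `makeSample` and `timeParse` are hypothetical helpers written for this comparison, and the generated markup is only a stand-in for a real-world page.

```php
<?php
// Hypothetical generator: $groups repeated <p><span><font> groups in one <body>.
function makeSample(int $groups): string
{
    return '<!DOCTYPE html><html><body>'
        . str_repeat('<p><span><font>text</font></span></p>', $groups)
        . '</body></html>';
}

// Hypothetical helper: run $parse on $html and return elapsed wall-clock seconds.
function timeParse(callable $parse, string $html): float
{
    $start = microtime(true);
    $parse($html);
    return microtime(true) - $start;
}

// Only runs when the library can be autoloaded (assumed Composer setup).
if (class_exists('Masterminds\HTML5')) {
    $html = makeSample(10000);
    foreach ([false, true] as $disableNs) {
        $html5 = new Masterminds\HTML5(['disable_html_ns' => $disableNs]);
        $seconds = timeParse(fn (string $h) => $html5->loadHTML($h), $html);
        printf("disable_html_ns=%s: %.2fs\n", var_export($disableNs, true), $seconds);
    }
}
```

Note that with the option enabled the elements are created without the HTML namespace, so code that relies on namespaced nodes may need adjusting.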
The perf issue was introduced by https://github.com/php/php-src/blob/35e0a91db717fe441a89ca9554d8843d8ee63112/ext/dom/php_dom.c and https://github.com/php/php-src/commit/84b90f639d09f002ed50c87877b62615e928b88b
Thanks for the workaround. With it, my initial test script takes 8 seconds, which is not that bad. DOMDocument needs 0.3 seconds.
Did you already create a ticket in PHP's bugtracker?
This was listed by xhprof with PHP 8.3.2-1. Is this a known issue, or should I look elsewhere?
I have a script that generates an HTML sample that is ~1.5MB in size. It emulates a real-world example. Then I parse it.
and here's the result:
I tested this with 2.7.0 and some older versions with no success. A sample half that size works, but it takes 27 seconds to finish (so the slowdown is not linear).
Cross-ref: https://github.com/roundcube/roundcubemail/issues/7331