Masterminds / html5-php

An HTML5 parser and serializer for PHP.
http://masterminds.github.io/html5-php/

Optimize the processing of text between nodes #162

Closed stof closed 5 years ago

stof commented 5 years ago

Instead of processing text tokens one by one in the main loop, text is now processed in batches up to the next special token (`<` and `&`, which have special handling in the main loop, and NUL characters, which need to report a parse error).

https://blackfire.io/profiles/compare/8d7277d0-e2ed-40cf-b9b6-bffa6a523ae6/graph

There is a 51% improvement in that comparison.
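As a rough illustration of the batching idea (a minimal sketch, not the actual code of this PR): the function name `scanTextInBatches` and the event labels are hypothetical, and the only real API used is PHP's built-in `strcspn`, which returns the length of the run that contains none of the given characters.

```php
<?php
// Hypothetical helper for illustration only; it is not part of html5-php.
// It splits the input into text runs and the special characters between them.
function scanTextInBatches(string $data): array
{
    $events = [];
    $pos = 0;
    $len = strlen($data);

    while ($pos < $len) {
        // Length of the run starting at $pos that contains no "<", "&" or NUL.
        $run = strcspn($data, "<&\0", $pos);

        if ($run > 0) {
            // Emit the whole run as one text event instead of one per character.
            $events[] = ['text', substr($data, $pos, $run)];
            $pos += $run;
            continue;
        }

        // We are on a special character: NUL is a parse error, while "<" and "&"
        // would be handed back to the main loop in a real tokenizer.
        $char = $data[$pos];
        $events[] = [$char === "\0" ? 'parse-error' : 'special', $char];
        $pos++;
    }

    return $events;
}
```

Calling `scanTextInBatches("hello <b>")` yields a single `text` event for `hello ` before the `<` is reached, so the per-character work in PHP code is replaced by one native call per run.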

goetas commented 5 years ago

This looks promising :)

stof commented 5 years ago

php test/benchmark/run.php (current master, i.e. 182f34d)
Loading: 101.72620534897
Writing: 37.083342075348

php test/benchmark/run.php (this PR)
Loading: 69.69865322113
Writing: 37.433831691742

php test/benchmark/run_native.php (same benchmark using DOMDocument::loadHTML and DOMDocument::saveHTML instead)
Loading: 10.595810413361
Writing: 3.5749840736389

And for reference, here is the benchmark running on 2.4.0:
Loading: 127.82767772675
Writing: 37.827260494232

This is indeed quite promising (note that all my optimizations since 2.4.0 focus on the loading part only, which is why there is not much improvement on the writing side).

For loading, 2.4.0 was about 12x slower than the native parser. This PR brings it down to about 6.6x slower than the native parser.
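For anyone who wants to recompute those ratios, here is a small sketch using the loading times quoted in this thread. The `slowdown` helper is a hypothetical name introduced only for this example.

```php
<?php
// Loading times copied from the benchmark runs quoted in this thread.
$loading = [
    '2.4.0'   => 127.82767772675,
    'master'  => 101.72620534897,
    'this PR' => 69.69865322113,
];
$native = 10.595810413361; // DOMDocument::loadHTML run

// Hypothetical helper: how many times slower $time is than $baseline,
// rounded to one decimal place.
function slowdown(float $time, float $baseline): float
{
    return round($time / $baseline, 1);
}

foreach ($loading as $label => $time) {
    printf("%s: %.1fx slower than native\n", $label, slowdown($time, $native));
}
```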

goetas commented 5 years ago

Give me time to test this myself tomorrow, but it looks great! Thanks a lot!

stof commented 5 years ago

And for the first time since I started this optimization work, the DOMTreeBuilder appears in the hot path reported by Blackfire, instead of the profile being entirely dominated by the Tokenizer :smile:

stof commented 5 years ago

Note that I still have a few ideas to keep going after this one (but none as big as this one).

goetas commented 5 years ago

Now that I see it, this looks so obvious :)

"Give me time to test this myself tomorrow, but it looks great!"

impatience :)

goetas commented 5 years ago

Nice to see this benchmark:

v2.3.1

$ php test/benchmark/run.php 10
Loading: 230.20720481873

master

$ php test/benchmark/run.php 10
Loading: 66.839385032654

(php 7.2)

That is about 3.4 times faster! (And my guess is that there is still room for improvement, for example in the Tokenizer::attribute() function or by moving readUntilSequence into the scanner class.)