duzun / hQuery.php

An extremely fast web scraper that parses megabytes of invalid HTML in a blink of an eye. PHP5.3+, no dependencies.
https://duzun.me/playground/hquery
MIT License
361 stars, 74 forks

Big files are parsed slowly #94

Open PavelFil opened 2 months ago

PavelFil commented 2 months ago

I have a huge HTML document (about 2 MB):

<!DOCTYPE html>
<html>
<head>
</head>
<body>
    <div><div>dnbfkjsb asdhfjkashjkfhalkshdfljkhaskdj fhkajsdfkjaslflkjashdlfkhaskldfhaklsj hdflkasdfkjlhasdflkashdklfj hasdk</div></div>
    <!--Repeat row below 19000 times-->
</body>
</script>
</html>

And the request below takes 78 seconds:

    hQuery::fromHTML($html)->find('script,style');

In a browser, the equivalent query takes less than 0.2 seconds.
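
For anyone who wants to reproduce this, here is a minimal sketch that builds a document of the same shape. The function name `buildTestHtml` and the filler text are made up for illustration; only the structure (one nested `<div>` row repeated 19000 times, plus the stray `</script>`) follows the sample above.

```php
<?php
// Build a synthetic document shaped like the one in the report:
// one nested <div> row repeated $rows times inside <body>.
// The filler text is arbitrary; only the tag structure matters here.
function buildTestHtml(int $rows = 19000): string
{
    $row = "    <div><div>" . str_repeat('lorem ipsum dolor sit amet ', 4) . "</div></div>\n";

    return "<!DOCTYPE html>\n<html>\n<head>\n</head>\n<body>\n"
         . str_repeat($row, $rows)
         . "</body>\n"
         . "</script>\n" // stray closing tag, kept as in the report
         . "</html>\n";
}

$html = buildTestHtml();
printf("%d rows, %.2f MB\n", substr_count($html, '<div><div>'), strlen($html) / 1048576);
```

The resulting `$html` can then be passed to the `hQuery::fromHTML($html)->find('script,style')` call shown above and timed with `microtime(true)`.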

duzun commented 2 months ago

I was able to reproduce this synthetic test. It turns out that hQuery/Parser/HTML::parse() does not scale linearly with the number of tags in the document 🤔. In other words, the hQuery::fromHTML($html) call is affected, but ->find('script,style') is not.
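
One way to check for this kind of non-linearity is to time the parse step on inputs of doubling size. The sketch below is only an illustration: `measureScaling` and `$standIn` are hypothetical names, and the stand-in parser exists only so the snippet runs without hQuery installed.

```php
<?php
// Time $parse on inputs of increasing size and return seconds per row count.
// $parse is a placeholder: any callable that accepts an HTML string works.
function measureScaling(callable $parse, array $rowCounts): array
{
    $row = "<div><div>filler text</div></div>\n";
    $timings = [];

    foreach ($rowCounts as $rows) {
        $html = "<html><body>\n" . str_repeat($row, $rows) . "</body></html>";
        $start = microtime(true);
        $parse($html);
        $timings[$rows] = microtime(true) - $start;
    }

    return $timings;
}

// Stand-in parser (an assumption, not hQuery) so the sketch is runnable as is.
$standIn = function (string $html): int {
    return substr_count($html, '<div>');
};

foreach (measureScaling($standIn, [1000, 2000, 4000, 8000]) as $rows => $seconds) {
    printf("%5d rows: %.4f s\n", $rows, $seconds);
}
```

With a linear parser, doubling the row count should roughly double the time; growth closer to four times per doubling would point to quadratic behaviour. To measure the real thing, replace `$standIn` with a closure that calls `hQuery::fromHTML($html)`.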

I'll try to analyze the code and improve it.

Thank you for the challenge!

duzun commented 2 months ago

My intuition is that the issue lies in the heavy use of strspn and strcspn for parsing HTML. I had assumed they were very fast, but after reading the implementation I realized that each call initializes a 256-byte array, even for a short character list. This doesn't scale well.
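
To illustrate the scanning pattern in question, here is a small sketch of a tag-name scanner that hops through the input with strcspn and strspn. It is not hQuery's actual parser code; `scanTagNames` is a hypothetical helper written only to show how many calls such a loop makes.

```php
<?php
// Collect tag names by hopping through the string with strcspn()/strspn().
// Illustrative only; this is not how hQuery's parser is actually written.
function scanTagNames(string $html): array
{
    $nameChars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
    $names = [];
    $len = strlen($html);
    $pos = 0;

    while ($pos < $len) {
        // Skip everything up to the next '<'.
        $pos += strcspn($html, '<', $pos);
        if ($pos >= $len) {
            break;
        }
        $pos++; // step over '<'

        // Measure the run of tag-name characters that follows.
        $nameLen = strspn($html, $nameChars, $pos);
        if ($nameLen > 0) {
            $names[] = substr($html, $pos, $nameLen);
        }
        $pos += $nameLen;
    }

    return $names;
}

print_r(scanTagNames('<div><p>text</p></div>'));
```

A loop like this makes two calls for every `<` it meets, so a document with 19000 rows of nested tags triggers well over a hundred thousand calls, and any fixed setup cost inside each call is paid every time.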