JayBizzle / Crawler-Detect

🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent
https://crawlerdetect.io
MIT License

Regex concatenation performance concern #401

Closed Bilge closed 3 years ago

Bilge commented 4 years ago

https://github.com/JayBizzle/Crawler-Detect/blob/02b24e5d4dc347737577f48c688ee14c3b5dfd4f/src/CrawlerDetect.php#L99-L102

This mass concatenation of pattern pieces into a final regex should be precomputed by the library instead of making consuming applications perform this work on every instantiation of the CrawlerDetect object. That is, the fixture arrays should be transformed into the final regex string and output to a precomputed PHP file to save consumers from this unnecessary overhead. I suggest a PHP file because opcache can automatically cache the file for additional performance gains.

Example

patterns.php

<?php
return '/(.*Java.*outbrain| YLT|^b0t$|^bluefish ...)/i';
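A file like this would presumably be generated at release time, not written by hand. The following is a minimal sketch of such a build step, not the library's actual code: the `$patterns` array is a hypothetical stand-in for the fixture list (only the first three pieces shown above), and `compilePatterns` and `exportPatternFile` are illustrative names.

```php
<?php
// Hypothetical stand-in for the library's fixture array.
$patterns = ['.*Java.*outbrain', ' YLT', '^b0t$'];

/**
 * Join the pattern pieces into one case-insensitive alternation.
 * Assumes any "/" inside a piece is already escaped, because "/" is
 * used as the regex delimiter here.
 */
function compilePatterns(array $patterns): string
{
    return '/(' . implode('|', $patterns) . ')/i';
}

/**
 * Render PHP source that returns the regex as a string literal.
 * var_export() takes care of quoting and backslash escaping.
 */
function exportPatternFile(string $regex): string
{
    return "<?php\nreturn " . var_export($regex, true) . ";\n";
}

$regex  = compilePatterns($patterns);
$source = exportPatternFile($regex);

// A build script would then write the file once per release, e.g.:
// file_put_contents(__DIR__ . '/patterns.php', $source);
```

A consumer would then load the string with `$regex = require 'patterns.php';` and pass it straight to `preg_match()`, with no concatenation at runtime.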

JayBizzle commented 4 years ago

Again, I would like to see some benchmark figures to see what kind of improvement could be gained here.
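A rough way to gather such figures is to time the concatenation step in isolation. This is only a sketch: the synthetic pattern list, the `compile` and `timeIt` helpers, and the iteration count are all assumptions, and real numbers would have to come from the library's actual fixture arrays.

```php
<?php
// Synthetic pattern list; the real fixture arrays would be used instead.
$patterns = [];
for ($i = 0; $i < 1000; $i++) {
    $patterns[] = 'bot' . $i;
}

// The step under test: join all pieces into one alternation group.
function compile(array $patterns): string
{
    return '(' . implode('|', $patterns) . ')';
}

// Average wall-clock cost of one call to $fn, in nanoseconds.
function timeIt(callable $fn, int $iterations): float
{
    $start = hrtime(true);
    for ($i = 0; $i < $iterations; $i++) {
        $fn();
    }
    return (hrtime(true) - $start) / $iterations;
}

$nsPerCall = timeIt(function () use ($patterns) {
    compile($patterns);
}, 10000);

printf("concatenation: %.0f ns per instantiation\n", $nsPerCall);
```

Comparing this figure against the cost of a `require` of a precomputed file, with opcache both enabled and disabled, would show whether the change is worth making.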

Concatenating the regex into one precompiled regex string would also not allow us to easily test regex collisions...

https://github.com/JayBizzle/Crawler-Detect/blob/ddfeeeddc64155ee631390161024dce26652bef5/tests/UATests.php#L112-L126

Bilge commented 4 years ago

> Concatenating the regex into one precompiled regex string would also not allow us to easily test regex collisions...

That's simply not true. The precomputed string is generated from the original list, so it cannot be maintained if the original list is deleted. Ergo, the original array still exists and can still be used in the test. The two are not mutually exclusive.
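To illustrate how the two can coexist, here is a minimal sketch under stated assumptions: `$patterns` is a hypothetical stand-in for the source array, `$precomputed` stands in for the value that `require 'patterns.php'` would return, and `findCollisions` is a generic pairwise check written for this example, not a copy of the linked test.

```php
<?php
// Hypothetical source array; it stays in the repository.
$patterns = ['.*Java.*outbrain', ' YLT', '^b0t$'];

// Stand-in for the value returned by the generated patterns.php.
$precomputed = '/(.*Java.*outbrain| YLT|^b0t$)/i';

// Report every pair where one pattern also matches another pattern's text.
function findCollisions(array $patterns): array
{
    $collisions = [];
    foreach ($patterns as $i => $regex) {
        foreach ($patterns as $j => $other) {
            if ($i !== $j && preg_match('/' . $regex . '/i', stripslashes($other))) {
                $collisions[] = [$regex, $other];
            }
        }
    }
    return $collisions;
}

// Same concatenation the build step would perform.
function compilePatterns(array $patterns): string
{
    return '/(' . implode('|', $patterns) . ')/i';
}

// The collision test keeps running against the individual pieces...
$collisions = findCollisions($patterns);

// ...and a second check guards against a stale generated file.
$isFresh = ($precomputed === compilePatterns($patterns));
```

The collision test never touches the precomputed string, so it is unaffected by how consumers load the final regex.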

JayBizzle commented 4 years ago

An example would help here; it's not clear what you mean.