JayBizzle / Crawler-Detect

🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent
https://crawlerdetect.io
MIT License

The ability to extend crawlers list? #309

Closed godstanis closed 5 years ago

godstanis commented 5 years ago

Hi, I really appreciate your work, guys; the package is great.

But why is there no way to manually extend the default crawlers list? For example, I need to add "Sendsay.Ru/1.0; https://Sendsay.Ru/; ask@sendsay.ru" (a niche Russian crawler), but I have no option other than opening an issue here. It would be really great if we could have some kind of configuration, or at least have the Crawlers list injected into the CrawlerDetect class via the constructor, so we could extend it and inject our custom Crawlers implementation without rewriting the whole constructor.

Thanks)

gplumb commented 5 years ago

Adding additional strings via a constructor could be unsafe if there is a collision with one of the existing regexes.

It also means the community can't benefit from the additional crawler strings :-(

This project is very receptive to Pull Requests though... :-)

JayBizzle commented 5 years ago

Hi,

As @gplumb pointed out, we have deliberately avoided making the crawlers list extendable to hopefully encourage PRs which ultimately help the whole community.

We look forward to receiving your PR :+1:

godstanis commented 5 years ago

Thanks for the response.

@gplumb I didn't mean actually passing strings into the constructor) I meant just changing it to something like this:

// CrawlerDetect.php
/**
 * Class constructor.
 */
public function __construct(array $headers = null, $userAgent = null, Crawlers $crawlers = null)
{
    if ($crawlers === null) {
        $this->crawlers = new Crawlers();
    } else {
        $this->crawlers = $crawlers;
    }
    //...

could be unsafe if there is a collision

Patterns are combined via '|', so I don't see any problem with it. Could you explain what kind of collisions you are worried about?
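(To illustrate the point about '|': a tiny self-contained sketch of how joined patterns behave. The pattern strings below are just examples, not the library's actual list.)

```php
<?php
// Patterns joined with '|' form a single alternation, so an extra,
// non-overlapping pattern is simply one more branch in the regex.

$patterns   = ['Googlebot', 'bingbot'];
$patterns[] = 'Sendsay\.Ru';  // the niche crawler from this issue

$regex = '/(' . implode('|', $patterns) . ')/i';

var_dump((bool) preg_match($regex, 'Sendsay.Ru/1.0; https://Sendsay.Ru/')); // bool(true)
var_dump((bool) preg_match($regex, 'Mozilla/5.0 (Windows NT 10.0) Firefox/68.0')); // bool(false)
```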

@JayBizzle

we have deliberately avoided making the crawlers list extendable to hopefully encourage PRs

@gplumb mentioned collision problems; are they still relevant?

MaxGiting commented 5 years ago

@stasgar With your suggested code, would you duplicate the entire Crawlers class?

There is no issue with regex collisions. We only test for them so we can speed up scanning. No point checking something twice.

If you would like to get a new user agent into this library, we encourage PRs so everyone can benefit. They are normally merged and released within a few days at most.

gplumb commented 5 years ago

My original concern about regex collisions was based on an assumption about how the constructor code might look, something that actually isn't the case, having seen @stasgar's code :-)

godstanis commented 5 years ago

@MaxGiting I would extend the Crawlers class and override its getAll method.
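To make that concrete, here is a minimal runnable sketch of the proposed injection, using stand-in stubs for the real Jaybizzle\CrawlerDetect classes (the class shapes and method bodies here are assumptions for illustration, not the library's actual implementation):

```php
<?php
// Stand-in for the library's Crawlers provider (hypothetical minimal version).
class Crawlers
{
    /** @return string[] regex fragments for known crawlers */
    public function getAll(): array
    {
        return ['Googlebot', 'bingbot'];
    }
}

// User-land subclass: append a niche crawler without touching upstream.
class MyCrawlers extends Crawlers
{
    public function getAll(): array
    {
        return array_merge(parent::getAll(), ['Sendsay\.Ru']);
    }
}

// Stand-in for the detector, with the constructor change proposed above.
class CrawlerDetect
{
    private $crawlers;

    public function __construct(Crawlers $crawlers = null)
    {
        $this->crawlers = $crawlers !== null ? $crawlers : new Crawlers();
    }

    public function isCrawler(string $userAgent): bool
    {
        $pattern = '/(' . implode('|', $this->crawlers->getAll()) . ')/i';

        return (bool) preg_match($pattern, $userAgent);
    }
}

// With the custom Crawlers injected, the niche UA is detected:
$detect = new CrawlerDetect(new MyCrawlers());
var_dump($detect->isCrawler('Sendsay.Ru/1.0; https://Sendsay.Ru/; ask@sendsay.ru')); // bool(true)
```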

OK, then please close this issue and I will PR this really rare Russian crawler)

MaxGiting commented 5 years ago

@gplumb I see what you mean! Yes, there could have been a collision in that scenario.

@stasgar There are probably a few other rare and even defunct user agents in our list by now, but hopefully other people will benefit from your contribution.