JayBizzle / Crawler-Detect

🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent
https://crawlerdetect.io
MIT License

FEATURE Injectable components to support custom detection #384

Closed tractorcow closed 4 years ago

tractorcow commented 4 years ago

This allows, for instance, a custom app to whitelist / blacklist custom agents.

For example, this is usercode from an app which needs to manually whitelist wechat.

    /**
     * Check if a request is a crawler, with a WeChat exemption
     *
     * @param HTTPRequest $request
     * @return bool
     */
    protected function checkIsCrawler(HTTPRequest $request): bool
    {
        // Add wechat exemption to crawler detection
        $exclusions = new Exclusions();
        $exclusionList = array_merge(
            $exclusions->getAll(),
            ['MicroMessenger\/']
        );
        $exclusions->setAll($exclusionList);

        // Set custom exclusions to new detector
        $detect = new CrawlerDetect();
        $detect->setExclusions($exclusions);

        // Detect bots
        $userAgent = $request->getHeader('User-Agent');
        return $detect->isCrawler($userAgent);
    }

Setters return $this to support chaining. E.g. $crawler->setUaHttpHeaders($headers)->setHeaders($appHeaders)
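The chaining behaviour can be sketched as below. This is a standalone, illustrative re-implementation, not the library's actual `Exclusions` class; the `setAll()` setter is part of this proposal and its return type here is an assumption.

```php
<?php
// Minimal sketch of the fluent-setter pattern proposed above.
// Hypothetical standalone class for illustration only.

class Exclusions
{
    /** @var string[] regex fragments for agents to exempt */
    private $patterns = ['Mozilla\/'];

    public function getAll(): array
    {
        return $this->patterns;
    }

    public function setAll(array $patterns): self
    {
        $this->patterns = $patterns;
        return $this; // returning $this is what enables chaining
    }
}

// Because each setter returns the instance, calls can be chained.
$exclusions = new Exclusions();
$result = $exclusions
    ->setAll(array_merge($exclusions->getAll(), ['MicroMessenger\/']))
    ->getAll();

var_dump(in_array('MicroMessenger\/', $result, true)); // bool(true)
```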

JayBizzle commented 4 years ago

I'm not totally averse to this, but the reason we haven't done it in the past is to encourage people to PR their bot user-agents and not keep them to themselves.

What are your thoughts on this?

tractorcow commented 4 years ago

I think there are certain situations where flexibility is necessary; a global site might care more about catching more bots than about excluding certain traffic, whereas a China-located site might care more about excluding non-regional traffic. I think a single "source of truth" ignores the regional context. :)

tractorcow commented 4 years ago

Perhaps a good middle ground is an "aggressive mode", where greylisted bots are recorded and can be flagged either way?

In this context, MicroMessenger is a valid non-bot messenger, but because it's misused so extensively by bots masquerading as that user agent, it could deserve greylist status.
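The greylist idea above could look roughly like this. Everything here is hypothetical: neither `GreylistDetector` nor an aggressive mode exists in Crawler-Detect; the sketch only shows how a single flag could let callers decide whether greylisted agents count as crawlers.

```php
<?php
// Hypothetical sketch of the "aggressive mode" greylist proposal.
// None of these names are part of the Crawler-Detect API.

class GreylistDetector
{
    /** @var string[] patterns for valid agents that bots frequently masquerade as */
    private $greylist = ['MicroMessenger'];

    /** @var bool when true, greylisted agents are treated as crawlers */
    private $aggressive;

    public function __construct(bool $aggressive = false)
    {
        $this->aggressive = $aggressive;
    }

    public function isCrawler(string $userAgent): bool
    {
        foreach ($this->greylist as $pattern) {
            if (preg_match('/' . $pattern . '/i', $userAgent)) {
                // Greylisted: the configured mode decides either way.
                return $this->aggressive;
            }
        }
        // A full implementation would fall back to the main blacklist here.
        return false;
    }
}

$ua = 'Mozilla/5.0 (Linux; Android 10) MicroMessenger/7.0';
var_dump((new GreylistDetector(true))->isCrawler($ua));  // bool(true)
var_dump((new GreylistDetector(false))->isCrawler($ua)); // bool(false)
```

This keeps the shared list as the single source of truth for unambiguous bots, while pushing the regional judgement call (such as WeChat) to the integrating application.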

JayBizzle commented 4 years ago

I'm away from my main computer this week, taking some time off, so I will review this PR in full when I'm back. Thanks for your thoughts 👍

tractorcow commented 4 years ago

You're welcome; No rush. :)

JayBizzle commented 4 years ago

@MaxGiting Any thoughts?

MaxGiting commented 4 years ago

If this is intended to be used to quickly get user agents added to the list that should be in this library, then I am opposed, for the same reason you already mention, @JayBizzle: they should be PR'd.

If it is intended for adding user agents we would never add to the library, then maybe, but from the examples given above I am not sold.

To me, a bot is a bot regardless of region and should be in this library. Similarly, if there were a grey list of bots, they should all just be included. Bad bots often change their user agent to look like legitimate traffic by using generic browser user agents, so I'm not sure this is a big help in those situations.

@tractorcow have I misunderstood your regional and masked traffic examples?

And should WeChat simply be in this library anyway? Is it an actual browser as well or just the user agent for preview links?

tractorcow commented 4 years ago

> And should WeChat simply be in this library anyway? Is it an actual browser as well or just the user agent for preview links?

Given that stance, we need to remove that agent from the blacklist, since it's a valid browser.

tractorcow commented 4 years ago

Closing to re-open with the appropriate change as requested.

tractorcow commented 4 years ago

Replacement PR at https://github.com/JayBizzle/Crawler-Detect/pull/395

Thanks for your response @MaxGiting :D