JayBizzle / Crawler-Detect

🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent
https://crawlerdetect.io
MIT License

Returning unexpected results #365

Closed: sumanthratna closed this issue 4 years ago

sumanthratna commented 4 years ago

I installed via composer; this is what I'm trying:

require __DIR__ . '/vendor/autoload.php';
// ...
$CrawlerDetect = new \Jaybizzle\CrawlerDetect\CrawlerDetect;
error_log($CrawlerDetect->isCrawler());
error_log($CrawlerDetect->isCrawler($_SERVER['HTTP_USER_AGENT']));
error_log($_SERVER['HTTP_USER_AGENT']);
error_log($CrawlerDetect->getMatches());

When I run GTmetrix on my site, I get two different hits: one has GTmetrix in the user agent header, and the other looks like a real user. When using the online service, I see that the first hit is correctly identified as a crawler and the second isn't (the second actually is a crawler but I don't think there's anything that can be done about it since it looks real).

Here's what's written to the logs:

# first hit (looks like a crawler and is a crawler):
1  # isCrawler()
1  # isCrawler($_SERVER['HTTP_USER_AGENT'])
Mozilla/5.0 (X11; Linux x86_64; GTmetrix https://gtmetrix.com/) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36  # $_SERVER['HTTP_USER_AGENT']
null  # getMatches()

# second hit (looks like a user but is a crawler):
  # isCrawler()
  # isCrawler($_SERVER['HTTP_USER_AGENT'])
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36  # $_SERVER['HTTP_USER_AGENT']
null  # getMatches()

What I would expect is:

# first hit (looks like a crawler and is a crawler):
true  # isCrawler()
true  # isCrawler($_SERVER['HTTP_USER_AGENT'])
Mozilla/5.0 (X11; Linux x86_64; GTmetrix https://gtmetrix.com/) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36  # $_SERVER['HTTP_USER_AGENT']
GTmetrix  # getMatches()

# second hit (looks like a user but is a crawler):
false  # isCrawler()
false  # isCrawler($_SERVER['HTTP_USER_AGENT'])
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36  # $_SERVER['HTTP_USER_AGENT']
null  # getMatches()

Loading the class with new \Jaybizzle\CrawlerDetect\CrawlerDetect() (note the parentheses) returns the same incorrect results.
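
A note on reading the logged values: `error_log()` expects a string message, so a boolean passed to it is converted first. PHP converts `true` to `"1"` and `false` to an empty string, which would be consistent with the `1` and blank lines in the log above. The following is a minimal sketch of that conversion, with `describe_for_log()` as a hypothetical helper name (it is not part of CrawlerDetect) that uses `var_export()` to keep booleans and null readable:

```php
<?php
// Hypothetical helper (not part of CrawlerDetect): render a value so that
// booleans and null stay distinguishable when written to a log.
function describe_for_log($value): string
{
    // With the second argument set to true, var_export() returns the
    // representation as a string instead of printing it.
    return var_export($value, true);
}

// Plain string conversion, as happens when a boolean is used as a message.
$castTrue  = (string) true;   // "1"
$castFalse = (string) false;  // "" (empty string)

// The same values rendered through the helper.
$shownTrue  = describe_for_log(true);   // "true"
$shownFalse = describe_for_log(false);  // "false"
$shownNull  = describe_for_log(null);   // "NULL"

echo $shownTrue, PHP_EOL, $shownFalse, PHP_EOL, $shownNull, PHP_EOL;
```

Under this reading, something like `error_log(describe_for_log($CrawlerDetect->isCrawler()))` would log `true` or `false` rather than `1` or a blank line. This only addresses how the boolean results are displayed; it does not by itself explain the `getMatches()` output.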

Output of php -v:

PHP 7.0.33-0ubuntu0.16.04.7 (cli) ( NTS )
Copyright (c) 1997-2017 The PHP Group
Zend Engine v3.0.0, Copyright (c) 1998-2017 Zend Technologies
    with Zend OPcache v7.0.33-0ubuntu0.16.04.7, Copyright (c) 1999-2017, by Zend Technologies

JayBizzle commented 4 years ago

We can only analyse the user-agent that is set by the server. I don't think there is anything we can "fix" here unless I am misunderstanding the issue? 😕

sumanthratna commented 4 years ago

My understanding is that we cannot classify the second hit as a crawler, but we can for the first. The first hit specifically mentions GTmetrix and the URL. Further, the website classified the first hit's user-agent as a crawler, so I'm unsure why it doesn't work on my server.
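
To illustrate the distinction between the two hits, here is a rough sketch of user-agent matching. It uses a single hard-coded pattern and a hypothetical function name, `find_crawler_token()`. It is not CrawlerDetect's actual pattern list or matching logic, only a demonstration that the first user agent carries an identifying token and the second does not:

```php
<?php
// Illustrative only: one hard-coded pattern, NOT CrawlerDetect's real
// pattern list or implementation.
function find_crawler_token(string $userAgent)
{
    // Hypothetical pattern chosen for this example.
    if (preg_match('/GTmetrix/i', $userAgent, $matches) === 1) {
        return $matches[0]; // the substring that matched
    }

    return null; // nothing in the user agent identifies a crawler
}

// The two user agents quoted in the report above.
$first  = 'Mozilla/5.0 (X11; Linux x86_64; GTmetrix https://gtmetrix.com/) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36';
$second = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36';

var_export(find_crawler_token($first));   // prints 'GTmetrix'
echo PHP_EOL;
var_export(find_crawler_token($second));  // prints NULL
echo PHP_EOL;
```

The second user agent is indistinguishable from an ordinary Chrome on Linux string, so no approach based on the user agent alone can flag it.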