JayBizzle / Crawler-Detect

🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent
https://crawlerdetect.io
MIT License
2.01k stars 258 forks

Problems with twitter and facebook bots #303

Closed. himiro closed this issue 6 years ago

himiro commented 6 years ago

Hello,

We use crawler-detect to detect social network bots, and we've noticed that some bot user agents are not detected. Here they are:

Twitter:

Facebook:

I've put them in tests/crawlers.txt, but we cannot differentiate these user agents from ordinary ones (except for the first), so I only added TrendsmapResolver to Fixtures/Crawlers.php.

Could you please let me know how to recognize that they're bots?

Yours sincerely,

Mathilde

MaxGiting commented 6 years ago

Can I ask how you know (with the exception of TrendsmapResolver) that the first batch are all from Twitter and the second are all from Facebook?

There is another issue, #238, that has brought up Facebook using different headers when prefetching. Currently we do not inspect the HTTP_PURPOSE or HTTP_X_PURPOSE headers, but Facebook and others have been known to use these when prefetching data, i.e. when the request is not an actual user visit.
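As a rough illustration of what inspecting those headers could look like, here is a minimal sketch in Python using a WSGI-style `environ` dict, where the same `HTTP_PURPOSE` and `HTTP_X_PURPOSE` keys appear. The helper name `is_prefetch` and the accepted values `preview` and `prefetch` are assumptions for the example; this is not something CrawlerDetect does today.

```python
# Hypothetical helper, not part of CrawlerDetect.
# The accepted header values are an assumption for this sketch.
PREFETCH_VALUES = {"preview", "prefetch"}


def is_prefetch(environ):
    """Return True if the request headers mark it as a prefetch/preview."""
    for key in ("HTTP_PURPOSE", "HTTP_X_PURPOSE"):
        # Missing headers default to an empty string, which never matches.
        value = environ.get(key, "")
        if value.strip().lower() in PREFETCH_VALUES:
            return True
    return False
```

A request carrying `HTTP_X_PURPOSE: preview` would be flagged, while a request with only a user agent header would not, regardless of what the user agent string looks like.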

Also, yes, we should add TrendsmapResolver as a known bot. We currently check for "Trendsmap Resolver" with a space, but they have evidently removed the space, so we need to check for both.
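To show the idea of matching both spellings with one pattern, here is a small Python sketch. The pattern, the function name `is_trendsmap`, and the case-insensitive flag are illustrative choices, not the entry actually used in Fixtures/Crawlers.php.

```python
import re

# Illustrative pattern only: the optional space ("" or " ") lets a single
# expression match both "Trendsmap Resolver" and "TrendsmapResolver".
TRENDSMAP = re.compile(r"Trendsmap ?Resolver", re.IGNORECASE)


def is_trendsmap(user_agent):
    """Return True if the user agent contains either Trendsmap spelling."""
    return TRENDSMAP.search(user_agent) is not None
```

Using one pattern with an optional space avoids keeping two near-identical entries in sync.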

Would you like to add TrendsmapResolver in a PR?

himiro commented 6 years ago

Thanks for the answer. We retrieve the social network data, so we know which one the action comes from.

We already check for prefetching, but I will run some additional tests to see if we missed something.

The PR is done.

MaxGiting commented 6 years ago

We retrieve the social network data, so we know which one the action comes from.

Could you explain this in more detail?

I have left some feedback on the PR, thank you 👍

himiro commented 6 years ago

We get the URL and the headers, which give us some data such as the user agent.

MaxGiting commented 6 years ago

Sorry, I don't understand how you are determining that the user agents in the list are from Twitter and Facebook.

Is there a specific header or IP address that is telling you those user agents are from social networks?

himiro commented 6 years ago

I didn't handle this part myself, but apparently it comes from the URL and its routes.