JayBizzle / Crawler-Detect

🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent
https://crawlerdetect.io
MIT License

False positives #326

Closed annando closed 5 years ago

annando commented 5 years ago

http.rb is a Ruby HTTP library that appears, for example, in the user-agent string of Mastodon: https://github.com/tootsuite/mastodon

hackney is an HTTP library written in Erlang. It is used, for example, in the Pleroma project: https://git.pleroma.social/pleroma/pleroma

Faraday is a Ruby HTTP library that can be used for good or bad purposes: https://github.com/lostisland/faraday

github.com is part of the user-agent string "Social-Relay/1.6.0-dev - https://github.com/jaywink/social-relay", which is not a bot or crawler.

okhttp is an HTTP library for Android: https://square.github.io/okhttp/. It is used, for example, by Twidere: https://github.com/TwidereProject/Twidere-Android

UniversalFeedParser is used to subscribe to RSS feeds: https://pythonhosted.org/feedparser/introduction.html

I'm not sure why PixelFedBot (https://pixelfed.org) is detected as a crawler; it is a "good" service.

python-requests is used as an HTTP library by projects like microblog.pub.

The user-agent string WordPress/ is used by ActivityPub plugins for WordPress, which is totally fine.

P.S.: During my tests I also saw that many other libraries are flagged as crawlers, although they are just HTTP libraries for various languages and can be used for good or bad. Is this behaviour intended? I'm asking because the software project where we want to use this library is meant to communicate with many other systems that use these kinds of libraries.

JayBizzle commented 5 years ago

Hi,

Thanks for your comprehensive explanation.

The purpose of this library is to detect visits from a bot/crawler and distinguish them from an actual human.

I believe all the user agents you highlighted above relate to non-human traffic, unless you can tell me otherwise.

Closing this for now, but feel free to reply if you feel I have closed this prematurely.

Thanks 👍

annando commented 5 years ago

It would be better if the library had different classes of agents, to separate HTTP libraries from bots and crawlers. Sadly, in its current state this library is useless for our project.
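
To illustrate what "different classes of agents" could look like, here is a minimal sketch in Python. It is not CrawlerDetect's API and does not use its pattern lists; the category names, the `classify_agent` function, and the regular expressions are all hypothetical, built only from the agent names mentioned earlier in this thread.

```python
import re
from typing import Optional

# Hypothetical categories and patterns, for illustration only.
# They are NOT taken from CrawlerDetect's own pattern lists.
# Order matters: the more specific classes are checked before
# the generic "crawler" catch-all.
AGENT_CLASSES = [
    ("http_library", re.compile(
        r"http\.rb|hackney|faraday|okhttp|python-requests", re.IGNORECASE)),
    ("feed_reader", re.compile(r"universalfeedparser", re.IGNORECASE)),
    ("crawler", re.compile(r"bot|crawl|spider", re.IGNORECASE)),
]


def classify_agent(user_agent: str) -> Optional[str]:
    """Return the first matching class name, or None if nothing matches."""
    for name, pattern in AGENT_CLASSES:
        if pattern.search(user_agent):
            return name
    return None
```

With a split like this, a caller that talks to other servers could accept the `http_library` and `feed_reader` classes while still blocking `crawler`. The example strings one would pass in (such as `python-requests/2.21.0`) are illustrative; real user-agent strings vary by version and deployment.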

JayBizzle commented 5 years ago

Open to PRs 👍