Open blaine-arcjet opened 2 months ago
I went through the list of RegExps that isbot
provides and removed every one where we overlapped. It is worth noting that isbot
uses case-insensitive RegExps and we use case-sensitive ones here.
I've grouped the remaining patterns below.
Generic patterns: I think we'll discard all of these because they are so generic that we wouldn't be able to identify a specific bot.
[ ]+bot
^[a-z.0-9/ \-_]*bot
^email
^java/
^javascript
^php
analyzer
archiver
bot($|[/\);-]+)
checker
cloudflare
crawler
extractor
fetcher
http[s]?://
monitoring
optimizer
robot
scraper
spider
transcoder
uptime
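As a rough illustration of why a generic pattern cannot name a specific bot, here is a minimal TypeScript sketch. The user agents are made up for this example and `matchedBy` is a hypothetical helper, not part of isbot or our code. The `i` flag mirrors isbot's case-insensitive matching.

```typescript
// One of the generic patterns from the list above (the "/" is escaped for the literal).
const genericBot: RegExp = /bot($|[\/\);-]+)/i;

// Made-up user agents for illustration only; not taken from real traffic.
const browserUserAgent = "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0";
const sampleUserAgents: string[] = [
  "Mozilla/5.0 (compatible; ExampleBot/1.0; +https://example.com/about)",
  "acme-bot/0.3",
  "OtherCrawlerBot; contact: ops@example.org",
  browserUserAgent,
];

// Hypothetical helper: return every user agent the pattern matches.
function matchedBy(pattern: RegExp, userAgents: string[]): string[] {
  return userAgents.filter((ua) => pattern.test(ua));
}

// Three unrelated (made-up) bots all match the same pattern, so a hit only
// says "some bot", not which one.
console.log(matchedBy(genericBot, sampleUserAgents));
```

A match here tells us a request is probably automated, but nothing in the pattern lets us attach a bot name or category to it.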
Not user agents: I don't think these are actually user agents, but I need to double-check.
google page speed insight
google search
Little to no overlap: We don't seem to match any of these or they overlap with a different pattern.
^ad muncher
^avsdevicesdk/
^bidtellect/
^blogtrottr
^boardreader
^castro
^collectd
^comodo
^cortex
^ddg_android
^ez publish
^fdm[\s/]\d
^holmes
^lua-resty-http
^navermailapp
^netlyzer fastprobe
^netsurf
^newsgator
^octopus
^postmanruntime
^prittorrent
^rainmeter
^ramblermail
^server density
^sitesucker
^snapchat
^spotify/
^sprinklr
^the knowledge ai
^unityplayer
^websitepulse
^windows-rss
^wsr-agent
^yahoo:linkexpander
^yahoocachesystem
^zooshot
apachebench/
arachni
banca caboto
browsershots
catchpoint
curious george
datadog agent
daum(oa)?[ /][0-9]
- there is an "instance" of this in "mediapartners-google" because it mimics
dmbrowser
duplexweb-google
gobuster
gomezagent
googleimageproxy
goose/
guzzlehttp
help@dataminr\.com
heritrix
- we classify this as internet archive and may want to fine-tune
internetarchive
- see directly above
kaspersky
kouio\.com
larbin
- this shows up in the instances of binlar
m_bot_tab
mnogosearch
prtg network monitor
pycurl
qihoobot
qqdownload
robozilla
site24x7
sixy\.ch
sparkler/
sputnik
statically-
staticlogin:productcbox
statuscake
supybot
thinkchaos
turbotabbee
unshortenit
urlgrabber/
webreaper
webthumbnail
whatcms/
We should compare the list of user agents we match against the list isbot matches to see if we are missing any.
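One possible way to run that comparison is sketched below in TypeScript. Everything here is an assumption for illustration: `isbotPatterns` holds one pattern from this issue plus a made-up one, `ourPatterns` is a stand-in for our own list, and the sample user agents are invented. The isbot-style patterns are compiled with the `i` flag and ours without it, to reflect the case-sensitivity difference noted at the top.

```typescript
// "^postmanruntime" is from the list above; "examplebot" is a made-up placeholder.
const isbotPatterns: string[] = ["^postmanruntime", "examplebot"];
// Stand-in for our own case-sensitive list (hypothetical entry).
const ourPatterns: string[] = ["ExampleBot"];

// Compile pattern sources with the given flags ("i" for isbot-style, "" for ours).
function compile(sources: string[], flags: string): RegExp[] {
  return sources.map((source) => new RegExp(source, flags));
}

function matchesAny(patterns: RegExp[], userAgent: string): boolean {
  return patterns.some((pattern) => pattern.test(userAgent));
}

// Return the user agents that the isbot-style patterns flag but ours miss.
function findMissing(userAgents: string[]): string[] {
  const theirs = compile(isbotPatterns, "i");
  const ours = compile(ourPatterns, "");
  return userAgents.filter((ua) => matchesAny(theirs, ua) && !matchesAny(ours, ua));
}

// Invented sample user agents, not real traffic.
const samples: string[] = [
  "PostmanRuntime/7.36.0",
  "ExampleBot/1.0",
  "examplebot/1.0", // lowercase variant: caught only by the case-insensitive pattern
  "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
];

console.log(findMissing(samples));
```

With these samples, the result contains the user agent we have no pattern for and the lowercase variant that our case-sensitive pattern misses. Running the same check over a larger corpus of real user agents would show where the two lists actually diverge.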