Pi-George opened this issue 4 years ago
It should be simple to construct your own list: only crawlers access robots.txt.
$ tail -n +2 sergeykish.com_usage_* | grep robots.txt | cut -d, -f13 -s | sort -u | sed 's/^"//'
apitools gsutil/4.50 Python/3.8.3 (linux) google-cloud-sdk/293.0.0 analytics/disabled
BananaBot/0.6.1
BorneoBot/0.9.1 (crawlcheck123@gmail.com)
...
I have logs in Google Cloud Storage usage log format. The idea is to mimic https://github.com/allinurl/goaccess/blob/master/src/browsers.c#L125:
$ tail -n +2 sergeykish.com_usage_* | grep robots.txt | cut -d, -f13 -s | sort -u | sed 's/^"//;s/$/\tCrawlers/' > browsers.list
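The resulting browsers.list has one tab-separated entry per user agent (these example lines are just the ones from the output above):
$ head -3 browsers.list
apitools gsutil/4.50 Python/3.8.3 (linux) google-cloud-sdk/293.0.0 analytics/disabled	Crawlers
BananaBot/0.6.1	Crawlers
BorneoBot/0.9.1 (crawlcheck123@gmail.com)	Crawlers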
Now run goaccess --ignore-crawlers --browsers-file=browsers.list
EDIT: It blocks too much. cut splits the User-Agent string on the first comma, so
Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://aspiegel.com/petalbot),gzip(gfe)
ends up in the list as just
Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML
which then classifies every regular Mozilla/AppleWebKit browser as a crawler.
I've tried
$ tail -n +2 sergeykish.com_usage_2020_* | grep robots.txt | awk -F'","' '{print $13, "\tCrawlers" }' | sort -u
but goaccess can't process it, even though there are no tabs inside the User-Agent strings:
Fatal error has occurred
Error occurred at: src/browsers.c - parse_browser_token - 279
Malformed browser name at line: 10
Looks like the way to go is to get the unique bot names from
$ tail -n +2 sergeykish.com_usage_2020_* | grep robots.txt | awk -F'","' '{print $13}' | sort | uniq -c | sort -n
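Something like this could then turn those names into short tokens (just a sketch; the [A-Za-z]+[Bb]ot pattern is my own rough heuristic and the resulting list still needs manual review, but as far as I can tell goaccess matches these entries as substrings of the User-Agent):
$ tail -n +2 sergeykish.com_usage_2020_* | grep robots.txt \
    | awk -F'","' '{print $13}' | sort -u \
    | grep -oE '[A-Za-z]+[Bb]ot' | sort -u \
    | sed 's/$/\tCrawlers/' > browsers.list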
In my case, bots top the overall list too:
$ tail -n +2 sergeykish.com_usage_2020_* | awk -F'","' '{print $13}' | sort | uniq -c | sort -n
> Regular users normally don't have unknown browsers, and if they do I don't really need them in the report to get a good representation of my userbase.
I think this has not been the case for many years now. There are tons of unknown user agents that are legitimate: all the embedded browsers and all the apps on mobile phones. For example, sometimes our greatest traffic comes from the Hacker News app, which has user agents like: HackerNews/1493 CFNetwork/1240.0.4 Darwin/20.6.0
Should these be counted by goaccess? I think definitely.
Actually, I came here to report that I sometimes see huge discrepancies between our manually pruned log files and what goaccess reports with ignore-crawlers: 13368 vs 7281 for an article, and I am trying to track down the difference. Are the unknowns counted or not?
With ignore-crawlers false I get 14335 vs 14309, which is much better, as there are probably only around 1000 real crawlers. Some of the goaccess built-in crawler entries must be very heavy-handed, but looking at the list in the source code, nothing obvious jumps out at me.
For now I think it's better to disable ignore-crawlers and just factor in 5-10% crawlers...
So this issue isn't something I'm too interested in any more, as we're just using Bugsnag for our stability score, but the issue I reported does still sound like a problem. Being able to provide a list of acceptable or unacceptable user agents, a whitelist and a blacklist, would solve the HackerNews example you mentioned. Maybe that's already a feature; it's been too long since I used this repo.
In my use case, the product has comparatively few users, mostly some large companies coming onto the site to grab generated reports, plus the occasional API request every minute or so. The vast, vast majority of our traffic is from bots spamming endpoints, so it was extra important for me to ignore their requests as much as possible when generating a stability score.
I agree that different use cases need different user agent lists... I updated this issue instead of creating a new one because I thought they were related, but I am starting to have second thoughts :} For now I have just turned off crawler filtering; I'll open a new issue if I find I don't trust what it filters out...
This is something that needs to be revised and looked into. However, you can always get a log of all those unknowns to get a better sense of who is hitting your server:
--unknowns-log=<filename>
Log unknown browsers and OSs to the specified file.
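For example (the log file name, log format, and output file here are just placeholders for your own setup):
$ goaccess access.log --log-format=COMBINED --unknowns-log=unknowns.log -o report.html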
Regular users normally don't have unknown browsers, and if they do, I don't really need them in the report to get a good representation of my userbase. With just --ignore-crawlers I still have tons of very obvious bots in my report, with /phpmyadmin/ being one of the most frequently hit paths.
Alternatively, being able to filter out all redirects would help too, since spammed URLs like the above-mentioned /phpmyadmin/ just get redirected to the home page anyway.
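For example, something along these lines could drop 3xx responses before the log even reaches goaccess (just a sketch assuming a standard combined log where the status code is the 9th whitespace-separated field; adjust for other formats):
$ awk '$9 !~ /^3/' access.log | goaccess --log-format=COMBINED -o report.html -
I believe newer goaccess versions also have an --ignore-status option that might cover this without the pre-filtering, but I haven't checked.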
Essentially I want to filter out as many non-users as possible. I'm doing this so I can try to calculate a more accurate stability score.