Pi-George opened this issue 4 years ago
It should be simple to construct your own list: only crawlers access robots.txt.
$ tail -n +2 sergeykish.com_usage_* | grep robots.txt | cut -d, -f13 -s | sort -u | sed 's/^"//'
apitools gsutil/4.50 Python/3.8.3 (linux) google-cloud-sdk/293.0.0 analytics/disabled
BananaBot/0.6.1
BorneoBot/0.9.1 (crawlcheck123@gmail.com)
...
I have logs in Google Cloud Storage usage log format. The idea is to mimic https://github.com/allinurl/goaccess/blob/master/src/browsers.c#L125:
$ tail -n +2 sergeykish.com_usage_* | grep robots.txt | cut -d, -f13 -s | sort -u | sed 's/^"//;s/$/\tCrawlers/' > browsers.list
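The resulting browsers.list has one tab-separated entry per user agent (these example lines are just the ones from the output above):
$ head -3 browsers.list
apitools gsutil/4.50 Python/3.8.3 (linux) google-cloud-sdk/293.0.0 analytics/disabled	Crawlers
BananaBot/0.6.1	Crawlers
BorneoBot/0.9.1 (crawlcheck123@gmail.com)	Crawlers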
Now run goaccess --ignore-crawlers --browsers-file=browsers.list
EDIT: It blocks too much. cut splits the User-Agent string on the first comma, so
Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://aspiegel.com/petalbot),gzip(gfe)
ends up in the list as just
Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML
which then classifies every regular Mozilla/AppleWebKit browser as a crawler.
I've tried
$ tail -n +2 sergeykish.com_usage_2020_* | grep robots.txt | awk -F'","' '{print $13, "\tCrawlers" }' | sort -u
but goaccess can't process it, even though there are no tabs inside the User-Agent strings:
Fatal error has occurred
Error occurred at: src/browsers.c - parse_browser_token - 279
Malformed browser name at line: 10
Looks like the way to go is to get the unique bot names from
$ tail -n +2 sergeykish.com_usage_2020_* | grep robots.txt | awk -F'","' '{print $13}' | sort | uniq -c | sort -n
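Something like this could then turn those names into short tokens (just a sketch; the [A-Za-z]+[Bb]ot pattern is my own rough heuristic and the resulting list still needs manual review, but as far as I can tell goaccess matches these entries as substrings of the User-Agent):
$ tail -n +2 sergeykish.com_usage_2020_* | grep robots.txt \
    | awk -F'","' '{print $13}' | sort -u \
    | grep -oE '[A-Za-z]+[Bb]ot' | sort -u \
    | sed 's/$/\tCrawlers/' > browsers.list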
In my case, bots top the overall list too:
$ tail -n +2 sergeykish.com_usage_2020_* | awk -F'","' '{print $13}' | sort | uniq -c | sort -n
> Regular users normally don't have unknown browsers, and if they do I don't really need them in the report to get a good representation of my userbase.
I think this has not been the case for many years now. There are tons of unknown user agents that are legitimate: all the embedded browsers and all the apps on mobile phones. For example, sometimes our greatest traffic comes from the Hacker News app, which has user agents like: HackerNews/1493 CFNetwork/1240.0.4 Darwin/20.6.0
Should these be counted by goaccess? I think definitely.
Actually, I came here to report that I sometimes see huge discrepancies between our manually pruned log files and what goaccess reports with ignore-crawlers: 13368 vs 7281 for an article, and I am trying to track down the difference. Are the unknowns counted or not?
With ignore-crawlers false I get 14335 vs 14309, which is much better, as there are probably only around 1000 real crawlers. Some of the goaccess built-in crawler entries must be very heavy-handed, but looking at the list in the source code, nothing obvious jumps out at me.
For now I think it's better to disable ignore-crawlers and just factor in 5-10% crawlers...
So this issue isn't something I'm too interested in any more, as we're just using Bugsnag for our stability score, but the issue I reported does still sound like a problem. Being able to provide a list of acceptable or unacceptable user agents, a whitelist and a blacklist, would solve the HackerNews example you mentioned. Maybe that's already a feature; it's been too long since I used this repo.
In my use case, the product has comparatively few users, mostly some large companies coming onto the site to grab generated reports, plus the occasional API request every minute or so. The vast, vast majority of our traffic is from bots spamming endpoints, so it was extra important for me to ignore their requests as much as possible when generating a stability score.
I agree that different use cases need different user agent lists... I updated this issue instead of creating a new one because I thought they were related, but I am starting to have second thoughts :} For now I have just turned off crawler filtering; I'll open a new issue if I find I don't trust what it filters out...
This is something that needs to be revised and looked into. However, you can always get a log of all those unknowns to get a better sense of who is hitting your server:
--unknowns-log=<filename>
Log unknown browsers and OSs to the specified file.
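For example (the log file name, log format, and output file here are just placeholders for your own setup):
$ goaccess access.log --log-format=COMBINED --unknowns-log=unknowns.log -o report.html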
Regular users normally don't have unknown browsers, and if they do, I don't really need them in the report to get a good representation of my userbase. With just --ignore-crawlers I still have tons of very obvious bots in my report, with /phpmyadmin/ being one of the most frequently hit paths.
Alternatively, being able to filter out all redirects would help too, since spammed URLs like the above-mentioned /phpmyadmin/ just get redirected to the home page anyway.
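For example, something along these lines could drop 3xx responses before the log even reaches goaccess (just a sketch assuming a standard combined log where the status code is the 9th whitespace-separated field; adjust for other formats):
$ awk '$9 !~ /^3/' access.log | goaccess --log-format=COMBINED -o report.html -
I believe newer goaccess versions also have an --ignore-status option that might cover this without the pre-filtering, but I haven't checked.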
Essentially I want to filter out as many non-users as possible. I'm doing this so I can try to calculate a more accurate stability score.