allinurl / goaccess

GoAccess is a real-time web log analyzer and interactive viewer that runs in a terminal in *nix systems or through your browser.
https://goaccess.io
MIT License

Filtering referrers out #1841

Closed: Mikanoshi closed this issue 4 years ago

Mikanoshi commented 4 years ago

How exactly do hide-referer and ignore-referer work? Do they match against the referrer domain or the whole URL? Does * match dots? Does ignore-referer exclude a log line from all stats or just from the referrer ones? I have this defined in my config:

hide-referer *.domain.com
hide-referer domain.com

but I can still see domain.com, www.domain.com and DOMAIN.COM (without protocol) in the Referrers URLs pane. The domain is partially removed from the Referring Sites pane, but domains like these are still there:

www.sub.domain.com
sub2.domain2.tld.domain.com
domain.com.

allinurl commented 4 years ago

Please let me know if this helps clarify your question:

google.net:80 93.146.86.3 - - [01/Mar/2016:06:11:31 -0600] "GET /images/dia_carino_corazon_gt.png HTTP/1.1" 200 1059 "http://www.google.net/blog/6/como-revivir-una-flor-marchita" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36" 2642
google.net:80 93.146.86.3 - - [01/Mar/2016:06:11:31 -0600] "GET /images/bg.jpg HTTP/1.1" 200 6419 "http://abc.google.net/css/style.css?2011082301" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36" 2526
google.net:80 93.146.86.3 - - [01/Mar/2016:06:11:31 -0600] "GET /images/xml.gif HTTP/1.1" 200 503 "http://cde.fgh.google.net/blog/6/como-revivir-una-flor-marchita" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36" 1166
google.net:80 93.146.86.3 - - [01/Mar/2016:06:11:31 -0600] "GET /banners/1181867705.jpg HTTP/1.1" 200 48385 "http://google.net/blog/6/como-revivir-una-flor-marchita" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36" 5228

goaccess --ignore-referer=*.google.net --ignore-referer=google.net ../logs/oneliner.log

vs

goaccess --hide-referer=*.google.net --hide-referer=google.net ../logs/oneliner.log

vs

goaccess --ignore-referer=*.google.net --ignore-referer=google.net --ignore-referer=*fgh.google.net ../logs/oneliner.log

Notice the total # of requests and valid requests.

[Screenshots: GoAccess output for each of the three commands above]
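To make the comparison concrete, here is a toy model of the two options. It is only a sketch and not GoAccess code: it assumes that an ignored referrer removes the whole log line from the counts, while a hidden referrer is still counted but not listed in the referrer panel. The `summarize` function, the exact-host filters, and the `example.org` record are all hypothetical, and pattern matching is deliberately left out.

```python
from collections import Counter

def summarize(records, ignore=frozenset(), hide=frozenset()):
    """Toy model: records are (path, referrer_host) pairs; filters are exact host names."""
    counted = 0              # lines that contribute to the statistics
    referrers = Counter()    # what a referrer panel would list
    for _path, host in records:
        if host in ignore:
            continue         # assumption: an ignored line is not counted at all
        counted += 1
        if host not in hide:  # assumption: a hidden referrer is counted but not listed
            referrers[host] += 1
    return counted, dict(referrers)

records = [
    ("/images/bg.jpg", "abc.google.net"),
    ("/images/xml.gif", "cde.fgh.google.net"),
    ("/banners/1181867705.jpg", "example.org"),  # hypothetical external referrer
]
own = {"abc.google.net", "cde.fgh.google.net"}

print(summarize(records, ignore=own))  # fewer lines counted
print(summarize(records, hide=own))    # all lines counted, own hosts not listed
```

Under these assumptions, both calls list the same referrers, and only the number of counted lines differs.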

Mikanoshi commented 4 years ago

So it is impossible to hide/ignore all entries that have a specified domain in the referrer, regardless of its format? Does GoAccess expect the referrer to be a valid URL? I want to exclude ANY entries that mention my own domains, with any number of subdomains. The referrer is easily faked, so it can be anything. Vulnerability scanners just put the domain name there, without a protocol, and the filter fails for such entries.

allinurl commented 4 years ago

Not at the moment. However, it's a straightforward change on this line https://github.com/allinurl/goaccess/blob/master/src/parser.c#L1727

from

if (ignore_referer (logitem->site))

to

if (ignore_referer (logitem->ref))

Before I do that, are you able to post a few sample lines from your access log so I can better understand the entries you're seeing? Thanks
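As a rough illustration of why matching on the extracted site differs from matching on the raw referrer, the sketch below uses Python's standard URL parser. This is not how GoAccess extracts the site; `referrer_host` is a hypothetical helper, and the point is only that a referrer without a protocol may yield no host for a host-based filter to match.

```python
from urllib.parse import urlsplit

def referrer_host(referrer):
    """Return the lowercased host of a referrer, or None when it has no scheme/netloc."""
    return urlsplit(referrer).hostname

samples = (
    "http://cde.fgh.google.net/blog/6/como-revivir-una-flor-marchita",
    "DOMAIN.COM",          # bare domain, as sent by some scanners
    "nodes.domain.com",
)
for ref in samples:
    # A host-based filter only sees the right-hand value;
    # a filter on the raw referrer sees the left-hand one.
    print(repr(ref), "->", referrer_host(ref))
```

A filter applied to the raw referrer string would still see `DOMAIN.COM`, which is what switching from the site to the full referrer would allow.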

Mikanoshi commented 4 years ago

It's a slightly modified nginx combined log format:

185.12.124.78 domain.com - [08/Jul/2020:11:06:28 +0500] "GET /domain.com/ HTTP/1.1" 200 10857 "DOMAIN.COM" "Mozilla/5.0 (compatible; BackupLand/1.0; https://go.backupland.com/; Domain check for viruses;)" rt=0.000 uct=0.000 uht=0.000 urt=0.000 ucache=BYPASS
54.38.81.231 domain.com - [08/Jul/2020:04:37:34 +0500] "GET /nodes.domain.com/wp-admin/admin-ajax.php?action=revslider_show_image&img=../wp-config.php HTTP/1.1" 301 306 "nodes.domain.com" "Mozilla/5.0 (Linux; Android 8.1.0; ZB602KL) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.89 Mobile Safari/537.36" rt=0.000 uct=0.000 uht=0.000 urt=0.000 ucache=BYPASS
5.188.210.87 domain.com - [14/Jul/2020:02:16:20 +0500] "GET /domain.reformal.com.domain.com/forum/index.php HTTP/1.0" 404 196 "http://domain.reformal.com.domain.com/forum/index.php" "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 Safari/537.36" rt=0.000 uct=0.000 uht=0.000 urt=0.000 ucache=MISS

Basically, referrers without a protocol and with a lot of subdomains.
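Until the filter handles such entries, one possible workaround is to pre-filter the log before analysis. The sketch below is an assumption-laden example, not part of GoAccess: it assumes the referrer is the quoted field right after the status and byte count, as in the samples above, and it uses a crude case-insensitive substring test. The helper names and the last sample line are hypothetical, and the user agents are abbreviated.

```python
import re

# Quoted request, status, bytes, then the quoted referrer field.
REFERRER_RE = re.compile(r'"[^"]*" \d{3} \d+ "([^"]*)"')

def referrer_of(line):
    """Extract the referrer field from a combined-style log line, or None."""
    match = REFERRER_RE.search(line)
    return match.group(1) if match else None

def drop_own_referrers(lines, domain):
    """Keep only lines whose referrer does not mention `domain` (case-insensitive).

    Substring matching is crude: it would also drop e.g. 'notdomain.com'.
    """
    needle = domain.lower()
    return [ln for ln in lines
            if needle not in (referrer_of(ln) or "").lower()]

SAMPLE = [
    '185.12.124.78 domain.com - [08/Jul/2020:11:06:28 +0500] "GET /domain.com/ HTTP/1.1" 200 10857 "DOMAIN.COM" "Mozilla/5.0"',
    '54.38.81.231 domain.com - [08/Jul/2020:04:37:34 +0500] "GET /nodes.domain.com/wp-admin/admin-ajax.php HTTP/1.1" 301 306 "nodes.domain.com" "Mozilla/5.0"',
    '5.188.210.87 domain.com - [14/Jul/2020:02:16:20 +0500] "GET /domain.reformal.com.domain.com/forum/index.php HTTP/1.0" 404 196 "http://domain.reformal.com.domain.com/forum/index.php" "Mozilla/5.0"',
    # Hypothetical line: own domain in the path, external referrer.
    '203.0.113.5 domain.com - [14/Jul/2020:02:20:00 +0500] "GET /domain.com/ HTTP/1.1" 200 512 "https://example.org/page" "Mozilla/5.0"',
]

for line in drop_own_referrers(SAMPLE, "domain.com"):
    print(line)
```

Only the referrer field is tested, so a request path that happens to contain the domain does not cause a line to be dropped.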

allinurl commented 4 years ago

This has been added upstream. It will be pushed out in the upcoming version. Thanks again!

allinurl commented 4 years ago

BTW, feel free to give it a shot. This can be done if you build from development.

Mikanoshi commented 4 years ago

Ignore-referer only? I actually use hide-referer: https://github.com/allinurl/goaccess/blob/master/src/parser.c#L1220

Also, what about the syntax? Will it require something like this?

hide-referer=*google.net*

And is it going to match multiple subdomains? Does * match dots?
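For comparison only, this is how the two patterns behave under ordinary shell-style globbing, using Python's `fnmatch` module. It is not GoAccess's matcher, and the behavior reported earlier in this thread (www.sub.domain.com still being listed) suggests GoAccess should not be assumed to match the same way. `glob_match` is a hypothetical helper that lowercases both sides.

```python
from fnmatch import fnmatchcase

def glob_match(pattern, value):
    """Case-insensitive shell-style glob; here '*' matches any characters, dots included."""
    return fnmatchcase(value.lower(), pattern.lower())

values = (
    "google.net",
    "www.google.net",
    "cde.fgh.google.net",
    "http://cde.fgh.google.net/blog/6/",
)
for pattern in ("*.google.net", "*google.net*"):
    for value in values:
        print(pattern, value, glob_match(pattern, value))
```

Under these semantics `*.google.net` covers any number of subdomains but misses the bare domain and full URLs, while `*google.net*` matches all four values.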