Stevie-Ray / referrer-spam-blocker

Apache, Nginx, IIS, uWSGI, Caddy & Varnish blacklist + Google Analytics segments to prevent referrer spam traffic 🤖
MIT License

Google-exclude List too long to save in google analytics #111

Closed daugsbi closed 7 years ago

daugsbi commented 7 years ago

As suggested in the readme, I tried to create a segment using the google-exclude.txt list as a regular expression. The rule could not be saved. With half of the list, it worked perfectly.

Expected Behavior

It should be possible to copy and paste the regular expression and save it.

Possible Solution

The regular expression should not exceed 30,000 characters for a segment, as stated by Arnold M here (https://www.en.advertisercommunity.com/t5/Google-Analytics-Referral-Spam/advanced-segment-maximum-length-for-regular-expressions/td-p/504007).

Verify whether this is the case and shorten the regular expression, either by including fewer URLs or by using more general patterns.

Stevie-Ray commented 7 years ago

Hi @daugsbi, thanks for your bug report. I don't think I'd be able to write a regex that covers all the URLs and still stays within 30,000 characters. What could work is to check whether the URLs in the spam list still exist and only add the active spammers. The only downside is that the Google Analytics exclude list also needs to work on spam from way back. So what would you suggest?

daugsbi commented 7 years ago

It's possible to create two segments to filter spam and use both of them (up to four segments can be applied at once, so two are still available to combine with individual segments). You could split the list into addresses beginning with 0-L and those beginning with M-Z, and reflect this change in the readme.

Stevie-Ray commented 7 years ago

Hi @daugsbi, I've added a method to create two segments. Can you verify that it works? I still have to write some code to make sure it keeps working once we hit 60,000+ characters; we are now at 48,000+. I'll update the readme if it works! ;-)

Stevie-Ray commented 7 years ago

@Gamesh Could you please take a look at the newly added Google Analytics segment exclude code?

Gamesh commented 7 years ago

@Stevie-Ray yeah, sure

Gamesh commented 7 years ago

At first glance, what I can tell is that strrpos() takes a third argument, offset, so the substr() call on line 226 could be omitted. Other than that, good job 👍 By the way, none of these string functions are multi-byte safe (only the mb_* ones are), so they can't be used on Unicode domains. Good thing you decode them before that.

We should create an optimizer pass that reduces the number of lines in the blacklist. One thing that comes to mind: we could search for all domains that share the same beginning. For example, 4webmasters.com and 4webmasters.org would become 4webmasters(\.com|\.org). That would save some space at the cost of one additional iteration.
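The optimizer pass described above could look roughly like the following. This is a minimal sketch in Python rather than the project's PHP, and it assumes the domains are already in plain ASCII form (letters, digits, hyphens, dots), so only dots need escaping. The names `escape_dots` and `group_by_stem` are hypothetical and do not come from the repository.

```python
from collections import defaultdict


def escape_dots(text):
    # Assumption: dots are the only regex metacharacter in an ASCII domain.
    return text.replace(".", r"\.")


def group_by_stem(domains):
    """Merge domains that share everything before the first dot."""
    tails_by_stem = defaultdict(list)
    for domain in sorted(set(domains)):
        stem, dot, rest = domain.partition(".")
        tails_by_stem[stem].append(dot + rest)

    patterns = []
    for stem, tails in tails_by_stem.items():
        escaped = [escape_dots(tail) for tail in tails]
        if len(escaped) == 1:
            # Nothing to merge: keep the single domain as it was.
            patterns.append(stem + escaped[0])
        else:
            # Several endings share one stem: emit stem(\.a|\.b).
            patterns.append(stem + "(" + "|".join(escaped) + ")")
    return patterns


domains = ["4webmasters.com", "4webmasters.org", "buttons-for-website.com"]
patterns = group_by_stem(domains)
```

Grouping only on the part before the first dot keeps the pass simple; it will not merge domains that differ earlier in the name, and every ending still has to be listed explicitly.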

Ubuntu101 commented 7 years ago

I tried your new PHP code for splitting up the google-exclude files. I threw a list of 120,000+ characters at it, and it does not work when creating a third or fourth file. File 1 is created correctly. The second file does not end at 30,000 characters and also breaks the input, leaving an incomplete domain name. It then goes on to create file 3, whose output is broken as well.

Ubuntu101 commented 7 years ago

@Gamesh wouldn't a 4webmaster(\.*) cover all possible extensions?

Gamesh commented 7 years ago

@Ubuntu101 Yes, it would, but no one knows whether, for example, the .net variant is a legitimate domain, which we don't want to block blindly. That would defeat the purpose of a blacklist.
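The trade-off can be illustrated with a small comparison, sketched here in Python. The broad pattern below is one possible reading of the wildcard idea, and 4webmasters.net is only a stand-in for a domain that is not known to be spam; neither is taken from the blacklist.

```python
import re

# Explicit endings: only the listed spam domains match.
narrow = re.compile(r"^4webmasters(\.com|\.org)$")
# Wildcard ending: any extension matches, listed or not.
broad = re.compile(r"^4webmasters\..*$")

referrers = ["4webmasters.com", "4webmasters.org", "4webmasters.net"]
narrow_hits = [r for r in referrers if narrow.search(r)]
broad_hits = [r for r in referrers if broad.search(r)]
```

Here `narrow_hits` leaves out the .net entry, while `broad_hits` includes it.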

The code does not create the third or fourth files; it's not currently written to do that. I'll look into it if @Stevie-Ray hasn't already started writing this code. But the problem remains: as the blacklist grows, it will still hit the maximum limit. We can only delay that by splitting into multiple files and optimizing our regex patterns to match more with less.
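A size-based split that produces as many files as needed and never cuts through a domain could be sketched as follows. This is Python rather than the project's PHP, it is not the code that was merged, and `split_patterns` is a hypothetical name. The 30,000-character default is the limit quoted earlier in this thread, not a verified figure.

```python
def split_patterns(patterns, limit=30000):
    """Pack patterns into '|'-joined chunks of at most `limit` characters."""
    chunks, current, size = [], [], 0
    for pattern in patterns:
        if len(pattern) > limit:
            raise ValueError(f"pattern longer than limit: {pattern!r}")
        # Joining adds one '|' separator unless the chunk is still empty.
        added = len(pattern) + (1 if current else 0)
        if size + added > limit:
            # The pattern would not fit: close this chunk, start a new one.
            chunks.append("|".join(current))
            current, size = [], 0
            added = len(pattern)
        current.append(pattern)
        size += added
    if current:
        chunks.append("|".join(current))
    return chunks


# Tiny limit so the split is easy to follow by hand.
example = [r"a\.com", r"bb\.org", r"ccc\.net"]
chunks = split_patterns(example, limit=15)
```

Because the split happens between whole patterns, a chunk may end up shorter than the limit, but it never ends or begins with a partial domain.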

Ubuntu101 commented 7 years ago

Hope you can figure out the file splitting. I've been messing with it myself, but I'm not the greatest at PHP coding 😬 I think splitting the list into equal chunks is still the safest way. Looking forward to seeing if you get it right.

Stevie-Ray commented 7 years ago

Hi @daugsbi, @Gamesh & @Ubuntu101. I did my best to rewrite the code. It generates the same files as before, but the exclude script should now be able to fill a ...-exclude-3 and ...-exclude-4 with data. Feel free to do a pull request 👍 🔥 Also, the multi-byte safety point is new to me. I added a Russian domain converter in the past, but I don't understand what the mb_* functions do. Thanks for your feedback!
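For readers with the same question, the multi-byte issue can be illustrated in a few lines. The sketch below uses Python rather than PHP, and the Cyrillic domain is made up for illustration. PHP's plain string functions such as strlen() and substr() count bytes, while the mb_* variants count characters; the two Python length calls mirror that difference.

```python
domain = "домен.рф"  # hypothetical Cyrillic domain, for illustration only

char_count = len(domain)                  # counts characters
byte_count = len(domain.encode("utf-8"))  # counts bytes, like PHP's strlen()

# Cutting at a byte offset can land in the middle of a character.
cut = domain.encode("utf-8")[:3]
try:
    cut.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

# Converting to the ASCII (punycode) form first avoids the problem,
# because every character is then exactly one byte.
ascii_form = domain.encode("idna")
```

Once a domain is in its ASCII form, byte-based offsets and character-based offsets agree, which is why converting before doing any string arithmetic sidesteps the issue.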

Ubuntu101 commented 7 years ago

Thanks @Stevie-Ray, I tried the new code this morning and it does the same thing. Part 1 is created perfectly, but from the second file on it starts falling apart: part 2 is oversized and cuts off at an incomplete domain name, and part 3 is correctly sized but contains data already in part 2 and also begins with a cut-off domain name.


Ubuntu101 commented 7 years ago

@daugsbi's PR solves this 100%: https://github.com/Stevie-Ray/referrer-spam-blocker/pull/113. Tested and working like a charm. Well done @daugsbi 👍


Stevie-Ray commented 7 years ago

@daugsbi I've merged your code from PR #113. Thanks, man! And @Ubuntu101, thank you for testing the script so we can confirm it works!