Charcoal-SE / SmokeDetector

Headless chatbot that detects spam and posts links to it in chatrooms for quick deletion.
https://metasmoke.erwaysoftware.com
Apache License 2.0

Do we need to change our blacklisting guidelines? #771

Closed: angussidney closed this issue 7 years ago

angussidney commented 7 years ago

When we wrote our blacklisting guidelines in October last year, we set the following requirements:

  • Website has been used in at least 2 confirmed instances of spam, i.e. reports fed back as tp. (You can use https://metasmoke.erwaysoftware.com/search to find other instances of a website being used for spam.)
  • Website is not used legitimately in other posts on Stack Exchange.
  • Website is not currently caught in any of these filters:
    • bad keyword in body
    • blacklisted website
    • pattern matching website

Circumstances have changed since then, and the number of blacklists has grown. With the addition of the !!/blacklist-* commands, over 830 more websites/keywords/usernames have been added to our blacklists. In fact, 106 (!!!) were made in the last five days alone. Many of these websites are already caught by one or two of the reasons specified above.

Considering this, I think we need to have a discussion over whether these guidelines need to be changed to reflect the way we should/are using blacklists now. What should our new guidelines be?

Other things we should think about:

  • Do we want to be blacklisting every spammy site that we see, or do we want to leave it to extreme circumstances?
  • Should we instead focus our time on improving our pattern-matching-* reasons?
  • Should average autoflag weight of matched posts have anything to do with this?
  • Should manually reported posts/posts with only 1 reason be given extra weight when counting the need for a blacklist?
  • Are our current guidelines just fine, and do we just need to enforce them more?
  • If we are going to blacklist everything, do we want to automate it somehow?
  • Should blacklist entries be removed if they don't have any hits after a certain time?
  • What does the !!/watch-keyword command have to do with this? Should it follow similar guidelines, or separate ones? Do we need to change the way it is implemented, to give it 0 weight or not send reports to MS?

What does everyone think about this?

angussidney commented 7 years ago

The majority of my points below relate to blacklisted websites only, because:

Anyway, my thoughts on this matter:

Also, my thoughts on !!/watch-keyword:

AWegnerGitHub commented 7 years ago

> Do we want to be blacklisting every spammy site that we see? Do we want to leave it to extreme circumstances?

We do not need to blacklist every site. It isn't scalable, and sites drop out of use so quickly that we end up carrying entries that no longer get any hits; while those stay on the blacklists, we still have to check them against every post.

> Should we instead focus our time on improving our pattern-matching-* reasons?

I'd prefer this option to tens of blacklisted sites a day.

> Should average autoflag weight of matched posts have anything to do with this?

Eh... maybe? I think we'd need to talk about this a bit more. To start, though, I'd say no.

> Should manually reported posts/posts with only 1 reason be given extra weight when counting the need for a blacklist?

Not necessarily. I'd rather see whether these can be added to the patterns first. If they can't be, then they should be added to the blacklist.

> Are our current guidelines just fine, and do we just need to enforce them more?

Enforce them.

> If we are going to blacklist everything, do we want to automate it somehow?

Automation is always good.

> Should blacklist entries be removed if they don't have any hits after a certain time?

Yes!

> What does the !!/watch-keyword command have to do with this? Should it follow similar guidelines? Should it have separate ones? Do we need to change the way that it is implemented, to give it 0 weight or not send reports to MS?

I thought it already had 0 weight and wasn't reported; if not, that needs to happen. It's not a blacklist yet and shouldn't be treated as one.

ArtOfCode- commented 7 years ago

I'm on mobile, so typing a lot of thoughts is difficult, but I do in general agree with what Angus and Andy are saying here. In particular:

tripleee commented 7 years ago

I'll try to post a more formalized response to all of the questions raised here, but just to quickly respond to the points I think are most urgent:

tripleee commented 7 years ago

Regarding pattern matching rules, I want to say that they have their uses, but also their dangers. As a demonstrative example, almost any pattern which matches a good subset of the Indian pharma spammer's domain names would need to be disabled on e.g. Health.SE. An explicit blacklist with one domain name per row will be usable even on that site, under the spammer's current modus operandi.

Also, an exhaustive list of this spammer's domain names would be a nice thing to have, even though of course it doesn't necessarily have to be in our list of blacklisted websites (which contains many unrelated domains and patterns, anyway).
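To make that trade-off concrete, here is a tiny sketch with made-up domain names (these are not actual blacklist entries): a broad pattern can match a legitimate health-related domain, so it would have to be disabled on Health.SE, while an exact per-domain blacklist stays safe to run there.

```python
import re

# Hypothetical broad pattern intended to cover many pharma-spam domains
broad_pattern = re.compile(r"\bhealth\w+\.com\b", re.IGNORECASE)

# Hypothetical explicit blacklist, one exact domain per entry
blacklisted_websites = {"healthsupplementzone4u.com", "buycheappillsfast.com"}

posts = {
    "legit Health.SE answer": "See the overview on healthline.com for details.",
    "pharma spam": "Order now from healthsupplementzone4u.com, best prices!!!",
}

for label, body in posts.items():
    pattern_hit = bool(broad_pattern.search(body))                       # matches both posts
    blacklist_hit = any(domain in body.lower() for domain in blacklisted_websites)  # spam only
    print(f"{label}: pattern={pattern_hit}, blacklist={blacklist_hit}")
```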

Undo1 commented 7 years ago

We used to have a tiny bit of this enforced (or at least automatically checked) by metasmoke, but that broke at some point and I haven't fixed it. I'm skeptical as always of any arguments involving blacklist speed, as I haven't seen data saying they're a Real Issue (but we do have data saying they aren't).

A good step (possibly a prerequisite) in automating this would be the ability to say which line / blacklist item a match came from, and store that in metasmoke. That would make higher-order heuristics much more feasible.

I agree with Andy on everything else. Need to get some data-driven processes around this or we're going to bikeshed on every blacklist.
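As a sketch of what "knowing which blacklist item a match came from" could look like (purely illustrative; not how SmokeDetector or metasmoke currently record matches):

```python
import re

def matching_entries(entries, post_body):
    """Return every blacklist entry (pattern) that matches the post body,
    so the specific line(s) could be stored alongside the report."""
    return [entry for entry in entries if re.search(entry, post_body, re.IGNORECASE)]

# Example with made-up entries
hits = matching_entries([r"spam-pills\.com", r"fakewatches\.net"],
                        "Buy from spam-pills.com today")
print(hits)  # ['spam-pills\\.com']
```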

tripleee commented 7 years ago

Here's my informal, off-the-cuff proposal for an updated guideline. This is posted here as a basis for discussion, not directly for approval or rejection.

j-f1 commented 7 years ago

Smokey should probably do a couple of API calls and reject any blacklist additions that don't follow the rules, unless the user is a code admin and uses !!/blacklist-<thing> -f.
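For illustration, here is a rough sketch of what such a gate might look like on the bot side. The metasmoke endpoint, the response fields, and the thresholds are assumptions for the example, not the real API or an agreed rule:

```python
import requests

METASMOKE_API = "https://metasmoke.erwaysoftware.com/api/v2.0"  # base URL; the endpoint below is assumed

def blacklist_allowed(pattern, api_key, is_code_admin=False, force=False,
                      min_tp=2, max_fp=0):
    """Hypothetical pre-check for a !!/blacklist-* command: only allow the
    addition if metasmoke shows enough confirmed true positives and no false
    positives, unless a code admin forces it with -f."""
    if is_code_admin and force:
        return True, "forced by code admin"

    # Assumed search endpoint and feedback fields; the real API may differ.
    resp = requests.get(f"{METASMOKE_API}/posts/search",
                        params={"query": pattern, "key": api_key})
    resp.raise_for_status()
    posts = resp.json().get("items", [])

    tp = sum(1 for post in posts if post.get("is_tp"))
    fp = sum(1 for post in posts if post.get("is_fp"))

    if tp >= min_tp and fp <= max_fp:
        return True, f"{tp} TP / {fp} FP"
    return False, f"not enough evidence ({tp} TP / {fp} FP); a code admin can override with -f"
```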

tripleee commented 7 years ago

Angus suggested in chat a much shorter period for the "recently below autoflag" criterion. I'd be fine with, say, two months (last ~5000 posts) instead of six, at least until we have collected more data about thresholds so that we can make a more informed decision.

ArtOfCode- commented 7 years ago

I'd probably look for more than "at least one" being under the flag threshold. Express it as a percentage, maybe: blacklist if >=5% are under threshold?
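As a toy version of that criterion (the threshold weight and the 5% figure are just the numbers floated above, not agreed values):

```python
AUTOFLAG_THRESHOLD = 280  # hypothetical autoflag weight threshold, for illustration only

def warrants_blacklist(matched_post_weights, min_fraction_below=0.05):
    """True if at least the given fraction of matched posts scored below the
    autoflag threshold, i.e. the existing reasons alone would not have
    auto-flagged them."""
    if not matched_post_weights:
        return False
    below = sum(1 for weight in matched_post_weights if weight < AUTOFLAG_THRESHOLD)
    return below / len(matched_post_weights) >= min_fraction_below

print(warrants_blacklist([300, 310, 150, 400, 500]))  # 1 of 5 below threshold -> True at the 5% level
```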

tripleee commented 7 years ago

@ArtOfCode- the "at least one" is in line with "don't guess if you know". If we can improve the "blacklisted website" heuristic to 100%, and adding a website does not jeopardize that goal, then the ability to detect known spam with confidence trumps the efficiency and cautiousness arguments. That presupposes, of course, that due diligence has been performed, i.e. that we are pretty darn sure there are no legitimate posts with that domain name.

magisch commented 7 years ago

I'm conflicted on this. If we ever start getting performance issues, this will probably be one of the reasons why, as @ArtOfCode- mentioned.

Undo1 commented 7 years ago

That's a pretty big if, @magisch. Art's EC2 has been consistently in the 50-70pps range for the last three months (https://metasmoke.erwaysoftware.com/smoke_detector/2/statistics?page=1). I haven't seen a data-backed reason to be concerned about performance, and we're running this thing on... basically toasters.

ArtOfCode- commented 7 years ago

Data

(timing data: the blacklisted-website check measured at about 6.3 ms per post)

Observations

That's not a lot. However, assume that we're on my EC2, running ~60 posts per second. That's 16.7 ms per post, and 6.3 ms is 37.8% of that time. We're probably looking at a similar but slightly smaller figure for blacklisted keywords, which means these two reasons are taking up ~75% of our post-processing time.

Is that a problem? No. Are we likely to hit critical processing time (i.e. taking longer to process a post than we have between posts coming in)? No. Could we be more efficient? Yes. Adding some of the measures that have been tossed around in this discussion could see the percentage (or just the absolute time) that blacklists use reduced, which would increase our PPS value. We don't absolutely need it, but faster systems are generally not a bad thing - particularly given that we don't always run on an EC2 with plenty of processing power.
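The arithmetic behind those percentages, spelled out (the keyword-check figure is only the "similar but slightly smaller" assumption from above):

```python
posts_per_second = 60                       # "running ~60 posts per second"
ms_per_post = 1000 / posts_per_second       # ~16.7 ms of budget per post
website_check_ms = 6.3                      # measured cost of the blacklisted-website check
keyword_check_ms = 6.0                      # assumed: similar but slightly smaller

website_share = website_check_ms / ms_per_post                        # ~0.378 -> ~37.8%
combined_share = (website_check_ms + keyword_check_ms) / ms_per_post  # ~0.74 -> roughly 75%

print(f"{ms_per_post:.1f} ms per post; website check {website_share:.1%}; "
      f"both blacklist checks ~{combined_share:.0%}")
```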

Undo1 commented 7 years ago

@ArtOfCode- Curious - how does it scale by number of entries? If it's O(n^2), that's a lot more worrisome than if it's linear or O(ln(n)), for example.

ArtOfCode- commented 7 years ago

@Undo1 It's actually under linear.

O-notations are rolling averages.

SmokeDetector commented 7 years ago

Almost looks logarithmic, then. Still okay to optimize, but not really a huge priority.

-- Undo

ArtOfCode- commented 7 years ago

Not logarithmic either, @Undo1 - I just ran more data and got a graph:

[graph: check time against number of blacklist entries]

magisch commented 7 years ago

@ArtOfCode- Looks like it approaches roughly O(0.5n) eventually. So not logarithmic, but slightly under-linear and certainly not exponential.
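For anyone who wants to check that kind of scaling claim themselves, here is a minimal timing sketch: it builds one combined alternation from made-up entries and measures how the per-post check time grows with the number of entries. It only shows the shape of the measurement; the entries, the post body, and the single-regex approach are assumptions, not SmokeDetector's actual implementation.

```python
import re
import time

def check_time_ms(num_entries, runs=200):
    """Average time in ms to run one combined blacklist regex over one post."""
    entries = [r"spam-domain%d\.com" % i for i in range(num_entries)]
    combined = re.compile("|".join(entries))
    post = "A post body mentioning spam-domain%d.com somewhere in the text." % (num_entries - 1)
    start = time.perf_counter()
    for _ in range(runs):
        combined.search(post)
    return (time.perf_counter() - start) / runs * 1000

for n in (100, 200, 400, 800, 1600):
    print(n, "entries:", round(check_time_ms(n), 3), "ms")  # watch how the cost grows with n
```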

tripleee commented 7 years ago

Here are some recent watchlist additions which previously I would have blacklisted as "slam dunk". I'm thinking I'd like to add something like "more than 30 TP and no FP" as one condition for blacklisting a website, i.e. relax the age condition if there is enough proof.

ArtOfCode- commented 7 years ago

@tripleee I don't get it. If the latest post below the autoflagging threshold was that long ago, why do we need to give it more weight?

tripleee commented 7 years ago

I'll try to summarize my reasoning but there are probably still more reasons.

Tangentially, maybe see also the discussion on the pending pull request https://github.com/Charcoal-SE/SmokeDetector/pull/757.

tripleee commented 7 years ago

Another brief discussion here, with some additional viewpoints. http://chat.stackexchange.com/transcript/message/38215977#38215977

tripleee commented 7 years ago

Another brief discussion, or more like amplification of the above: http://chat.stackexchange.com/transcript/message/38262392#38262392

tripleee commented 7 years ago

Here is now my attempt at a final proposal. I have not received any feedback on the limits so I leave them at the proposed numbers.

This is basically identical to the proposal from a month ago, with the amendment to allow for blacklisting domain names with substantial evidence from more than 6 months back but few recent hits. Also, the keyword blacklisting requirement is now at least two hits.

The bullet points marked Rationale: are for background, and I don't think they ultimately need to be included in the wiki documentation.

ArtOfCode- commented 7 years ago

@tripleee's suggestion works for me, and makes sense with the rationale included (thanks). I'm gonna wait until the US wakes up and has had a chance to read it, but other than that, AFAIC this can go in the wiki.

AWegnerGitHub commented 7 years ago

At this point, I think we can move those guidelines to the wiki. It's been over a week with no objections.

tripleee commented 7 years ago

I edited the wiki pages by clicking "Edit on GitHub" but I'm not sure if something still needs to be done to make the changes visible on charcoal-se.org too.

ArtOfCode- commented 7 years ago

That's done and live now.