Charcoal-SE / SmokeDetector

Headless chatbot that detects spam and posts links to it in chatrooms for quick deletion.
https://metasmoke.erwaysoftware.com
Apache License 2.0

Do we need to change our blacklisting guidelines? #771

Closed: angussidney closed this issue 7 years ago

angussidney commented 7 years ago

When we wrote our blacklisting guidelines in October last year, we set the following requirements:

  • Website has been used in at least 2 confirmed instances of spam, i.e. reports fed back as tp. (You can use https://metasmoke.erwaysoftware.com/search to find other instances of a website being used for spam.)
  • Website is not used legitimately in other posts on Stack Exchange.
  • Website is not currently caught in any of these filters:
    • bad keyword in body
    • blacklisted website
    • pattern matching website

Circumstances have changed since then, and the number of blacklists has grown. With the addition of the !!/blacklist-* commands, over 830 more websites/keywords/usernames have been added to our blacklists. In fact, 106 (!!!) were made in the last five days alone. Many of these websites are already caught by one or two of the reasons specified above.

Considering this, I think we need to have a discussion over whether these guidelines need to be changed to reflect the way we should/are using blacklists now. What should our new guidelines be?

Other things we should think about:

  • Do we want to be blacklisting every spammy site that we see, or do we want to leave it to extreme circumstances?
  • Should we instead focus our time on improving our pattern-matching-* reasons?
  • Should average autoflag weight of matched posts have anything to do with this?
  • Should manually reported posts/posts with only 1 reason be given extra weight when counting the need for a blacklist?
  • Are our current guidelines just fine, and do we just need to enforce them more?
  • If we are going to blacklist everything, do we want to automate it somehow?
  • Should blacklist entries be removed if they don't have any hits after a certain time?
  • What does the !!/watch-keyword command have to do with this? Should it follow similar guidelines, or separate ones? Do we need to change the way it is implemented, to give it 0 weight or not send reports to MS?

What does everyone think about this?

angussidney commented 7 years ago

The majority of my points below relate to blacklisted websites only, because:

Anyway, my thoughts on this matter:

Also, my thoughts on !!/watch-keyword:

AWegnerGitHub commented 7 years ago

> Do we want to be blacklisting every spammy site that we see? Do we want to leave it to extreme circumstances?

We do not need to blacklist every site. It isn't scalable, and sites drop out of use so quickly that we end up carrying entries that no longer get any hits; while those stay on the blacklists, we still have to check them against every post.

> Should we instead focus our time on improving our pattern-matching-* reasons?

I'd prefer this option to tens of blacklisted sites a day.

> Should average autoflag weight of matched posts have anything to do with this?

Eh... maybe? I think we'd need to talk about this a bit more. To start, though, I'd say no.

> Should manually reported posts/posts with only 1 reason be given extra weight when counting the need for a blacklist?

Not necessarily. I'd rather see whether these can be added to the patterns first. If they can't be, then they should be added to the blacklist.

> Are our current guidelines just fine, and do we just need to enforce them more?

Enforce them.

> If we are going to blacklist everything, do we want to automate it somehow?

Automation is always good.

> Should blacklist entries be removed if they don't have any hits after a certain time?

Yes!

> What does the !!/watch-keyword command have to do with this? Should it follow similar guidelines? Should it have separate ones? Do we need to change the way that it is implemented, to give it 0 weight or not send reports to MS?

I thought it already had 0 weight and wasn't reported; if not, that needs to happen. It's not a blacklist yet and shouldn't be treated as one.

ArtOfCode- commented 7 years ago

I'm on mobile, so typing a lot of thoughts is difficult, but I do in general agree with what Angus and Andy are saying here. In particular:

tripleee commented 7 years ago

I'll try to post a more formalized response to all of the questions raised here, but just to quickly respond to the points I think are most urgent:

tripleee commented 7 years ago

Regarding pattern matching rules, I want to say that they have their uses, but also their dangers. As a demonstrative example, almost any pattern which matches a good subset of the Indian pharma spammer's domain names would need to be disabled on e.g. Health.SE. An explicit blacklist with one domain name per row will be usable even on that site, under the spammer's current modus operandi.

Also, an exhaustive list of this spammer's domain names would be a nice thing to have, even though of course it doesn't necessarily have to be in our list of blacklisted websites (which contains many unrelated domains and patterns, anyway).
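To make that trade-off concrete, here is a tiny sketch with made-up domain names (these are not actual blacklist entries): a broad pattern can match a legitimate health-related domain, so it would have to be disabled on Health.SE, while an exact per-domain blacklist stays safe to run there.

```python
import re

# Hypothetical broad pattern intended to cover many pharma-spam domains
broad_pattern = re.compile(r"\bhealth\w+\.com\b", re.IGNORECASE)

# Hypothetical explicit blacklist, one exact domain per entry
blacklisted_websites = {"healthsupplementzone4u.com", "buycheappillsfast.com"}

posts = {
    "legit Health.SE answer": "See the overview on healthline.com for details.",
    "pharma spam": "Order now from healthsupplementzone4u.com, best prices!!!",
}

for label, body in posts.items():
    pattern_hit = bool(broad_pattern.search(body))                       # matches both posts
    blacklist_hit = any(domain in body.lower() for domain in blacklisted_websites)  # spam only
    print(f"{label}: pattern={pattern_hit}, blacklist={blacklist_hit}")
```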

Undo1 commented 7 years ago

We used to have a tiny bit of this enforced (or at least automatically checked) by metasmoke, but that broke at some point and I haven't fixed it. I'm skeptical as always of any arguments involving blacklist speed, as I haven't seen data saying they're a Real Issue (but we do have data saying they aren't).

A good step (possibly a prerequisite) in automating this would be the ability to say which line / blacklist item a match came from, and store that in metasmoke. That would make higher-order heuristics much more feasible.

I agree with Andy on everything else. Need to get some data-driven processes around this or we're going to bikeshed on every blacklist.
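As a sketch of what "knowing which blacklist item a match came from" could look like (purely illustrative; not how SmokeDetector or metasmoke currently record matches):

```python
import re

def matching_entries(entries, post_body):
    """Return every blacklist entry (pattern) that matches the post body,
    so the specific line(s) could be stored alongside the report."""
    return [entry for entry in entries if re.search(entry, post_body, re.IGNORECASE)]

# Example with made-up entries
hits = matching_entries([r"spam-pills\.com", r"fakewatches\.net"],
                        "Buy from spam-pills.com today")
print(hits)  # ['spam-pills\\.com']
```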

tripleee commented 7 years ago

Here's my informal, off-the-cuff proposal for an updated guideline. This is posted here as a basis for discussion, not directly for approval or rejection.

j-f1 commented 7 years ago

Smokey should probably do a couple of API calls and reject any blacklist additions that don't follow the rules, unless the user is a code admin and uses !!/blacklist-<thing> -f.
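For illustration, here is a rough sketch of what such a gate might look like on the bot side. The metasmoke endpoint, the response fields, and the thresholds are assumptions for the example, not the real API or an agreed rule:

```python
import requests

METASMOKE_API = "https://metasmoke.erwaysoftware.com/api/v2.0"  # base URL; the endpoint below is assumed

def blacklist_allowed(pattern, api_key, is_code_admin=False, force=False,
                      min_tp=2, max_fp=0):
    """Hypothetical pre-check for a !!/blacklist-* command: only allow the
    addition if metasmoke shows enough confirmed true positives and no false
    positives, unless a code admin forces it with -f."""
    if is_code_admin and force:
        return True, "forced by code admin"

    # Assumed search endpoint and feedback fields; the real API may differ.
    resp = requests.get(f"{METASMOKE_API}/posts/search",
                        params={"query": pattern, "key": api_key})
    resp.raise_for_status()
    posts = resp.json().get("items", [])

    tp = sum(1 for post in posts if post.get("is_tp"))
    fp = sum(1 for post in posts if post.get("is_fp"))

    if tp >= min_tp and fp <= max_fp:
        return True, f"{tp} TP / {fp} FP"
    return False, f"not enough evidence ({tp} TP / {fp} FP); a code admin can override with -f"
```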

tripleee commented 7 years ago

Angus suggested in chat a much shorter period for the "recently below autoflag" criterion. I'd be fine with, say, two months (last ~5000 posts) instead of six, at least until we have collected more data about thresholds so that we can make a more informed decision.

ArtOfCode- commented 7 years ago

I'd probably look for more than "at least one" being under the flag threshold. Express it as a percentage, maybe: blacklist if >=5% are under threshold?
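As a toy version of that criterion (the threshold weight and the 5% figure are just the numbers floated above, not agreed values):

```python
AUTOFLAG_THRESHOLD = 280  # hypothetical autoflag weight threshold, for illustration only

def warrants_blacklist(matched_post_weights, min_fraction_below=0.05):
    """True if at least the given fraction of matched posts scored below the
    autoflag threshold, i.e. the existing reasons alone would not have
    auto-flagged them."""
    if not matched_post_weights:
        return False
    below = sum(1 for weight in matched_post_weights if weight < AUTOFLAG_THRESHOLD)
    return below / len(matched_post_weights) >= min_fraction_below

print(warrants_blacklist([300, 310, 150, 400, 500]))  # 1 of 5 below threshold -> True at the 5% level
```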

tripleee commented 7 years ago

@ArtOfCode- the "at least one" is in line with "don't guess if you know". If we can improve the "blacklisted website" heuristic to 100%, and adding a website does not jeopardize that goal, then the ability to detect known spam with confidence trumps the efficiency and cautiousness arguments. That presupposes, of course, that due diligence has been performed, i.e. that we are pretty darn sure there are no legitimate posts with that domain name.

magisch commented 7 years ago

I'm conflicted on this. If we ever start getting performance issues, this will probably be one of the reasons why, as @ArtOfCode- mentioned.

Undo1 commented 7 years ago

That's a pretty big if, @magisch. Art's EC2 has been consistently in the 50-70pps range for the last three months (https://metasmoke.erwaysoftware.com/smoke_detector/2/statistics?page=1). I haven't seen a data-backed reason to be concerned about performance, and we're running this thing on... basically toasters.

ArtOfCode- commented 7 years ago

Data

(timing data: the blacklisted-website check measured at about 6.3 ms per post)

Observations

That's not a lot. However, assume that we're on my EC2, running ~60 posts per second. That's 16.7 ms per post, and 6.3 ms is 37.8% of that time. We're probably looking at a similar but slightly smaller figure for blacklisted keywords, which means these two reasons are taking up ~75% of our post-processing time.

Is that a problem? No. Are we likely to hit critical processing time (i.e. taking longer to process a post than we have between posts coming in)? No. Could we be more efficient? Yes. Adding some of the measures that have been tossed around in this discussion could see the percentage (or just the absolute time) that blacklists use reduced, which would increase our PPS value. We don't absolutely need it, but faster systems are generally not a bad thing - particularly given that we don't always run on an EC2 with plenty of processing power.
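The arithmetic behind those percentages, spelled out (the keyword-check figure is only the "similar but slightly smaller" assumption from above):

```python
posts_per_second = 60                       # "running ~60 posts per second"
ms_per_post = 1000 / posts_per_second       # ~16.7 ms of budget per post
website_check_ms = 6.3                      # measured cost of the blacklisted-website check
keyword_check_ms = 6.0                      # assumed: similar but slightly smaller

website_share = website_check_ms / ms_per_post                        # ~0.378 -> ~37.8%
combined_share = (website_check_ms + keyword_check_ms) / ms_per_post  # ~0.74 -> roughly 75%

print(f"{ms_per_post:.1f} ms per post; website check {website_share:.1%}; "
      f"both blacklist checks ~{combined_share:.0%}")
```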

Undo1 commented 7 years ago

@ArtOfCode- Curious - how does it scale by number of entries? If it's O(n^2), that's a lot more worrisome than if it's linear or O(ln(n)), for example.

ArtOfCode- commented 7 years ago

@Undo1 It's actually under linear.

O-notations are rolling averages.

SmokeDetector commented 7 years ago

Almost looks logarithmic, then. Still okay to optimize, but not really a huge priority.

-- Undo

ArtOfCode- commented 7 years ago

Not logarithmic either, @Undo1 - I just ran more data and got a graph:

[graph: check time against number of blacklist entries]

magisch commented 7 years ago

@ArtOfCode- Looks like it approaches roughly O(0.5n) eventually. So not logarithmic, but slightly under-linear and certainly not exponential.
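For anyone who wants to check that kind of scaling claim themselves, here is a minimal timing sketch: it builds one combined alternation from made-up entries and measures how the per-post check time grows with the number of entries. It only shows the shape of the measurement; the entries, the post body, and the single-regex approach are assumptions, not SmokeDetector's actual implementation.

```python
import re
import time

def check_time_ms(num_entries, runs=200):
    """Average time in ms to run one combined blacklist regex over one post."""
    entries = [r"spam-domain%d\.com" % i for i in range(num_entries)]
    combined = re.compile("|".join(entries))
    post = "A post body mentioning spam-domain%d.com somewhere in the text." % (num_entries - 1)
    start = time.perf_counter()
    for _ in range(runs):
        combined.search(post)
    return (time.perf_counter() - start) / runs * 1000

for n in (100, 200, 400, 800, 1600):
    print(n, "entries:", round(check_time_ms(n), 3), "ms")  # watch how the cost grows with n
```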

tripleee commented 7 years ago

Here are some recent watchlist additions which previously I would have blacklisted as "slam dunk". I'm thinking I'd like to add something like "more than 30 TP and no FP" as one condition for blacklisting a website, i.e. relax the age condition if there is enough proof.

ArtOfCode- commented 7 years ago

@tripleee I don't get it. If the latest post below the autoflagging threshold was that long ago, why do we need to give it more weight?

tripleee commented 7 years ago

I'll try to summarize my reasoning but there are probably still more reasons.

Tangentially, maybe see also the discussion on the pending pull request https://github.com/Charcoal-SE/SmokeDetector/pull/757.

tripleee commented 7 years ago

Another brief discussion here, with some additional viewpoints. http://chat.stackexchange.com/transcript/message/38215977#38215977

tripleee commented 7 years ago

Another brief discussion, or more like amplification of the above: http://chat.stackexchange.com/transcript/message/38262392#38262392

tripleee commented 7 years ago

Here is now my attempt at a final proposal. I have not received any feedback on the limits so I leave them at the proposed numbers.

This is basically identical to the proposal from a month ago, with the amendment to allow for blacklisting domain names with substantial evidence from more than 6 months back but few recent hits. Also, the keyword blacklisting requirement is now at least two hits.

The bullet points marked Rationale: are for background, and I don't think they ultimately need to be included in the wiki documentation.

ArtOfCode- commented 7 years ago

@tripleee's suggestion works for me, and makes sense with the rationale included (thanks). I'm gonna wait until the US wakes up and has had a chance to read it, but other than that, AFAIC this can go in the wiki.

AWegnerGitHub commented 7 years ago

At this point, I think we can move those guidelines to the wiki. It's been over a week with no objections.

tripleee commented 7 years ago

I edited the wiki pages by clicking "Edit on GitHub" but I'm not sure if something still needs to be done to make the changes visible on charcoal-se.org too.

ArtOfCode- commented 7 years ago

That's done and live now.