Charcoal-SE / SmokeDetector

Headless chatbot that detects spam and posts links to it in chatrooms for quick deletion.
https://metasmoke.erwaysoftware.com
Apache License 2.0

Should we have a 'pharma spam site' reason? #971

Closed superplane39 closed 7 years ago

superplane39 commented 7 years ago

We see a lot of Indian pharma spam. We know that. We also know that these spammers often rotate a lot of different sites that they spam.

I propose adding another reason - pharma spam site in {} - for a couple of reasons.

1.) to raise the weight of these posts

If we add another reason, one that I'm guessing would have close to 0 FPs, that would increase the weight of these posts. The weight matters because it affects the autoflags (especially if we scale the number of flags with the weight, e.g. an 800-weight post would get 4 flags while a 500-weight post would get only 3).

2.) to keep a list of these sites

Why do we need a list? So that we can help get these taken down. @tripleee is working on submitting complaints (kudos!), and it's easier to submit complaints when you have a list of the problematic sites.
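
The weight-to-flags idea from point 1 could look roughly like the sketch below. Only the 800 -> 4 and 500 -> 3 pairs come from the example above; the other thresholds and the names `FLAG_THRESHOLDS` and `flags_for_weight` are hypothetical and do not describe how autoflagging actually works.

```python
# Hypothetical thresholds: only the 800 -> 4 and 500 -> 3 rows come from the
# proposal; the remaining rows are made up for illustration.
FLAG_THRESHOLDS = [
    (800, 4),
    (500, 3),
    (300, 2),
    (0, 1),
]


def flags_for_weight(weight):
    """Return the number of autoflags for a post with the given total weight."""
    # Thresholds are sorted from highest to lowest, so the first match wins.
    for minimum, flags in FLAG_THRESHOLDS:
        if weight >= minimum:
            return flags
    return 0  # below every threshold (negative weight): no flags
```

With this table, adding a high-precision reason that pushes a post from 700 to 800 would move it from 3 flags to 4.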

So - what do people think? Pros? Cons? Additional complications?

tripleee commented 7 years ago

A couple of comments here: https://chat.stackexchange.com/transcript/message/38892474#38892474

tripleee commented 7 years ago

https://gist.github.com/tripleee/ab226f77b6deaf4ffea6d22d9b976beb contains 481 domain names extracted from the current 7505 hits for reason #106

There are probably a few stray domains with just a single hit -- let me know if I should try to process this further.
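
A minimal sketch of how such a list could be built and the single-hit strays filtered out. It is not the script used to produce the gist; the URL regex, the function names, and the `min_hits` parameter are assumptions made for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Deliberately simple URL matcher (an assumption, not SmokeDetector's regex).
URL_RE = re.compile(r"https?://[^\s\"'<>)\]]+", re.IGNORECASE)


def extract_domains(text):
    """Return the set of host names linked from one post body."""
    domains = set()
    for url in URL_RE.findall(text):
        try:
            host = urlparse(url).hostname
        except ValueError:  # malformed URL, e.g. an unbalanced "["
            continue
        if not host:
            continue
        host = host.rstrip(".")
        if host.startswith("www."):
            host = host[4:]
        if host:
            domains.add(host)
    return domains


def count_domains(posts, min_hits=2):
    """Count posts per domain and drop domains seen in fewer than min_hits posts."""
    counts = Counter()
    for body in posts:
        counts.update(extract_domains(body))  # each post counts once per domain
    return {domain: n for domain, n in counts.items() if n >= min_hits}
```

Counting each domain once per post keeps a single post that repeats the same URL many times from looking like several hits.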

angussidney commented 7 years ago

I think it would be a good idea, as long as all of the existing pharma domains in the blacklists are moved over into the new rule. Otherwise we would have two reasons which trigger on the same criteria, which is a bad idea.

tripleee commented 7 years ago

On closer inspection the "repeated URL at end of long post" is not exclusively Indian pharma after all. There are hits from support telephone number spam, MP3 sites, Oracle training etc. But I'm hoping the gist would be useful as a starting point nevertheless.

j-f1 commented 7 years ago

Maybe there should be !!/pharm and !!/unpharm commands to modify the list.

honnza commented 7 years ago

Does this separation lose sensitivity for other valid spam detections? We don't want that...

tripleee commented 7 years ago

@honnza How do you mean? Moving some domains from the general blacklist to a more focused high-precision blacklist should not lose any existing functionality.

Undo1 commented 7 years ago

Fun fact: Metasmoke ignores anything in parentheses in reason names. We could have "bad keyword in body (pharma)" as a reason name. It'd do nothing to autoflagging, but would be searchable in the why data.
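
To illustrate the behaviour described here, the sketch below splits a reason name into its base name and any parenthesized qualifiers. This is not Metasmoke's code; `base_reason` and `reason_qualifiers` are hypothetical helpers.

```python
import re

# Matches one non-nested parenthesized group plus any whitespace before it.
PAREN_RE = re.compile(r"\s*\(([^()]*)\)")


def base_reason(reason_name):
    """Return the reason name with every parenthesized part removed."""
    return PAREN_RE.sub("", reason_name).strip()


def reason_qualifiers(reason_name):
    """Return the text of every parenthesized part, in order."""
    return [q.strip() for q in PAREN_RE.findall(reason_name)]
```

Under this scheme "bad keyword in body (pharma)" and "bad keyword in body" would share one base reason, while the qualifier stays available for searching.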

tripleee commented 7 years ago

But I want to be able to find, search, manipulate, and organize these hits in Metasmoke.

Undo1 commented 7 years ago

Absolutely. We could append the original reason set to the why data to make searching possible.

tripleee commented 7 years ago

I can search for "pharma" in "why" and that currently gets me 42/42. But the "why" data is currently awfully unstructured, and contains bits and pieces of the original post. (It's also hard to see which snippet corresponds to which reason. You see "bad keyword in body" and you can search the body hits in "why" and usually figure out which one corresponds to that reason, but it's not always straightforward.) What I'm hoping is that we could have a separate reason to make it easy and obvious how to list just the posts which belong to this set, and no others. It can be done via "why" but the way that it is currently (not) structured, avoiding false positives in the essentially free-form text is basically impossible without additional postprocessing.

For this reason, I'm hoping we could have a dedicated reason; this is a first-class Metasmoke identity that you can search unambiguously right from the Metasmoke search panel.

Granted, that would pollute the currently high-level and generic reasons hierarchy.

I can see two ways this could be avoided:

  1. Revamp "why", or replace it with a structured format which can be unambiguously searched and manipulated (and also improve the mapping between reasons and "why" indicators).
  2. Generalize the regex-based blacklists to "sets" (for lack of a better word) where each regex has a tag identifying which set it belongs to. There could be a hierarchy, like (ad-libbing here, bear with me) blacklist.website vs blacklist.website.pharma vs blacklist.website.supportnumber vs watch.website.pharma vs blacklist.keyword.pharma etc. I'm not entirely sure how this should be tagged, identified, and searchable in Metasmoke. As a first approximation, putting these basically machine-readable tags in "why" would at least make searching that data reasonably unambiguous.
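
A rough sketch of what option 2 could look like, assuming dotted tags where a longer tag belongs to every shorter prefix set. The `TaggedPattern` class, the example patterns, and the helper names are all hypothetical; nothing like this exists in SmokeDetector as far as this thread shows.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedPattern:
    tag: str             # dotted set name, e.g. "blacklist.website.pharma"
    pattern: re.Pattern  # compiled regex for one blacklist or watchlist entry


# Made-up entries on reserved .example domains, for illustration only.
PATTERNS = [
    TaggedPattern("blacklist.website.pharma", re.compile(r"pills\.example")),
    TaggedPattern("blacklist.website.supportnumber", re.compile(r"helpdesk\.example")),
    TaggedPattern("watch.website.pharma", re.compile(r"newpharma\.example")),
]


def in_set(tag, prefix):
    """True if tag is the set named by prefix or one of its sub-sets."""
    return tag == prefix or tag.startswith(prefix + ".")


def matching_tags(text, prefix=""):
    """Return sorted tags of all patterns that hit text, optionally limited to one set."""
    return sorted(
        p.tag
        for p in PATTERNS
        if (not prefix or in_set(p.tag, prefix)) and p.pattern.search(text)
    )
```

The tags returned by `matching_tags` are the kind of machine-readable string that could be written into "why", so that a search for `blacklist.website.pharma` finds exactly that set and nothing else.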

tripleee commented 7 years ago

Tangentially, I'm also thinking Metasmoke v2 should have URLs and domains exposed, cataloged, tracked, indexed, etc. Maybe at that point at least we could figure out a way to collect related domains (pharma domains, support number domains, and why not phone numbers and email addresses) by some sort of tagging (named sets or just user-assigned free-form tags?)

Undo1 commented 7 years ago

Wouldn't be that hard to do retroactively. Just need to parse them out of the post text and store them.

tripleee commented 7 years ago

Why was this closed? I think we should still pursue this as a separate reason somehow, and have been slowly working towards compiling a list of domain names which should be moved.

ArtOfCode- commented 7 years ago

Closed because there's been no discussion in a month. We can still have further discussion here, but unless someone's actively going to work on it there's no point having the issue open.