superplane39 closed 7 years ago
A couple of comments here: https://chat.stackexchange.com/transcript/message/38892474#38892474
https://gist.github.com/tripleee/ab226f77b6deaf4ffea6d22d9b976beb contains 481 domain names extracted from the current 7505 hits for reason #106
There are probably a few stray domains with just a single hit -- let me know if I should try to process this further.
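For anyone who wants to reproduce or extend this kind of extraction, here is a minimal sketch of how domains could be pulled out of post bodies and split into frequent and single-hit ones. It is not the script used for the gist; the regex, the function names and the `min_hits` cutoff are all assumptions made for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# Deliberately simple URL pattern (an assumption, not SmokeDetector's regex).
URL_RE = re.compile(r"https?://[^\s\"'<>()\[\]]+", re.IGNORECASE)


def extract_domains(text):
    """Return the lowercased host names of all http(s) URLs found in text."""
    domains = []
    for url in URL_RE.findall(text):
        host = urlparse(url).hostname  # already lowercased by urllib
        if host:
            # Treat www.example.com and example.com as the same domain.
            domains.append(host[4:] if host.startswith("www.") else host)
    return domains


def count_domains(posts):
    """Count in how many posts each domain appears (at most once per post)."""
    counts = Counter()
    for body in posts:
        counts.update(set(extract_domains(body)))
    return counts


def split_by_hits(counts, min_hits=2):
    """Separate domains with at least min_hits hits from the stray ones."""
    frequent = sorted(d for d, n in counts.items() if n >= min_hits)
    stray = sorted(d for d, n in counts.items() if n < min_hits)
    return frequent, stray
```

Counting once per post keeps a URL that is repeated at the end of a single long post from looking like several independent hits.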
I think it would be a good idea, as long as all of the existing pharma domains in the blacklists are moved over into the new rule. Otherwise we would have two reasons which trigger on the same criteria, which is a bad idea.
On closer inspection the "repeated URL at end of long post" is not exclusively Indian pharma after all. There are hits from support telephone number spam, MP3 sites, Oracle training etc. But I'm hoping the gist would be useful as a starting point nevertheless.
Maybe there should be !!/pharm and !!/unpharm commands to modify the list.
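A rough sketch of what such a pair of commands might do, purely as an illustration: the handler name, the reply strings and the in-memory set are hypothetical and are not SmokeDetector's actual command API or storage.

```python
def handle_pharma_command(message, domains):
    """Handle '!!/pharm <domain>' and '!!/unpharm <domain>'.

    `domains` is a mutable set holding the pharma list (hypothetical storage).
    Returns a reply string, or None if the message is not one of the commands.
    """
    parts = message.strip().split()
    if len(parts) != 2 or parts[0] not in ("!!/pharm", "!!/unpharm"):
        return None
    command, domain = parts[0], parts[1].lower()
    if command == "!!/pharm":
        if domain in domains:
            return "{} is already on the pharma list.".format(domain)
        domains.add(domain)
        return "Added {} to the pharma list.".format(domain)
    # Otherwise the command is !!/unpharm.
    if domain not in domains:
        return "{} is not on the pharma list.".format(domain)
    domains.remove(domain)
    return "Removed {} from the pharma list.".format(domain)
```

A real implementation would also need permission checks and persistent storage, which are left out here.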
Does this separation lose sensitivity for other valid spam detections? We don't want that...
@honnza How do you mean? Moving some domains from the general blacklist to a more focused high-precision blacklist should not lose any existing functionality.
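To make the "nothing is lost" argument concrete, here is a small sketch under the simplifying assumption that both lists hold plain domain strings (the real blacklists may hold patterns rather than literal domains). The function name is hypothetical.

```python
def move_to_pharma(general, pharma, to_move):
    """Move domains from the general blacklist to the pharma list.

    Returns (new_general, new_pharma, missing), where `missing` lists
    requested domains that were not on the general blacklist and were
    therefore left alone.
    """
    to_move = set(to_move)
    missing = to_move - set(general)
    moved = to_move - missing
    new_general = [d for d in general if d not in moved]
    new_pharma = sorted(set(pharma) | moved)
    return new_general, new_pharma, sorted(missing)
```

After the move, the union of the two lists equals the union before it, so every domain is still caught, and the lists are disjoint, so no domain triggers both reasons.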
Fun fact: Metasmoke ignores anything in parentheses in reason names. We could have "bad keyword in body (pharma)" as a reason name. It'd do nothing to autoflagging, but would be searchable in the why data.
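As an illustration of that behaviour (not Metasmoke's own code, which is not shown in this thread), stripping a parenthesised qualifier from a reason name could look like the sketch below. Both helper names are made up.

```python
import re

# Matches a parenthesised chunk plus any whitespace in front of it.
PAREN_RE = re.compile(r"\s*\([^()]*\)")


def base_reason(name):
    """Return the reason name with any parenthesised part removed."""
    return PAREN_RE.sub("", name).strip()


def qualifier(name):
    """Return the text inside the first parentheses, or None if absent."""
    match = re.search(r"\(([^()]*)\)", name)
    return match.group(1) if match else None
```

With this, "bad keyword in body (pharma)" and "bad keyword in body" collapse to the same base reason, while the qualifier stays available for searching.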
But I want to be able to find, search, manipulate, and organize these hits in Metasmoke.
Absolutely. We could append the original reason set to the why data to make searching possible.
I can search for "pharma" in "why" and that currently gets me 42/42. But the "why" data is currently awfully unstructured, and contains bits and pieces of the original post. (It's also hard to see which snippet corresponds to which reason. You see "bad keyword in body" and you can search the body hits in "why" and usually figure out which one corresponds to that reason, but it's not always straightforward.) What I'm hoping is that we could have a separate reason to make it easy and obvious how to list just the posts which belong to this set, and no others. It can be done via "why" but the way that it is currently (not) structured, avoiding false positives in the essentially free-form text is basically impossible without additional postprocessing.
For this reason, I'm hoping we could have a dedicated reason; this is a first-class Metasmoke identity that you can search unambiguously right from the Metasmoke search panel.
Granted, that would pollute the currently high-level and generic reasons hierarchy.
I can see two ways this could be avoided;
Tangentially, I'm also thinking Metasmoke v2 should have URLs and domains exposed, cataloged, tracked, indexed, etc. Maybe at that point at least we could figure out a way to collect related domains (pharma domains, support number domains, and why not phone numbers and email addresses) by some sort of tagging (named sets or just user-assigned free-form tags?)
Wouldn't be that hard to do retroactively. Just need to parse them out of the post text and store them.
Why was this closed? I think we should still pursue this as a separate reason somehow, and have been slowly working towards compiling a list of domain names which should be moved.
Closed because there's been no discussion in a month. We can still have further discussion here, but unless someone's actively going to work on it there's no point having the issue open.
We see a lot of Indian pharma spam. We know that. We also know that these spammers often rotate a lot of different sites that they spam.
I propose adding another reason - pharma spam site in {} - for a couple of reasons.
1.) to raise the weight of these posts
If we have another reason with, I'm guessing, around 0 FPs, that would increase the weight of these posts. Increasing the weight is important, as that affects the autoflags (especially if we increase the number of flags depending on the weight - i.e. an 800 weight post would get 4 flags while 500 would get only 3).
2.) to keep a list of these sites
Why do we need a list? So that we can help get these taken down. @tripleee is working on submitting complaints (kudos!), and it's easier to submit complaints when you have a list of the problematic sites.
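The weight-dependent flag count from point 1 could be sketched roughly like this. The only tiers are the two from the example above (800 gives 4 flags, 500 gives 3); everything else, including the assumption that a post's weight is the sum of its reasons' weights and that anything below the lowest tier gets no flags, is a guess for illustration.

```python
# (minimum weight, number of flags), highest tier first.
# Only these two tiers come from the proposal; real values would need tuning.
FLAG_TIERS = [(800, 4), (500, 3)]


def post_weight(reason_weights):
    """Assume a post's weight is the sum of the weights of its reasons."""
    return sum(reason_weights.values())


def flags_for_weight(weight, tiers=FLAG_TIERS):
    """Return the number of flags for the first tier the weight reaches."""
    for minimum, flags in tiers:
        if weight >= minimum:
            return flags
    return 0  # below every tier: no autoflags in this sketch
```

For example, with made-up weights, a post caught for two existing reasons weighing 300 and 250 totals 550 and gets 3 flags; if the new reason added another 280, the total of 830 would cross the 800 tier and get 4.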
So - what do people think? Pros? Cons? Additional complications?