Charcoal-SE / SmokeDetector

Headless chatbot that detects spam and posts links to it in chatrooms for quick deletion.
https://metasmoke.erwaysoftware.com
Apache License 2.0
474 stars, 182 forks

Add command to show flagged posts that have not been deleted yet (or post automatically) #226

Closed ByteCommander closed 7 years ago

ByteCommander commented 8 years ago

Sometimes spam reports about smaller sites get less attention than they need, either because they're followed by many other reports or because only a few people are online at the time.

I would suggest that Smokey should keep a list of all reported posts that got positive feedback or no feedback at all yet and are not yet removed from the site. A command like !!/pending would then show a list of all those reports that still need more flags or feedback. Example:

"Skin care tips" by "SpamUser" on webmasters.stackexchange.com [MS] (reported 12 minutes ago, 1 tp, 0 naa, 0 fp, post score -3)
"Best essay writing service" by "Writer" on graphicdesign.stackexchange.com [MS] (reported 6 minutes ago, no feedback yet, post score -1)
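
A minimal sketch of how lines in the format above might be produced. The `Report` record and its field names are hypothetical, chosen only for illustration; they are not Smokey's or metasmoke's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Report:
    # Hypothetical record; field names are illustrative only.
    title: str
    author: str
    site: str
    minutes_ago: int
    tp: int
    naa: int
    fp: int
    score: int


def format_pending(report: Report) -> str:
    """Render one pending report in the style of the example lines."""
    if report.tp + report.naa + report.fp == 0:
        feedback = "no feedback yet"
    else:
        feedback = f"{report.tp} tp, {report.naa} naa, {report.fp} fp"
    return (
        f'"{report.title}" by "{report.author}" on {report.site} [MS] '
        f"(reported {report.minutes_ago} minutes ago, {feedback}, "
        f"post score {report.score})"
    )
```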

This would be very helpful to make sure no reports slip through and to verify if anything needs more flags after a bunch of reports appeared without having to walk through the links manually.

Additionally, it might be useful to post this report not only on demand but also automatically for posts in the list that were reported more than, say, 10 minutes ago.
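
The automatic variant could be sketched as a simple age filter. This assumes each report is a dict with a timezone-aware `reported_at` timestamp; that key and the 10-minute default are illustrative assumptions, not an existing Smokey setting.

```python
from datetime import datetime, timedelta, timezone


def overdue(reports, now, threshold=timedelta(minutes=10)):
    """Return the reports whose 'reported_at' lies more than `threshold` before `now`."""
    return [r for r in reports if now - r["reported_at"] > threshold]
```

Passing `now` in explicitly, rather than calling `datetime.now()` inside the function, keeps the filter easy to test.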

ArtOfCode- commented 8 years ago

Might make sense to do this as an MS API route, with Smokey just requesting that when the command gets hit.

csnardi commented 8 years ago

I don't really like this idea. Anything we make would be highly highly inaccurate -- deletion data is recorded a lot of the time, but not if a restart occurs, not if the post was deleted too quickly, and not if the post was deleted 20+ minutes after being reported. Also, if a post is not deleted after 10-15 minutes, it's probably fairly borderline. Most blatant spam will be deleted within 10-15 minutes; I'm not sure we really want to encourage spam flags on non-blatant spam.

AWegnerGitHub commented 8 years ago

I agree. I think this is going to generate a lot of noise and false positives.

On Sep 30, 2016 2:19 PM, "hichris1234" notifications@github.com wrote:

I don't really like this idea. Anything we make would be highly highly inaccurate -- deletion data is recorded a lot of the time, but not if a restart occurs, not if the post was deleted too quickly, and not if the post was deleted 20+ minutes after being reported. Also, if a post is not deleted after 10-15 minutes, it's probably fairly borderline. Most blatant spam will be deleted within 10-15 minutes; I'm not sure we really want to encourage spam flags on non-blatant spam.


ByteCommander commented 8 years ago

@hichris1234 I can't really follow your technical argument. What is the problem with keeping a list of reports and their feedback, and polling the deletion status of spam candidates?

And again, it's not only old reports that get overlooked, but also reports made during rush hour, when Smokey reports a dozen posts in a short time.

I understand your doubts, since this might encourage robo-flagging of false positives. Maybe the query should then be exposed only on Metasmoke and not as a Smokey command. That would limit the number of people who see the results to a smaller circle of mostly experienced and responsible spam flaggers.

angussidney commented 8 years ago

This is a great idea; I've been thinking about something similar myself for a couple of weeks, but wasn't sure how it would be implemented. The command idea makes a lot of sense, though.

AFAIK, MS already keeps a record of post deletion data, so it shouldn't be too hard to implement.

csnardi commented 8 years ago

@ByteCommander We do have deletion data to some extent. It's just not completely accurate -- it was never designed to be. I'm not too convinced that this is a problem we need to solve. I think having a command like this would be noisy -- and do note that only about 30 meaningful reports/day haven't been deleted after 5 minutes.

Undo1 commented 8 years ago

As hard data, we have metasmoke deletion log records for 910 of the last 1000 posts. It'd be fairly easy to make that >99%; we could duplicate the deletion websocket on metasmoke, among other possibilities. I wouldn't let technical considerations kill this; we can address those.

As for social issues, I'm not yet qualified to comment. Need to look at it more.

csnardi commented 8 years ago

Deletion log records? What exactly does that mean? We just throw up our arms after 20 minutes and say "this post wasn't deleted" (https://github.com/Charcoal-SE/SmokeDetector/blob/master/deletionwatcher.py#L62), even though some of those posts probably were deleted.

Undo1 commented 8 years ago

@hichris1234 It means 'there's a DeletionLog attached to the Post'. Of the last 1k true positives, 768 have a DeletionLog indicating the post was deleted.

We could overcome that by running websockets on metasmoke, of course, and it wouldn't be too hard if we wanted to do it.

ArtOfCode- commented 8 years ago

Seems to me that we're slowly growing into a pattern of rejecting otherwise-good ideas because our current technical status doesn't support them. Come on, we're programmers, technical limitations really don't matter. If someone's had a good idea, let's evaluate it based on its potential benefit/risk to us, rather than "oh but we can't do that because X" - we can do it, someone just needs to write a bit of code.

And now down off my soapbox...

I think this idea is a good one. I went ahead and implemented the API route on metasmoke to get the data; that's deployed and (I think) working. I also had a go at the Smokey command, but it looks like I've misunderstood how commands work so that just fails. If we're doing it, someone else is going to need to do it.

I see the potential benefit as greater coverage for making sure spam gets deleted. I see the potential risk as there being a slightly increased potential for robo-flagging; I don't think that's a big problem because Smokey in itself is highly at risk of robo-flagging, but that's never been a problem for us.

csnardi commented 8 years ago

Technical limitations don't matter, sure, but we have to evaluate the cost of implementing an accurate system. Personally, I think that cost outweighs the benefit this command would bring.

Currently, any way this is implemented will be inaccurate. We'd have to check every single post that we didn't record as deleted to see if it was deleted -- probably on the metasmoke side. Is it worth that cost?

And is this a problem we really need to be solving? I don't think so. Honestly, the easiest way to do this is to just click the last 10 reports and see if they need more flags. That's what I do -- it's quick, easy, and doesn't generate much noise. And if something is long-lasting and obvious spam, well, you can post a message saying that. And, as I mentioned above, only about 30 reports aren't deleted within the first 5-7 minutes. This command would only be useful for a very limited number of reports -- and I can't see the usefulness of running it every so often rather than just spending 10 seconds clicking each report.

This isn't an issue of technical limitations. It's a question of whether it's worth it and whether we should do it -- and for me, personally, the answer is no.

Undo1 commented 8 years ago

@hichris1234 I'd disagree - it'd be cheap (from an AWS-resources point of view) on the metasmoke side, and I've been looking for something to work on in metasmoke anyway so my time is free.

I see this being useful during the peak spam time, which happens to be the time when I'm sleepy. Sleepy me doesn't like clicking on the last 10 reports just to get 404s on most of them. I also don't like doing that while I'm on a crazy-restrictive datacap like I am now.

Anyway, for me, it seems worth it. Laziness is good.

ByteCommander commented 8 years ago

Laziness is the programmer's greatest motivation. :wink:

I agree with @Undo1 here, if a few dozen people have to click through the last 10 reports to get a 404 most of the time just to catch posts that potentially need more attention, I see lots of potential for laziness there.

@ArtOfCode- You said you already implemented something in Metasmoke - can we see something on the site already or is it just an API function without UI?

ArtOfCode- commented 8 years ago

@ByteCommander It was an API route; it's also rather broken at the moment (read: queries were taking 3 minutes to execute).

Wrzlprmft commented 7 years ago

Today, a blatant spam post on Graphic Design survived for two hours because it was buried under reports from bigger sites, and this is not the first time I have observed something like this. I usually notice these cases when reviewing on the respective sites. This may happen more often than we are aware of.

@hichris1234: 30 reports per day is a lot given that we have about a hundred reports per day (at least according to Metasmoke).

To avoid nuking false positives, we can exclude everything that got any feedback other than spam/abusive, i.e., FP, NAA, vandalism, …
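
As a rough sketch of that exclusion rule: assuming each report carries a list of feedback strings, and that `"tp"` stands for spam/abusive feedback (both are illustrative assumptions, not metasmoke's actual data model), a report stays eligible only if it has no feedback or exclusively `tp` feedback.

```python
# Hypothetical feedback labels; only these count as "spam/abusive".
ALLOWED_FEEDBACK = {"tp"}


def eligible(report):
    """True if the report has no feedback, or only feedback in ALLOWED_FEEDBACK."""
    return all(kind in ALLOWED_FEEDBACK for kind in report["feedback"])


def filter_eligible(reports):
    """Drop every report that received any FP, NAA, or other non-tp feedback."""
    return [r for r in reports if eligible(r)]
```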

NobodyNada commented 7 years ago

To get around deletion data on Metasmoke being incomplete, we can always do another API request to check if the post has been deleted immediately before posting it to chat.
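
One way to sketch that last-minute check, without tying it to a particular API client: `fetch_existing_ids` below is a hypothetical callable, supplied by the caller, that takes a site and a list of post IDs and is assumed to return only the IDs of posts that still exist. The `site` and `post_id` keys are likewise illustrative.

```python
from collections import defaultdict


def filter_live(reports, fetch_existing_ids):
    """Keep only reports whose post still exists according to `fetch_existing_ids`."""
    # Group post IDs by site so each site needs only one lookup.
    by_site = defaultdict(list)
    for report in reports:
        by_site[report["site"]].append(report["post_id"])

    live = set()
    for site, post_ids in by_site.items():
        for post_id in fetch_existing_ids(site, post_ids):
            live.add((site, post_id))

    # Preserve the original order of the reports.
    return [r for r in reports if (r["site"], r["post_id"]) in live]
```

Batching per site keeps the number of requests small when many reports are pending at once.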

ghost commented 7 years ago

@Wrzlprmft My idea is a separate smoke room per site.

Undo1 commented 7 years ago

@markyi370 We can already do that, and we do for some sites that request it.

Wrzlprmft commented 7 years ago

@markyi370 How would this solve this issue?

teward commented 7 years ago

@markyi370 That doesn't solve the core problem. And we already implement this, e.g. for SOCVR, for the Ask Ubuntu notices being CC'd to the Ask Ubuntu General Room, and in other cases as well.

ghost commented 7 years ago

Which is better: using MS, or having Smokey keep the database itself? I say that keeping it on Smokey is better because it is then local.

The only issue with Smokey that I see is that if we push something to the blacklist, when it restarts, it will lose that list.

Solution: Keep the list in a file, manually adding and removing entries as needed.

I'm no Python expert, but maybe something like this?

http://stackoverflow.com/q/1989251/6754053

Just my thoughts on the matter.
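
A minimal sketch of the "keep the list in a file" idea, assuming the entries are plain strings and a JSON file is acceptable. The function names and file layout are hypothetical and not taken from Smokey's code.

```python
import json
import os
import tempfile


def save_entries(path, entries):
    """Write the entries to `path` as JSON, replacing the old file in one step."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first so a crash cannot leave a half-written list.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(sorted(entries), handle)
    os.replace(tmp_path, path)


def load_entries(path):
    """Read the entries back as a set; a missing file means an empty list."""
    try:
        with open(path, encoding="utf-8") as handle:
            return set(json.load(handle))
    except FileNotFoundError:
        return set()
```

Loading at startup and saving after every addition or removal would let the list survive a restart.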

NobodyNada commented 7 years ago

@markyi370 The blacklist is already stored on Smokey and on GitHub. Are you suggesting also storing the reported posts on Smokey? I don't think that would be a good idea, because:

ghost commented 7 years ago

So we keep the database on MS, and fetch data on request.

teward commented 7 years ago

@markyi370 We already do this. I'm confused what you're suggesting we do here - the deletion watcher and such all sits on SmokeDetector...

angussidney commented 7 years ago

The 'Autoflagging Information and More' userscript takes care of this in a less noisy fashion, so this request is now redundant.