Brickimedia / brickimedia

Brickimedia Source Code
http://www.brickimedia.org

Spambots and CAPTCHA #449

Closed by neoncitylights 8 years ago

neoncitylights commented 8 years ago

I'm not the best at handling this type of thing, and certainly not the most knowledgeable on the topic. I wanted to learn more about it, though, and believed that our CAPTCHA could be a problem, both with spambots and for users, so I spent a good hour or so doing research. Here's the most important question I asked myself: keep CAPTCHA or not? (I'll obviously try to spend more time on this interesting concept after this ticket is submitted.)

Remove!

Let's say we don't want to keep CAPTCHA. What would be our alternative to tell who's a human, and who's a spambot? Would we use a different one instead of QuestyCaptcha and say, move back to ReCAPTCHA or Asirra? Or actually use something entirely different? Here's the best option that I found.

Honeypot checkboxes

Checkboxes, from what I've read, seem to be a graceful solution. You can hide one from the page with JS or CSS, so if it's checked there's a good chance a spambot checked it rather than a user (because regular users won't see it). Would it be possible to write an extension for this, or would it require dirty hacks to the MediaWiki core?
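To make the idea concrete, here is a minimal sketch of the server-side half of a honeypot checkbox, written in Python rather than as a MediaWiki extension. The field name `contact_me_by_fax` and both function names are hypothetical, and hiding the field with inline CSS is just one option.

```python
# Hypothetical field name; anything a human would never see or tick works.
HONEYPOT_FIELD = "contact_me_by_fax"


def render_signup_form() -> str:
    """Return a sign-up form with a checkbox hidden from sighted users via CSS."""
    return (
        '<form method="post">\n'
        '  <input type="text" name="username">\n'
        f'  <input type="checkbox" name="{HONEYPOT_FIELD}" style="display:none">\n'
        '  <button type="submit">Create account</button>\n'
        '</form>'
    )


def looks_like_bot(form_data: dict) -> bool:
    """Browsers only submit a checkbox when it is ticked, so the mere
    presence of the hidden field in the submission is suspicious."""
    return HONEYPOT_FIELD in form_data


print(looks_like_bot({"username": "Alice"}))
print(looks_like_bot({"username": "xrumer", HONEYPOT_FIELD: "on"}))
```

A submission flagged this way could be silently dropped or sent to a fallback CAPTCHA instead of being rejected outright, which softens the impact of false positives.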

Keep!

Then let's say, we actually do want to keep the CAPTCHA. A number of issues would need to be fixed:

I don't like CAPTCHAs. From a user standpoint, I feel as if the site doesn't trust me. They can be really confusing, and trying multiple times can lock me out of the sign-up form, preventing me from creating an account, which is the exact goal I was trying to achieve. I don't want to be required to answer things like this. As this article said...

"Stopping spam should not come at the cost of stopping users from filling out your form."

Go honeypots!

adamrobcarter commented 8 years ago

How exactly does a checkbox work in this situation?

adamrobcarter commented 8 years ago

Also what about the new recaptcha where you just press a button?

adamrobcarter commented 8 years ago

https://www.mediawiki.org/wiki/Extension:ConfirmEdit#ReCaptcha_.28NoCaptcha.29

neoncitylights commented 8 years ago

Didn't know that existed but what you referenced is how a checkbox would work.

NovaFlare commented 8 years ago

I still don't know what's wrong with the idea I had like a year ago that got shot down when we got hit by a wave of spambots- just make a new filter with AbuseFilter getting new users to confirm their first edit...

edit: wait- we have CAPTCHA? Is this just for new/unregistered users?

chriscook18 commented 8 years ago

Is the checkbox the thing where, if you click it in a "computer-like" manner, it then asks you to solve a CAPTCHA? I heard about something like that, but can't recall seeing it implemented anywhere I go.

neoncitylights commented 8 years ago

@NovaFlare Yeah, we've had the QuestyCaptcha extension on the sign-up form since August 6th, 2013 (https://github.com/Brickimedia/LocalSettings/commit/64431073dbd038f8b54c4165425285e2753c7c06), which is a little over 2 years and a couple of months.

Also, I apologize, because I honestly don't recall hearing that idea :S I'd say it's a great plan for the editing part, but the sign-up part still needs better configuration. If we could stop a lot more spambots from creating accounts in the first place, that'd be even better.

@chriscook18 Yep. The idea is to make a checkbox, hide it with either JS or CSS, and only accept the form when it's left blank. Spambots read the HTML structure of a site and are likely to add a checked attribute, but users won't fill it in because they don't see it. That's the reason why you don't see it. :P

However it does introduce accessibility issues like the current CAPTCHA we have:

This presents accessibility issues for screen reader users who have CSS disabled. If the label on the honeypot field doesn’t tell them not fill out the honeypot, they won’t know to avoid it. You could give the honeypot field a common label, such as “name”, to trick the spambot into filling it in. But it would also trick screen reader users to fill it in too. Honeypot captchas are not 100% effective at stopping spambots, nor are they accessible to all users. But they are far better for your forms, than traditional captchas. [reference]

mary-kate commented 8 years ago

Some thoughts on this...

CAPTCHAs -- at least the traditional "type in the blurry letters" etc. -- are fundamentally broken. This is hardly a secret; basically every time this discussion is taking place on the wikitech-l mailing list, someone brings up this point. And rightfully so, if I may add. Professional spammers have their ways to bypass these CAPTCHAs, either via software or by hiring real humans from poor third-world countries to essentially do their dirty work for them. Not to mention that as a user, these CAPTCHAs are very annoying and it seems that (at least with certain CAPTCHAs) they catch more legitimate users than spambots. In a MediaWiki context, this mostly means the FancyCaptcha stuff. Alas, in a complex farm setup (such as ShoutWiki), these are basically the only scalable solution.

MediaWiki already has a "honeypot checkbox" (or rather, a honeypot <input> field on edit pages). It started out as the SimpleAntiSpam extension and it was merged to core MW in 1.22.0. Of course given that that was a rather long time ago and SimpleAntiSpam itself was written way back in 2008 (!), it's reasonable to assume that spambots and other actively maintained spamming tools (such as the notorious XRumer tool) have learned to work around that. One possible solution would be to change the ID and/or name (etc.) attributes of this <input> to something random and unpredictable, but that might also require changing the respective MediaWiki i18n message...and at that point, we might be harming the one or two users who use Lynx to access Brickimedia one or two times a year. I don't think that's enough of a blocker, though.

Asirra and reCAPTCHA both relied/rely on a third-party service (Asirra was shut down in 2014), so that's essentially something to take into account. Some data is already being leaked to Google via AdSense ads, and to Quantcast via their tracking JS, but I think we should minimize our reliance on external service providers to the extent that it is reasonable -- ads are the one thing we can't easily (or rather, profitably) do ourselves; analytics and CAPTCHAs are something we're capable of handling on our own. Historically reCAPTCHA has been broken at least a few times, and even if the new version of reCAPTCHA would be "safe" enough...well, it's still annoying and confusing. "Click on all the pictures with green foods" or whatever sounds simple enough until you're presented with an array of low-resolution pics from which it's hard to figure out if the pic matches Google's definition of "green food" or not.

As much as I'm against proprietary solutions...anti-spam is one such case where you might have to consider, "who needs to access this code? will having this public harm my site/users more than it does good?". In 2008 Tim Starling wrote the AntiBot extension, to "allow for private development and limited collaboration on filters for common spam tools". Despite being used on Wikimedia sites, I'm not sure if there are any active(ly maintained) AntiBot plugins anymore. Years ago something called XRumerCookieBug.php existed, but I have no idea what it did, as its source code wasn't available to non-WMF staff users, and I've been told that it hasn't been used in years.

AntiBot isn't the only private anti-spambot solution for MediaWiki, though. For ShoutWiki I've written two unpublished extensions (EmailCheck and LastMeasure) to filter out certain kinds of harmful edits/other actions. Are they perfect? Definitely not. Do they filter out a fair amount of unwanted behavior? I believe so. ShoutWiki has almost 7k wikis and while every day some spambots make it past the filters, I'm confident in saying that the amount would be a lot higher without these extensions. If Brickimedia has interest in these tools, I'm more than happy to share the sources with @UltrasonicNXT and/or other parties, provided that we keep 'em out from public repositories for the time being.

So what should be done? Implement new anti-spam filters and enhance old ones; sadly I'm afraid we'll have to keep the CAPTCHA system but let's make it better. An on-wiki special page editor for editing QuestyCaptcha entries would be a significant improvement, as it'd allow removing the questions and their answers from the public LocalSettings files, and it'd also transfer some control from sysadmins to on-wiki admins, as it should be. Can we consider implementing that? I believe many external users of QuestyCaptcha would also love this.

tl;dr: I can't write quick summaries, but I have thoughts on anti-spam things based on my years of experience in combating spambots and other malicious users.

neoncitylights commented 8 years ago

@mary-kate Oh wow, I can't believe that the checkbox method is actually that old :S

Although I do support open-source projects, there are exceptions where closed source can be better, and I'd say this is one of them. Open-source projects allow for collaboration and transparency, but that transparency can be abused. If spammers can read the code, it's probably better to hide it. The biggest con of this, however (like you said with the AntiBot plugins), is that it's hard to tell whether the code is actively developed, and then it's hard to know whether you should trust or use such software.

But maybe... there could still be a little peek? It'd be a version change log: a small peek, but enough to know "OK, people are looking at this code, they're changing it, and continuously improving it." There could be a schedule, say every 4-5 days or every week, for a new update, and that suggestion could be worked around your schedules.

Or is what Thomas Grey said really true in this situation?

neoncitylights commented 8 years ago

Any reason why LEGO Ideas Wiki gets so many more hits than the other wikis? This is concerning. :cold_sweat: (It was actually worse around late summer 2015, when on top of this many spambot accounts being created there were also 100+ spam articles created daily, but this is the worst it has been recently.) http://ideas.brickimedia.org/index.php?title=Special:RecentChanges&limit=500&days=30

[screenshots of LEGO Ideas Wiki recent changes]

neoncitylights commented 8 years ago

I already raised that on the admin wiki once at http://admin.brickimedia.org/wiki/Spambots_at_LEGO_Ideas_Wiki and I didn't receive much of an answer besides CU'ing, globally blocking IP addresses, and deleting the pages, since I have admin rights. I've done that so many times, though, and they just keep on coming, which gets annoying if I do say so myself, and I'm sure LEGOSuperDKong is probably annoyed too. I disabled account creation in https://github.com/Brickimedia/LocalSettings/commit/a84a7dfd6015fb1a65e439250503619c242a7e38 for now, but hopefully we can work out a non-dirty solution :+1:

mary-kate commented 8 years ago

Basically if and when you have a public (MediaWiki) wiki, the question is never "will my wiki end up on a spammer's hitlist?" but rather "when will my wiki end up on a spammer's hitlist?". This is the sad reality we live in.

That being said, the disaster we're looking at in that screenshot seems like a sum of many things. One of these things could be the aforementioned "bots can bypass our QuestyCaptcha questions". But that's not all. Special:NewPages on ideas is a mess, but looking at it the last legitimate article created before these spambot attacks was "First Period 2015" by LEGOSuperDKong on 03:07, 29 September 2015. After that the attacks started on 27 January 2016. Actually on that day, only one spam page was created (by this account). And then more and more spam, all over the place...

Some quick database queries against the shared user table suggest that our global email blacklist isn't working as intended. As recently as today, an account (UID #24112) was registered with a discardmail.com email address, despite the fact that that domain has been on the blacklist since the very creation of that blacklist page (29 May 2015!). @UltrasonicNXT, could you take a look at this and see if you're able to figure out what's going on here? Is this just a local config issue of some kind or are we talking about a real code bug? (While sites like Wikimedia wikis etc. make heavy use of SpamBlacklist's actual functionality, the spam blacklist for blocking certain external links, I'm not aware of too many sites that actually use the email blacklisting feature, so it's entirely possible that it hasn't been tested adequately and as such, something somewhere has subtly broken it.)

I believe that fixing the email blacklist will decrease the amount of spam-only account registrations. Today 121 accounts have been registered, and judging by details such as username and email address, I'd say they are all spambot accounts. A working email blacklist would've stopped about 20 of these accounts.

Figuring out how to handle yahoo.com as well as the various .ru domains is a different challenge. On ShoutWiki the EmailCheck extension forces users of certain free email services (including, but not limited to, Yahoo!) to confirm their email before they're allowed to edit -- a small hindrance for both legitimate users and spambots, but it's something. I'm also thinking of requiring users who have a .ru email address to confirm their email address if the wiki's content language isn't Russian (ru) -- on Brickimedia this'd mean basically "if you have a *.ru email address, you need to confirm your email before being allowed to edit" because right now Brickimedia doesn't have any Russian wikis. Thoughts?
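The rules described above can be sketched roughly as follows. This is an illustrative Python sketch only, not how SpamBlacklist or the unpublished EmailCheck extension is implemented; the domain sets, the function names, and the three return values are assumptions made for the example.

```python
# Illustrative data only; the real blacklist lives on a wiki page.
BLACKLISTED_DOMAINS = {"discardmail.com"}
# Free email services whose users must confirm before editing (assumed list).
CONFIRM_FIRST_DOMAINS = {"yahoo.com"}


def email_domain(address: str) -> str:
    """Return the lower-cased part after the last '@'."""
    return address.rsplit("@", 1)[-1].strip().lower()


def registration_decision(address: str, wiki_language: str = "en") -> str:
    """Return 'reject', 'confirm_first' or 'allow' for a sign-up email address."""
    domain = email_domain(address)
    # Match the blacklisted domain itself and any of its subdomains.
    if any(domain == d or domain.endswith("." + d) for d in BLACKLISTED_DOMAINS):
        return "reject"
    if domain in CONFIRM_FIRST_DOMAINS:
        return "confirm_first"
    # *.ru addresses must confirm unless the wiki's content language is Russian.
    if domain.endswith(".ru") and wiki_language != "ru":
        return "confirm_first"
    return "allow"


print(registration_decision("bot@discardmail.com"))
print(registration_decision("someone@mail.ru", wiki_language="en"))
```

Normalising case before comparing, as `email_domain` does, is a deliberate choice here so that `Bot@DiscardMail.com` does not slip through.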

P.S. I would like to state for the record that discussing things like these, which require developer/sysadmin input, on a closed wiki that not all of us devs/sysadmins can access isn't the best idea. Unless the definition of "admin" in "admin.brickimedia.org" is broadened to include developers/sysadmins, too.

LEGOSuperDKong commented 8 years ago

Actually, the attacks have been ongoing for months. Special:NewPages doesn't show any spam pages before January 27 because Codyn and I have been continually deleting them. January 27 just marks what I haven't cleaned up yet.

neoncitylights commented 8 years ago

@mary-kate As LSDK said, we just deleted them, so they should appear in the deletion log instead.

The idea for email checking sounds good. Question about your earlier comment: would it be possible to add a button to the special page editor that could send the questions to TranslateWiki.net? I'm not familiar with how that site works, or whether they have an API so the data could be sent from our wiki to theirs, so insight would be great :smile:

Also, sorry Jack! I only posted a discussion that required sysadmins and devs like that once on the admin wiki and I haven't done it since (although I will still admit that it would have been way better if I had raised it on GitHub)

(in case anyone is wondering, WAI-ARIA stands for Web Accessibility Initiative - Accessible Rich Internet Applications, and a11y stands for accessibility, similar acronyms to things like i18n - internationalisation or l10n - localisation, for longer words)

neoncitylights commented 8 years ago

@mary-kate @georgebarnick @UltrasonicNXT Would we benefit more from moving this discussion to existing tasks (or creating new tasks) on Phabricator, from keeping our discussion on GitHub, or from both? Having to track discussions on two sites means more effort and might also make us lose track

mary-kate commented 8 years ago

For the record: we discussed the whole GitHub vs. Wikimedia hosting question a bit on Brickipedia's chat yesterday. While no consensus on what -- if anything -- we should do about it was established, it's definitely something we should talk about in the future. Perhaps we should open a separate ticket for that debate?

On-topic...

I whacked a few spambots across multiple wikis. On Ideas one IP address (85.203.17.103) had been used to register 37 (!) spam accounts. While from a spammer's point of view, IP addresses are easy enough to acquire that discarding an IP after a single spam message is a viable strategy, we really shouldn't allow a gazillion registrations from the same IP (unless the person who's registering the accounts is at least somehow confirmed to be a human, or better yet, an administrator of some kind (yes, admins qualify as humans, too ;-)). The EmailCheck extension by ShoutWiki implements some measures to combat this kind of behavior and forces the spambots to get a bit more creative regarding IP addresses.
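One way to cap registrations per IP is a sliding-window counter. The sketch below is a generic Python illustration, not the EmailCheck implementation (which is unpublished); the class name, the limit of 3 accounts, and the 24-hour window are assumptions chosen for the example.

```python
from collections import defaultdict, deque


class RegistrationLimiter:
    """Allow at most `max_accounts` registrations per IP within `window_seconds`."""

    def __init__(self, max_accounts: int = 3, window_seconds: int = 86400):
        self.max_accounts = max_accounts
        self.window_seconds = window_seconds
        self._seen = defaultdict(deque)  # ip -> timestamps of accepted sign-ups

    def allow(self, ip: str, now: float) -> bool:
        recent = self._seen[ip]
        # Forget sign-ups that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_accounts:
            return False
        recent.append(now)
        return True


limiter = RegistrationLimiter(max_accounts=3, window_seconds=86400)
results = [limiter.allow("85.203.17.103", now=i * 60) for i in range(37)]
print(results.count(True), "accepted,", results.count(False), "refused")
```

A confirmed human or an administrator could be exempted by skipping the `allow` check for them, which matches the exception described above.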

The original private, WMF-exclusive AntiBot plugins were also finally published, "only" about eight years after they were written. legoktm noted that they probably don't work.

There's also the StopForumSpam extension by legoktm and Skizzerz. It's used on Uncyclomedia sites (Uncyclopedia and Illogicopedia) and based on the IRC feed, it stops countless spam attempts every day.

tl;dr: We need to do something now. While "fixing" this bug is probably about as possible as fixing upstream bug #1 (nowadays task 2001), taking some steps to slow down the ongoing spam campaign is still very much possible and necessary.

neoncitylights commented 8 years ago

Yeah, I'll move that into a new ticket so it's easier to track and since it's of a different topic.

How could the EmailCheck extension be installed here? The only way I can figure out is to put it in a private repository, which would require getting a Bronze organisation plan at 25 USD per month. That looks like a ridiculous price, considering we only need 1 private repo.

Also, it looks like StopForumSpam's latest update was almost a full 2 years ago, so how is it guaranteed to work? That's a lot of time for spambots to figure out how to work around anti-spam methods.

mary-kate commented 8 years ago

I've asked @lewiscawte about setting up a new, restricted code repository on ShoutWiki's server. That way we don't have to pay GitHub just for the sake of one non-public repo and we're able to collaborate on the extension in a mutually beneficial way without having to expose the sources to potential spammers.

As for StopForumSpam, it's essentially an API to the StopForumSpam.com service. The StopForumSpam.com service is used by various people across the globe to report IPs/usernames/email addresses which were used to spam their site, be it a blog, a wiki or some other kind of a site. As far as I'm aware, the MediaWiki extension is perfectly functional and the only reason it hasn't been updated recently is that the extension is feature-complete. I'll let @legoktm and @Skizzerz chime in with any further input on this matter.
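As a rough illustration of what a client of such a service does with a lookup result, here is a small Python sketch. It makes no network call and is not the extension's code. The response shape used here (a `success` flag plus per-type records carrying `appears` and `frequency`) is an assumption about the service's JSON output and should be checked against its documentation; the function name and threshold are hypothetical.

```python
def should_block(response: dict, min_frequency: int = 1) -> bool:
    """Decide whether a sign-up should be blocked, given a parsed lookup result.

    Assumed shape: {"success": 1, "ip": {"appears": 1, "frequency": 12}, ...}
    """
    if not response.get("success"):
        # Fail open: if the lookup failed, do not punish the user for it.
        return False
    for kind in ("ip", "email", "username"):
        record = response.get(kind)
        if not isinstance(record, dict):
            continue
        if record.get("appears") and int(record.get("frequency", 0)) >= min_frequency:
            return True
    return False


reported = {"success": 1, "ip": {"appears": 1, "frequency": 12}}
clean = {"success": 1, "ip": {"appears": 0, "frequency": 0}}
print(should_block(reported), should_block(clean))
```

Raising `min_frequency` trades fewer false positives for letting through addresses that have only been reported once or twice.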

neoncitylights commented 8 years ago

Aha great, thanks :) to both of you @mary-kate and @lewiscawte please notify us when that has been finished :+1:

georgebarnick commented 8 years ago

Question, are we able to set the repository up on Brickimedia's server instead? Just for the sake of everything being together?

On Sun, Feb 7, 2016 at 1:17 PM Cody Nguyen notifications@github.com wrote:

Aha great, thanks :) to both of you @mary-kate https://github.com/mary-kate and @lewiscawte https://github.com/lewiscawte please notify us when that has been finished [image: :+1:]


neoncitylights commented 8 years ago

@georgebarnick I don't believe so, like I mentioned earlier the only idea I could come up with is buying the Bronze pricing plan for the organisation (which we shouldn't do). There could be other things I don't know about however

georgebarnick commented 8 years ago

If we're setting up git repositories on ShoutWiki's server we could do the exact same thing on our server

legoktm commented 8 years ago

As far as I'm aware, the MediaWiki extension is perfectly functional and the only reason it hasn't been updated recently is that the extension is feature-complete.

Yep, that. :-)

neoncitylights commented 8 years ago

Does anyone remember what it was like when we had Phalanx, an anti-spam extension? (I don't remember unfortunately but I know we had it on our wikis once)

@mary-kate: I know you wrote it and it says it has some specific codebits for ShoutWiki which you work for, but would it be worth it to rewrite it for wikis on general MediaWiki setups, or would that take too long? Is it still an effective method for anti-spam?

georgebarnick commented 8 years ago

It sucked and never worked or served any use

On Sun, Feb 7, 2016 at 10:51 PM Cody Nguyen notifications@github.com wrote:

Does anyone remember what it was like when we had Phalanx and uninstalled? (I don't remember unfortunately but I know we had it on our wikis once)


neoncitylights commented 8 years ago

Aha, sorry for asking. I actually do remember it now, and that we have AbuseFilter instead; we just need to improve on the current filters that have been written

georgebarnick commented 8 years ago

Spambot edits shouldn't be an issue for a while now on Brickimedia projects now that I've added a new abuse filter that's stopped 100% of spam actions since added. This doesn't prevent spambots from being created however, as @codynguyen1116 expressed concern about in his screenshots of LEGO Ideas Wiki recent changes, but it should render any and all spambot accounts useless. I don't necessarily see this as a permanent solution, and once an effective CAPTCHA is put in place, I think this new filter could be disabled as unnecessary at that point.

neoncitylights commented 8 years ago

Thank you, that filter will be a great addition to our wikis. :) :+1:

lewiscawte commented 8 years ago

While technically possible, you shouldn't. When @mary-kate says the ShoutWiki server, that's in reference to a server we have dedicated for little random things like this.

Prior to that recent purchase, we had an entire VPS dedicated to hosting our existing, private code base.

Setting up the serving for this took a fair amount of my time and sysadmin knowledge to not block all access to the machine.

Let's not continue to overload the one VPS Brickimedia has...

On Sun, 7 Feb 2016 18:42 George Barnick notifications@github.com wrote:

If we're setting up git repositories on ShoutWiki's server we could do the exact same thing on our server


adamrobcarter commented 8 years ago

Warning: This filter was automatically disabled as a safety measure. It reached the limit of matching more than 5.00% of actions.

Also George can you explain exactly what this filter is doing? Looks a bit odd to me

georgebarnick commented 8 years ago

Prevents new users from adding external links that aren't to brickimedia.org, greatballcontraption.com, or wikia.com as their first couple edits. It effectively prevents 100% of spambots and has yet to have a false positive, and if it does it's easy enough for a new user to make 2 edits that don't include an external link to bypass the filter. No spambot does that however. The fact that it was matching more than 5% of actions just goes to show how much spam we'd otherwise be deleting...

NovaFlare commented 8 years ago

Any chance that having it as just Warn instead of Disallow would do the same thing but let any legitimate users share a link? I know most spambots give up after a warn, but I don't know if it blocks 100% of them. Good that we're not being hit by spambots in the last few days though :)

georgebarnick commented 8 years ago

Well part of the problem that the other filters were facing was that spambots were learning how to get past a mere warning and save their edits anyways. However, those filters can't be set to disallow because there's too much room for false positives, which is where this filter comes into play. No false positives, just a requirement that if someone wants to add a real external link that's outside the whitelisted domains (currently just brickimedia.org, greatballcontraption.com, and wikia.com), they have to make 2 other edits first. These could be as simple as adding a space to their user page twice, or actually making two contributions to articles. I can't imagine many people would be adding external links as their first two edits who couldn't wait till the third.
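For readers who want the rule spelled out, here is a Python sketch of the logic as described in this thread. It is not the actual AbuseFilter rule, which is not quoted here; the function names, the URL regex, and the choice to let subdomains of whitelisted domains pass are assumptions.

```python
import re
from urllib.parse import urlparse

WHITELISTED_DOMAINS = ("brickimedia.org", "greatballcontraption.com", "wikia.com")
MIN_EDITS_BEFORE_LINKS = 2
# Deliberately simple URL matcher for the sketch; stops at whitespace and brackets.
URL_PATTERN = re.compile(r'https?://[^\s\[\]<>"]+', re.IGNORECASE)


def is_whitelisted(url: str) -> bool:
    """True if the URL's host is a whitelisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in WHITELISTED_DOMAINS)


def should_disallow(user_edit_count: int, added_text: str) -> bool:
    """Disallow the edit if a user with fewer than two edits adds a
    non-whitelisted external link."""
    if user_edit_count >= MIN_EDITS_BEFORE_LINKS:
        return False
    return any(not is_whitelisted(url) for url in URL_PATTERN.findall(added_text))


print(should_disallow(0, "Buy now at http://cheap-pills.example/"))
print(should_disallow(0, "See http://en.brickimedia.org/wiki/Main_Page"))
print(should_disallow(2, "My site: http://cheap-pills.example/"))
```

Comparing on the parsed host rather than on a substring of the text keeps a look-alike such as `evilbrickimedia.org` from passing as whitelisted.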

MacFan4000 commented 8 years ago

ok

neoncitylights commented 8 years ago

There's a sudden surge of new accounts at Brickipedia which look like spambots. I checked the AbuseFilter log, and they've been triggering a lot of the filters. Any ideas on how to fix this?

MacFan4000 commented 8 years ago

I would go with ReCaptcha NoCaptcha. Also see https://en.brickimedia.org/wiki/Brickipedia:Forum#.28Somehow.29_stop_spambots_from_creating_accounts

neoncitylights commented 8 years ago

Superseded by https://phabricator.wikimedia.org/T136955