swearword blocker incorrectly blocks some words containing non-English characters

Alys commented 4 years ago

The swearword blocker can prevent an innocent word being posted if the word contains:

a non-English letter that the blocker interprets as a word boundary and
a syllable that the blocker interprets as a swear word.

For example, the name "Assétou" cannot be posted in the Tavern or public guilds. The swearword blocker thinks that the three letters before the "é" character are a stand-alone word, and that word happens to be in the list of words that aren't allowed.

To replicate the bug:

Go to the Tavern on the website.
Try to make a post that contains "Assétou" (and optionally any other words).
Observe that your message is not posted and you see an error message ("Oops! Looks like this post contains a swearword, religious oath, or reference to an addictive substance or adult topic (Assé). Habitica has users from all backgrounds, so we keep our chat very clean. Feel free to edit your message so you can post it!")
You can repeat this test as often as you need to. There's no penalty for triggering this particular type of blocked message error.

The expected behaviour is that "Assétou" should be able to be posted with no error message.

However it should not be possible to post a swear word followed by a space or punctuation. For example, TESTPLACEHOLDERSWEARWORDHERE is a swear word (see below for why). You should not be able to post any of the examples below nor others like them that use any other punctuation marks:

" TESTPLACEHOLDERSWEARWORDHERE "
"TESTPLACEHOLDERSWEARWORDHERE."
"TESTPLACEHOLDERSWEARWORDHERE, "
"(TESTPLACEHOLDERSWEARWORDHERE)" That's currently working correctly (you currently cannot post those examples) and it needs to keep working like that.

You should be able to post this:

"TESTPLACEHOLDERSWEARWORDHEREé" (that's technically the same situation as the "Assétou" example)

I'm making this a high priority because it's not good that our swear word blocker is preventing legitimate non-English names being posted.

Below are details about how the swear word blocker is implemented.

The code that prevents swear words being posted is here: https://github.com/HabitRPG/habitica/blob/679c8f725b8487b25d9d4d0470e8dd5716e64459/website/server/controllers/api-v3/chat.js#L174-L181 and here: https://github.com/HabitRPG/habitica/blob/679c8f725b8487b25d9d4d0470e8dd5716e64459/website/server/controllers/api-v3/chat.js#L91-L93 and here: https://github.com/HabitRPG/habitica/blob/679c8f725b8487b25d9d4d0470e8dd5716e64459/website/server/libs/stringUtils.js#L5-L19

None of those files contain the list of banned swear words, so it's safe to read all those files.

The list of banned swear words is in the website/server/libs/bannedWords.js file (CONTENT WARNING: it contains unpleasant words!). The top part of that file is safe to read - it has comments explaining the format of the file, but all you really need to know is that the swear words are stored in an array called bannedWords, with one swear word in each array element. The array is exported with module.exports = bannedWords; Here's the important parts of that website/server/libs/bannedWords.js file:

const bannedWords = [
  'TESTPLACEHOLDERSWEARWORDHERE',
  'TESTPLACEHOLDERSWEARWORDHERE1',
  // swear words are here
];

export default bannedWords;

'TESTPLACEHOLDERSWEARWORDHERE' and 'TESTPLACEHOLDERSWEARWORDHERE1' are actual words in the website/server/libs/bannedWords.js file. They behave just like real swear words so if you try to post them to the Tavern you'll see the swear word warning. They're in that file so that you can test the swear word blocker without needing to use a real swear word. Using either of those test words will not cause your chat permissions to be removed.

NOTE:

You should not need to modify the website/server/libs/bannedWords.js file that contains the swear words. Please don't commit any changes to that file or let any swear words appear in your pull request or comments on this issue.

Habitica tries to keep all of its public spaces as safe places for people of all ages and backgrounds, and that includes the issues and pull requests in our GitHub repositories. We try to avoid swearing here just like we do in Habitica itself.

If you need to refer to swear words in examples, or test code, etc, you can use TESTPLACEHOLDERSWEARWORDHERE. For example, if you do git grep TESTPLACEHOLDERSWEARWORDHERE you'll see that some tests use that word.

Shuyinsama commented 4 years ago

This is, and will always be, a challenge I guess. Since in this case you would not like to censor since the word is not secluded or on its own.

But adding checks for this like saying if there is no space after or before. What to do with things like anTESTPLACEHOLDERSWEARWORDHERE or oTESTPLACEHOLDERSWEARWORDHERE. You still would like to match those situations. I'm not sure if the regular expression is gonna cut it.

p.s: Sorry for any potential profanity in this comment. It is kind of part of the discussion :)

Alys commented 4 years ago

@Shuyinsama Thanks for your comment! Just to let you know, I've edited your comment to replace the swearwords with TESTPLACEHOLDERSWEARWORDHERE and I've clarified the top post to indicate that it can be used in issue comments as well as PR comments. :) Sorry about the uncertainty there!

You're right that the examples you provided cause a moderation problem but there's no need for the swear word blocker to block them. We deal with them manually when posts containing them are reported to us. We actually don't want the blocker to handle them automatically because it would increase the number of legitimate words that can't be posted, especially since languages other than English are used in Habitica too.

For example, if Assétou's name were spelled "Assetou", we would not want that to be blocked.

A word from the list of swear words should be blocked only if it's surrounded by whitespace, start/end of the text string, or punctuation, ideally punctuation used in all locales. I guess another way of looking at it is that word boundaries should be locale-sensitive.

Shuyinsama commented 4 years ago

@Alys Thank you for the edits and the clarifications.

I did a quick mess around while I was on my way home from work https://regex101.com/r/zGD2Li/1

It seems we need to edit the regex a bit, this seems to be working but needs some more testing too with some various swearwords. I will be able to edit the files somewhere tomorrow in my local installation to test it agains multiple things.

Shuyinsama commented 4 years ago

It seems like my change was not enough and completely correct. The tests for the stringUtils fails on one word and it seems the sweawords are incorrectly banned in the tavern chat.

This requires some more work.

Alys commented 4 years ago

@Shuyinsama Thanks for continuing to look into this! I've marked it as in progress for you, but if you decide you'd rather not keep on with it, just let us know and we'll set it back to help wanted. You're also welcome to keep posting here if you have any questions or would like to keep discussing options!

Shuyinsama commented 4 years ago

@Alys Due to a personal leave I will be able to work on this in the weekend of July 4th.

I had another quick look today and this change may need some refactoring in the stringUtils file. It might need something more as I have a feeling a change to the regex will not be enough. This needs something more robust that can act as an actual swearword blocker. Before I dive in head first I first would like to investigate some options since this will affect quite a lot of things and is one of the generic features of Habitica. So I want to make sure that we choose the right path for this fix.

Are there any places I might need to check in the site too besides the Tavern? I guess I need to test all the chat screens but probably also things like Bio and name inputs for example?

Alys commented 4 years ago

@Shuyinsama The swear word blocker applies in the Tavern and almost all public guilds. The same code is used for all those locations so doing manual tests in just the Tavern on your local install should be sufficient but you could create a public guild locally too (you can be sure that any public guild you create locally will have the swear word blocker applied to it).

It also applies to Usernames, and the getMatchesByWordArray function is used for them too.

The swear word blocker doesn't currently apply to player's profile or Display Name but we have an issue for that and someone is working on it: https://github.com/HabitRPG/habitica/issues/11865

What I was originally thinking with this issue is that the regular expression could be adjusted to identify word boundaries in a more locale-aware way. There's some hints about making that possible at https://stackoverflow.com/questions/15861088/regex-to-match-only-language-chars-all-language and https://stackoverflow.com/questions/150033/regular-expression-to-match-non-ascii-characters/873600#873600 and probably other sites. I haven't looked into it in detail though so I'm not sure exactly what's possible. Let us know what you have in mind once you've looked at the options you've been thinking of! Thanks!

shanaqui commented 4 years ago

@Shuyinsama Hi there! Are you still working on this? If not, or if we don't hear from you within a week, I'll mark this issue as help wanted again. :)

shanaqui commented 4 years ago

@Shuyinsama I've marked this issue as "Help wanted" again so someone else can pick it up. You're welcome to pick it back up yourself as well, just let us know. :)

bigsee commented 4 years ago

@shanaqui I'd be happy to have a look if @Shuyinsama no longer has time?

Alys commented 4 years ago

@bigsee Thanks! Please go ahead and post here if you have questions or run into trouble!

bigsee commented 4 years ago

@Alys I managed to block my test user locally during testing and got chat privileges revoked. I've set up a second test account now anyway to keep working. Is there anything I can do to restore access to my test accounts if this happens locally or do I need to go through the same process as if it happened for real?

bigsee commented 4 years ago

Made a PR for this 👆🏽

SabreCat commented 4 years ago

@bigsee Chat muting is controlled by flags.chatRevoked, and login ban is auth.blocked, both in the user record. You can set those values as needed in your local database, when testing!

Alys commented 4 years ago

@bigsee You can also make one of your local accounts an admin and then you can use the Hall of Heroes to control muting, blocking, etc in all user accounts. See the "Testing your Change Manually" section in Using Your Local Install to Modify Habitica's Website and API - the debug menu lets you make an account an admin (see the final bullet point in that section for details).

bigsee commented 4 years ago

@SabreCat @Alys thank you for the tips!

shanaqui commented 4 years ago

Hi @bigsee, are you still working on this issue? If not, or if you don't reply in a week, I'll mark it as help wanted again just to make sure it doesn't go stale. :)

shanaqui commented 4 years ago

Since I didn't get a reply, I'm marking this as help wanted again -- but feel free to let us know if you'd like to pick it back up, if no one else volunteers!

bigsee commented 4 years ago

Hi @shanaqui - as with the other related ticket, the work on this was done and had a PR in. I think from memory it just needed a test adding so perhaps you could pass it back and I'll add that? Let me know if someone else has taken over though. 👍

Alys commented 4 years ago

@bigsee I've marked this as in progress for you again. Thanks!

bigsee commented 4 years ago

@Alys those tests are added now and a link to the relevant Unicode table as discussed with @paglias. Apologies again for the delay folks... 🙏🏽

HabitRPG / habitica

swearword blocker incorrectly blocks some words containing non-English characters #12309