LDNOOBW / List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words

List of Dirty, Naughty, Obscene, and Otherwise Bad Words
Creative Commons Attribution 4.0 International

New list format suggestion #11

Open muellermartin opened 10 years ago

muellermartin commented 10 years ago

I think the list format is too simple and is missing some features:

  1. weighting/score: some words are unmistakably insulting/bad/etc. but others are more inoffensive or even ambiguous, therefore there should be some weighting like a score from 1-10 (for very bad to slang)
  2. matching: most words have simple plural versions and some words have multiple variants (e.g. German umlauts ä, ö and ü can be expressed as ae, oe and ue or the letter ß can be written ss) or letters can be left out. Maybe this should be left to the filter implementation, but that would be difficult if you don't know the language
  3. grouping/categories: it would be nice to have some sort of grouping or categories like crime, violence, pornography, illegal drugs, insults etc. And it would be even nicer if words can be in multiple categories (because insults can be used in "dirty talk" or violence in crime…)

I'd suggest a rather simple format like CSV (delimiter-separated values, with ; as the field separator in the examples below) with individual files for groups and the word lists. For example, the groups file with unique group IDs:

#0 should be reserved for uncategorized
crime;1
violence;2
insult;3
# …

And the word list with regular expressions (you can check them with RegExr or similar tools) optionally followed by the group IDs (can be left blank or set to 0 for uncategorized, multiple groups separated by ,) and the score from 1–10 (0 or empty for unrated):

cock(?!pit);;7 # This is a nice one: matches 'cock' but not 'cockpit' (uncategorized)
idiots?;3;7 # matches 'idiot' and 'idiots'
motherfucker;3;10
rap(e|ist|ing);1,2;6 # matches 'rape', 'rapist' and 'raping' but NOT 'rap'
# …
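The proposed word-list lines could be read with a few lines of Python. This is only a minimal sketch, not an agreed format: the names `Entry` and `parse_line` are hypothetical, and it assumes that a comment starts at a `#` at the start of a line or after whitespace, that the last two `;` separate the group IDs and the score, and that patterns are matched case-insensitively.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    pattern: re.Pattern  # compiled regular expression from the first field
    groups: List[int]    # group IDs; [0] means uncategorized
    score: int           # 1-10, or 0 for unrated


def parse_line(line: str) -> Optional[Entry]:
    """Parse one 'pattern;groups;score # comment' line (hypothetical helper)."""
    # Assumption: '#' starts a comment at line start or after whitespace.
    line = re.split(r"(?:^|\s)#", line, maxsplit=1)[0].strip()
    if not line:
        return None  # blank line or comment-only line
    # Split from the right so the pattern itself may contain ';'.
    parts = line.rsplit(";", 2)
    pattern, groups_field, score_field = (parts + ["", ""])[:3]
    groups = [int(g) for g in groups_field.split(",") if g.strip()] or [0]
    score = int(score_field) if score_field.strip() else 0
    # Assumption: case-insensitive matching is wanted.
    return Entry(re.compile(pattern, re.IGNORECASE), groups, score)
```

For example, `parse_line("idiots?;3;7 # matches 'idiot' and 'idiots'")` yields an entry in group 3 with score 7, and a comment-only line yields `None`.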

A small issue with this format is that all matches of a pattern are weighted the same. Maybe sub-pattern matching could be used to rate each variant, but I don't know if this is needed (e.g. the pattern ((ass)(hole)?) results in three groups: ass, asshole and hole, and multiple comma-separated ratings would apply to the groups in order: ((ass)(hole)?);3;4,7).
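To see what the sub-pattern idea would have to work with, here is a short check using Python's `re` module. It only illustrates how capture groups are numbered (by the position of their opening parenthesis, so the outer group comes first); how ratings would map onto those groups is left open, as in the proposal.

```python
import re

PATTERN = re.compile(r"((ass)(hole)?)")

# Groups are numbered by opening parenthesis: outer group, then 'ass', then 'hole'.
long_groups = PATTERN.fullmatch("asshole").groups()

# When the optional part is absent, its group is None.
short_groups = PATTERN.fullmatch("ass").groups()

print(long_groups)   # ('asshole', 'ass', 'hole')
print(short_groups)  # ('ass', 'ass', None)
```

A rating scheme based on groups would need to decide which group "wins" when several of them match, since the outer group always matches whenever any inner one does.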

Some of the ideas (weighting and groups) were taken from this list: http://contentfilter.futuragts.com/phraselists/

What do you think?

P.S.: Somehow I feel guilty for contributing to a filter/censorship list, but I think it can be useful to some extent to keep trolls and unconstructive discussions away. I hope these lists will be used responsibly…

patch commented 10 years ago

Those are some great suggestions.

  1. weighting/score: some words are unmistakably insulting/bad/etc. but others are more inoffensive or even ambiguous, therefore there should be some weighting like a score from 1-10 (for very bad to slang)

I like this idea because it's hard to determine exactly what should and should not be on the lists, and that also changes for different use cases. We would also need a value for undefined, because it may be difficult to find reviewers for all the languages.

  2. matching: most words have simple plural versions and some words have multiple variants (e.g. German umlauts ä, ö and ü can be expressed as ae, oe and ue or the letter ß can be written ss) or letters can be left out. Maybe this should be left to the filter implementation, but that would be difficult if you don't know the language

This has been discussed in the past and we've generally decided to only include entirely different word forms including tenses and declensions when appropriate. On the extreme end, multiple different Unicode normalization forms could be included, but instead we standardize on NFC and allow the filter to handle normalization. Additionally, ß and ss can be handled by Unicode case folding. For example, in Perl fc('Fuß') eq fc('FUSS'). Examples of Unicode normalization and case folding in many programming languages can be found in my Unicode Programming Examples project. This still doesn't cover German umlauts and I think that may be a good example for including the two different forms here due to the conversion being language-specific.
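The same normalization and case folding can be sketched in Python instead of Perl. This is an illustration under stated assumptions, not the project's code: `normalize` and `german_variants` are hypothetical helper names, and the ae/oe/ue transliteration table is an assumption that only makes sense for German text.

```python
import unicodedata

# Assumption: this mapping is language-specific and should only be applied to German.
GERMAN_TRANSLIT = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def normalize(text: str) -> str:
    """NFC-normalize, case-fold, then NFC-normalize again (hypothetical helper)."""
    folded = unicodedata.normalize("NFC", text).casefold()
    return unicodedata.normalize("NFC", folded)


def german_variants(word: str) -> set:
    """Return the normalized word plus its ae/oe/ue spelling (hypothetical helper)."""
    base = normalize(word)
    return {base, base.translate(GERMAN_TRANSLIT)}


print(normalize("Fuß") == normalize("FUSS"))  # True: case folding maps ß to ss
print(normalize("a\u0308") == "ä")            # True: NFC composes a + combining diaeresis
print(sorted(german_variants("Müll")))        # ['muell', 'müll']
```

Case folding covers ß and ss without any language knowledge, while the umlaut spellings need the explicit table, which is why listing both forms for German may be the simpler option.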

I'd suggest a rather simple format like CSV (comma separated values) with individual files for groups and the word lists

We've previously talked about alternate file formats and the most popular idea has been JSON due to almost every modern programming language including a standard JSON library for parsing.
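As a rough illustration of how the proposed fields might look in JSON, here is a sketch that parses a sample document with Python's standard `json` module. The schema (the keys `groups`, `words`, `pattern` and `score`) and the helper `group_names` are assumptions made for this example only; no JSON layout has been agreed on in this thread.

```python
import json

# Hypothetical layout: group IDs map to names, each word carries groups and a score.
SAMPLE = """
{
  "groups": {"0": "uncategorized", "1": "crime", "2": "violence", "3": "insult"},
  "words": [
    {"pattern": "idiots?", "groups": [3], "score": 7},
    {"pattern": "cock(?!pit)", "groups": [], "score": 7}
  ]
}
"""

data = json.loads(SAMPLE)


def group_names(entry: dict) -> list:
    """Resolve an entry's group IDs to names; no groups means uncategorized (ID 0)."""
    ids = entry["groups"] or [0]
    # JSON object keys are always strings, so the numeric IDs are converted.
    return [data["groups"][str(group_id)] for group_id in ids]


for word in data["words"]:
    print(word["pattern"], group_names(word), word["score"])
```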

P.S.: Somehow I feel guilty for contributing to a filter/censorship list, but I think it can be useful to some extent to keep trolls and unconstructive discussions away. I hope these lists will be used responsibly…

With an open source project like this, it could really be used for almost anything, but I can tell you that I don't use it for censorship or blocking content. I use it to avoid generating offensive text in autocomplete, machine translations, etc. In my opinion, it's much better to have open source projects like this so we can benefit from collaborating with each other. If people want to censor speech, they'll find a way with or without our help.

You mentioned some other ideas as well and I'll need to think more about those. I'm the primary maintainer for this project and am preparing for a three-week vacation, so we might not make much progress in the next few weeks.

Thanks again for all your help!

muellermartin commented 10 years ago

We would also need a value for undefined, because it may be difficult to find reviewers for all the languages.

I suggested 0 as unrated ;)

This has been discussed in the past and we've generally decided to only include entirely different word forms including tenses and declensions when appropriate.

My intention was to make better matches, because explicit words are often contained in other words and such filters are rarely set to match whole words, which makes them match harmless words like classic, assumption, cockpit etc. But your suggestions on Unicode normalization sound pretty cool and remind me that I need to take a closer look at Unicode ;)
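The difference between substring matching and whole-word matching can be shown in a few lines of Python. This is a sketch for illustration; `contains_substring` and `contains_word` are hypothetical helper names, and `\b` is the word-boundary assertion of Python's `re` module.

```python
import re


def contains_substring(term: str, text: str) -> bool:
    """Naive filter: the term may appear anywhere, even inside another word."""
    return term.casefold() in text.casefold()


def contains_word(term: str, text: str) -> bool:
    """Whole-word filter: the term must be delimited by word boundaries."""
    pattern = rf"\b{re.escape(term)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


harmless = "a classic assumption"
print(contains_substring("ass", harmless))  # True: false positive
print(contains_word("ass", harmless))       # False
print(contains_word("ass", "what an ass!")) # True
```

Whole-word matching removes these false positives but misses plurals and other inflected forms, which is the gap the regular expressions in the proposed list were meant to cover.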

We've previously talked about alternate file formats and the most popular idea has been JSON due to almost every modern programming language including a standard JSON library for parsing.

I like JSON, even if it adds a bit more syntactical overhead to the definition. YAML also could be a good choice!

Thanks for addressing my concerns. I agree that open sourcing and discussing these lists in public is the better way.

michilu commented 9 years ago

grouping/categories: it would be nice to have some sort of grouping or categories like crime, violence, pornography, illegal drugs, insults etc. And it would be even nicer if words can be in multiple categories (because insults can be used in "dirty talk" or violence in crime…)

It is a nice idea. I would like to have a category for each word so that I can choose the level of the content filter.
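Choosing a filter level by category could look roughly like the sketch below. Everything here is an assumption for illustration: the `RatedWord` type, the `select_words` helper, the placeholder words, and the reading of the score as "higher means more offensive" with 0 meaning unrated. The group IDs follow the sample groups file above (1 crime, 2 violence, 3 insult, 0 uncategorized).

```python
from typing import Iterable, List, NamedTuple, Set


class RatedWord(NamedTuple):
    word: str
    groups: Set[int]  # empty set is treated as uncategorized (ID 0)
    score: int        # assumption: 1-10, higher is more offensive; 0 is unrated


def select_words(
    entries: Iterable[RatedWord],
    groups: Set[int],
    min_score: int,
    include_unrated: bool = False,
) -> List[str]:
    """Return words in any of the given groups whose score reaches min_score."""
    selected = []
    for entry in entries:
        entry_groups = entry.groups or {0}
        if not (entry_groups & groups):
            continue  # word is in none of the requested categories
        if entry.score == 0:
            if include_unrated:
                selected.append(entry.word)
        elif entry.score >= min_score:
            selected.append(entry.word)
    return selected


# Placeholder entries, not real list content.
ENTRIES = [
    RatedWord("alpha", {3}, 7),
    RatedWord("beta", {1, 2}, 6),
    RatedWord("gamma", set(), 0),
]

print(select_words(ENTRIES, {3}, 5))     # ['alpha']
print(select_words(ENTRIES, {1, 3}, 6))  # ['alpha', 'beta']
```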

Anniepoo commented 5 years ago

It would also be useful to have a flag for entries that are marginalized identities. These can be particularly double-edged, as many LGBTQ identity terms are also used as slurs, but we queer folks are understandably not happy being told the word for who we are is obscene. Different use cases might include or exclude these.