Profanity Filter - Githubissues

jharpster commented 6 years ago

Expand the profanity filters and make them multi-lingual.

Brief Description

The existing word list is inadequate to address more than the simplest profanities.

What is the motivation / use case for this feature?

Create more robust vandalism detection

What is the expected behaviour?

Consider incorporating a broader list of profanities from this list.

mvexel commented 3 years ago

I notice that the connected MapRoulette Challenge has a very high number of tasks marked as Not an Issue (false positive). As the MapRoulette superuser I am getting some complaints about the tasks in this Challenge. I would recommend that we disable this MapRoulette Challenge until the quality of the filter can be improved. Thanks.

matkoniecz commented 3 years ago

One of the glaring issues is that the MapRoulette Challenge does not list what is supposed to be the profanity.

So I have no idea whether it is a complete bug, English profanities being pattern-matched against text in other languages, or something else.

Looking at it, I am unable to spot what caused it to be reported, and I am not sure which English profanity matched here. I have not seen a single valid report in Poland.

screen02

willemarcel commented 3 years ago

@mvexel Thanks for the feedback. I have stopped updating this challenge.

@matkoniecz I'll evaluate the possibility of improving or disabling the profanity filter in the next few days.

bugdebugger commented 3 years ago

@willemarcel

I too stumbled on this problem on MapRoulette:

- You can't tell why the node was tagged with the profanity tag
- Most tasks are resolved as "Not an issue"
- I did quite a few of them myself. All were false positives

So I went digging and figured the following things out. Some of this is probably obvious if you are familiar with OSM and the code around it. I wasn't :smile:

- The tagging probably comes from mapbox/osm-compare --> profanity comparator
- It uses word lists (forked from LDNOOBW 3 years ago and never updated)
  - The lists are of questionable quality :expressionless:
- In general, the tag values (e.g. name) are checked against the word lists for multiple languages (default: en / es / de / fr / ru / zh)
  - Only tags with a language suffix are handled correctly, e.g. name:es is only checked against the Spanish word list
  - A related issue on mapbox/osm-compare didn't go anywhere
- It would be easy to change the comparator to return all flagged words plus the locale in which they are offensive, instead of true/false. But I don't know how that fits into the rest of the tech stack
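
To make the last point concrete, here is a minimal Python sketch of a check that reports each flagged word together with its locale instead of a bare true/false. It is not the osm-compare code (which is JavaScript); `WORD_LISTS`, `DEFAULT_LOCALES`, `tokenize` and `find_flagged` are hypothetical names, the word lists are tiny stand-ins, and whole-word matching on `name` tags only is an assumption.

```python
import re

# Hypothetical, tiny stand-ins for the per-locale word lists.
# "peter" (fr) and "13" (zh) are the entries mentioned in this thread;
# "palabrota" is just a placeholder for the Spanish list.
WORD_LISTS = {
    "fr": {"peter"},
    "zh": {"13"},
    "es": {"palabrota"},
}

# Assumption: locales applied to tags without a language suffix.
DEFAULT_LOCALES = ("fr", "zh", "es")


def tokenize(value):
    """Split a tag value into lowercase word tokens (assumed matching rule)."""
    return re.findall(r"\w+", value.lower())


def find_flagged(tags, word_lists=WORD_LISTS, default_locales=DEFAULT_LOCALES):
    """Return a list of {tag, word, locale} hits instead of True/False."""
    hits = []
    for key, value in tags.items():
        base, _, suffix = key.partition(":")
        if base != "name":
            continue  # assumption: only name tags are checked
        if suffix:
            # name:es is checked against the "es" list only;
            # unknown locales are skipped in this sketch.
            locales = [suffix] if suffix in word_lists else []
        else:
            # A plain name tag is checked against every default locale.
            locales = list(default_locales)
        tokens = tokenize(value)
        for locale in locales:
            for token in tokens:
                if token in word_lists[locale]:
                    hits.append({"tag": key, "word": token, "locale": locale})
    return hits
```

With output like this, a MapRoulette task could show which word matched and in which language list, so a reviewer can see at a glance that, say, a plain `name` value was flagged by the French list.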

Then I checked the word lists in all the languages I understand:

- Some lists are OK and only contain actual rude profanities or very vulgar expressions :flushed:
- But some of the lists also contain completely normal words, e.g. first names, names of vegetables, numbers (!!)
  - These words could be used (mostly in spoken language) to mean something related to e.g. reproductive organs, intercourse, etc. But some are just ridiculous

The many false positives are caused by the combination of the above findings.

Some examples of what currently happens:

- Every node/way/line everywhere with a name tag whose value contains the number 13 or the name Peter will be flagged as profanity. It is no coincidence that the screenshot from @matkoniecz shows something with "13A" in the name tag
  - 13 is in the Chinese word list (ZH)
  - Peter is in the French word list (FR)
- Every chapel in Italy will be flagged as profanity
  - cappella (Italian for chapel) is on the Italian word list
- Meanwhile, the English word list doesn't include "John", "Johnson", "Willie", "Willy" or even "Prick" etc. :stuck_out_tongue_winking_eye:
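
One way to catch entries like these before they reach the filter would be a quick audit pass over each word list. The sketch below is only an illustration: `audit_word_list` and the `benign` set are hypothetical, and a real allowlist of harmless words (first names, place types, and so on) would have to be curated per language.

```python
def audit_word_list(words, benign):
    """Return (entry, reason) pairs for entries that look like false-positive bait.

    words  -- iterable of word-list entries
    benign -- set of lowercase words known to be harmless (hypothetical allowlist)
    """
    suspicious = []
    for word in words:
        normalized = word.strip().lower()
        if normalized.isdigit():
            # Pure numbers such as "13" match house numbers and refs everywhere.
            suspicious.append((word, "numeric"))
        elif normalized in benign:
            # Ordinary names or nouns such as "Peter" or "cappella".
            suspicious.append((word, "benign word"))
    return suspicious
```

For example, `audit_word_list(["13", "Peter"], {"peter", "cappella"})` would report both entries, one as numeric and one as a benign word.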

matkoniecz commented 3 years ago

Thanks for the feedback. I have stopped updating this challenge.

Would it be possible to take it down completely or archive it?

https://maproulette.org/browse/challenges?query=profanity

screen02

It would save people using MR the time of manually marking 2800 entries as invalid.

OSMCha / osmcha-frontend

Profanity Filter #237
