bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Update flagged_words.py for Portuguese #393

Closed ruinunca closed 2 years ago

ruinunca commented 2 years ago

Removed some words that can also appear in non-sexual contexts. Also changed words that have one meaning in Brazilian Portuguese and a different one in European Portuguese (sexual in one variety and non-sexual in the other).

HugoLaurencon commented 2 years ago

Hey @ruinunca thank you very much for your kind help!

Among the words I can understand, some are still a bit problematic, for example "clitoris", "penis", "vagina", "viagra" (can be used in medical contexts), "sexo" (if it literally means "sex" then we should remove it, otherwise it might make sense to keep it), "violar" (mostly used in legal contexts, I guess). Plus the other words in Portuguese that I don't understand.

Also note that it is not useful to include English words like "anal" or "porno", since the English flagged words are added to the Portuguese flagged words.
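To illustrate the point about English entries being redundant, here is a minimal sketch. It assumes the flagged words are stored as a dict mapping a language code to a list of words and that the English list is merged into every other language; the function names and the data layout are hypothetical and may not match what `flagged_words.py` actually does.

```python
# Hypothetical layout: language code -> list of flagged words.
# The real flagged_words.py may be organised differently.

def merge_with_english(per_language, english_key="en"):
    """Return a new dict where every language also contains the English words."""
    english = set(per_language.get(english_key, []))
    return {
        lang: sorted(set(words) | english)
        for lang, words in per_language.items()
    }


def redundant_entries(per_language, lang, english_key="en"):
    """List words in `lang` that are already covered by the English list."""
    english = set(per_language.get(english_key, []))
    return sorted(set(per_language[lang]) & english)


# Placeholder data: "palavra_pt" stands in for a Portuguese-only entry.
example = {
    "en": ["anal", "porno"],
    "pt": ["anal", "porno", "palavra_pt"],
}

# Entries that could be dropped from the Portuguese list without any effect.
print(redundant_entries(example, "pt"))
# The merged Portuguese list is the same with or without those duplicates.
print(merge_with_english(example)["pt"])
```

Under that assumption, removing the duplicated entries from the Portuguese list changes nothing in the merged result, which is why they are not worth keeping there.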

As a reminder, the guidelines were: Make a list of the most frequent words that appear in pornographic materials for this language. Keep only the words associated with porn and systematically used in a sexual context. Remove words that can be used in medical (sex, sexual), scientific (ejaculation, erection), colloquial (without referring systematically to porn) (boobs, bitch, cock, pussy, fuck), or everyday contexts (suck, swallow). Remove all insults (motherfucker, dickhead). Remove all words referring to race (white, black) or sexual orientation (gay, lesbian).
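The removal rules in these guidelines can be sketched as a small review helper. This is only an illustration: the category names and example words are taken from the guidelines above, while `EXCLUDED_EXAMPLES` and `review_candidates` are hypothetical names, not part of the repository. A real review still needs a native speaker, since the exclusion sets here only contain the English examples quoted in the guidelines.

```python
# Example exclusions copied from the guidelines; a real list would be
# language-specific and curated by a native speaker.
EXCLUDED_EXAMPLES = {
    "medical": {"sex", "sexual"},
    "scientific": {"ejaculation", "erection"},
    "everyday": {"suck", "swallow"},
    "race": {"white", "black"},
    "sexual orientation": {"gay", "lesbian"},
}


def review_candidates(candidates, excluded=EXCLUDED_EXAMPLES):
    """Split candidates into kept words and removed words with their reasons."""
    kept = []
    removed = {}
    for word in candidates:
        key = word.lower()
        # Collect every excluded category this word falls into.
        reasons = [cat for cat, words in excluded.items() if key in words]
        if reasons:
            removed[word] = reasons
        else:
            kept.append(word)
    return kept, removed


kept, removed = review_candidates(["porno", "sex", "swallow"])
print(kept)
print(removed)
```

Returning the reasons alongside the removed words makes it easier to discuss borderline cases in review, such as words that are medical in one variety of a language and sexual in another.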

It would be amazing if you could review this list a little bit!