bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Update flagged_words.py for Indonesian #390

Closed. afaji closed 2 years ago

afaji commented 2 years ago

Remove incorrect words, most of which came from a literal translation. Some of them are extremely common words that are used daily, such as bola (a ball), ayam (chicken, referring to the animal or food, not the vulgar sense), or terima kasih (thank you).

Also add more vulgar and profane words (mainly collected from https://www.kaskus.co.id/thread/54d98d18118b468a558b4567/daftar-kata-kata-kotor-di-indonesia-yang-sering-diucapkan/ as reference).

HugoLaurencon commented 2 years ago

Hi @afaji, thank you very much for your work!

There are still some words in this list that shouldn't be here, like "vagina", "homo" I guess, "nazi", "negro", etc. It is important to remove these words, because we only want to keep words systematically used in pornographic content, not insults or words that can appear in a medical context (but we still want to keep sexual practices like "blowjob" or "masturbation").

The rules we defined are the following: Keep only the words associated with porn and systematically used in a sexual context. Remove words that can be used in medical (sex, sexual), scientific (ejaculation, erection), colloquial (without referring systematically to porn) (boobs, bitch, cock, pussy, fuck), or everyday contexts (suck, swallow). Remove all insults (motherfucker, dickhead). Remove all words referring to race (white, black) or sexual orientation (gay, lesbian).
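As a rough illustration of these rules (a sketch only, not the actual code in flagged_words.py), one could label each candidate word with a category and keep only the categories the rules allow. The category labels, `KEPT_CATEGORIES`, and `filter_flagged_words` below are hypothetical names; the words are the English examples quoted above.

```python
# Hypothetical category labels for the example words quoted in the rules.
EXAMPLES = {
    "blowjob": "sexual_practice",
    "masturbation": "sexual_practice",
    "sex": "medical",
    "ejaculation": "scientific",
    "boobs": "colloquial",
    "suck": "everyday",
    "motherfucker": "insult",
    "black": "race",
    "gay": "sexual_orientation",
}

# Assumption: only words systematically used in a sexual context are kept.
KEPT_CATEGORIES = {"sexual_practice"}


def filter_flagged_words(labeled_words):
    """Return the sorted words whose category is allowed by the rules."""
    return sorted(
        word
        for word, category in labeled_words.items()
        if category in KEPT_CATEGORIES
    )


print(filter_flagged_words(EXAMPLES))
```

Labeling each word by hand is the hard part; the filter itself only makes the keep/remove decision explicit and easy to review.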

For an example, see the list for English here.

Would you be OK with removing these words? Thanks in advance!!

HugoLaurencon commented 2 years ago

Sorry for closing and reopening; @jtboing updated the list here.

@jtboing do you agree with the following list added by afaji?

["bokep", "asu", "bangsat", "kampret", "memek", "ngentot", "ngewe", "onani", "coli", "colmek", "jembut", "perek", "pecun", "bencong", "banci", "jablay", "maho", "bego", "goblok", "idiot", "geblek", "sinting", "tolol"]

jtboing commented 2 years ago

I think some of these words can be kept, but a lot are insults, which should be removed. I can adjust this list to keep only the suitable words, or create a new pull request if you'd like.

HugoLaurencon commented 2 years ago

Sure, maybe you can commit directly to this PR, starting with your list and adding some of the correct words from afaji. If you don’t have the rights to do so, simply open a new PR. Thanks!!

jtboing commented 2 years ago

I couldn't commit to this PR, so I went ahead and made #395 with the appropriate changes. If possible, please add afaji as a co-author as well. Thanks.