bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

further updates to flagged_words.py for indonesian #392

Closed: jtboing closed this 2 years ago

jtboing commented 2 years ago

I have made this updated list based on #390, additionally removing words that are used in medical, scientific, and everyday contexts. I also removed insults and words referring to race and sexual orientation.

I still kept some words referring to genitals because these almost always appear in pornographic usage, and I've added some words that fit the category, based on my own knowledge and on the words in this link.

HugoLaurencon commented 2 years ago

Thank you!