IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.

Apache License 2.0

261 stars, 61 forks

Create dataset loader for Toxicity-200 #244

Closed by SamuelCahyawijaya 1 year ago

SamuelCahyawijaya commented 2 years ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?toxicity_200

Dataset	toxicity_200
Description	Toxicity-200 is a wordlist to detect toxicity in 200 languages. It contains files that include frequent words and phrases generally considered toxic because they represent: 1) frequently used profanities; 2) frequently used insults and hate speech terms, or language used to bully, denigrate, or demean; 3) pornographic terms; and 4) terms for body parts associated with sexual activity.
License	CC-BY-NC 4.0

SamuelCahyawijaya commented 1 year ago

Hi @IvanHalimP: Sorry for the confusion. I just remembered that we need a password to open the zip file. The password should be tL4nLLb; you can check it with unzip --password tL4nLLb [BCP47_code]_twl.zip

Reference: https://github.com/facebookresearch/flores/blob/main/toxicity/README.md
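For anyone who prefers to read the archive from Python instead of the command line, here is a minimal sketch. It assumes the zip uses the legacy zip encryption that Python's standard zipfile module can decrypt, and that each file inside is UTF-8 text with one entry per line; neither is confirmed in this thread. The function name read_wordlist is hypothetical.

```python
import zipfile

# Password quoted in the comment above.
PASSWORD = b"tL4nLLb"


def read_wordlist(zip_path, password=PASSWORD):
    """Collect the non-empty lines of every file inside the zip.

    Assumption: members are UTF-8 text files, one word or phrase per line.
    """
    words = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.endswith("/"):
                continue  # skip directory entries
            # pwd is only used when the member is actually encrypted
            with archive.open(name, pwd=password) as member:
                text = member.read().decode("utf-8")
            words.extend(
                line.strip() for line in text.splitlines() if line.strip()
            )
    return words
```

If the archive turns out to use an encryption method zipfile does not support, opening a member will raise an error, and the unzip command above remains the fallback.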

holylovenia commented 1 year ago

Hello, I'm labelling this issue as source-schema-only since the dataset only provides a toxic wordlist (and we don't have a specified schema for wordlists/lexicons). Please implement the source schema for the ind, ace, bjn, bug, and jav languages with this feature structure: id and toxic_word.
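As a rough illustration of the requested feature structure, the sketch below turns a list of words into records with id and toxic_word, yielding (key, example) pairs in the style of a Hugging Face datasets _generate_examples method. The function name generate_examples and the choice of a string id are assumptions, not something specified in this thread.

```python
from typing import Dict, Iterable, Iterator, Tuple

# Language codes listed in the comment above.
LANGUAGES = ["ind", "ace", "bjn", "bug", "jav"]


def generate_examples(
    words: Iterable[str],
) -> Iterator[Tuple[int, Dict[str, str]]]:
    """Yield (key, example) pairs with the fields id and toxic_word.

    Assumption: id is the running index of the word, stored as a string.
    """
    for idx, word in enumerate(words):
        yield idx, {"id": str(idx), "toxic_word": word}
```

In an actual loader, one such stream would be produced per language code, reading from that language's wordlist file.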

IvanHalimP commented 1 year ago

self-assign