IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0
261 stars, 61 forks

Create dataset loader for Toxicity-200 #244

Closed SamuelCahyawijaya closed 1 year ago

SamuelCahyawijaya commented 2 years ago

NusaCatalogue: https://indonlp.github.io/nusa-catalogue/card.html?toxicity_200

Dataset: toxicity_200
Description: Toxicity-200 is a wordlist to detect toxicity in 200 languages. It contains files that include frequent words and phrases generally considered toxic because they represent: 1) frequently used profanities; 2) frequently used insults and hate speech terms, or language used to bully, denigrate, or demean; 3) pornographic terms; and 4) terms for body parts associated with sexual activity.
License: CC-BY-NC 4.0

SamuelCahyawijaya commented 1 year ago

Hi @IvanHalimP, sorry for the confusion. I just remembered that we need a password to open the zip file. The password should be tL4nLLb; you can check it with: unzip --password tL4nLLb [BCP47_code]_twl.zip

Reference: https://github.com/facebookresearch/flores/blob/main/toxicity/README.md
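
For the loader itself, the same check can be done from Python. This is only a minimal sketch, not the project's actual implementation: `read_wordlist` is a hypothetical helper, and it assumes that each archive holds one or more UTF-8 `.txt` files with one entry per line and that the archive uses the legacy ZIP encryption the standard-library `zipfile` module can decrypt.

```python
import zipfile

# Password quoted in the comment above.
PASSWORD = b"tL4nLLb"


def read_wordlist(zip_path, password=PASSWORD):
    """Return the non-empty lines of every .txt member in the archive.

    Hypothetical helper: the member layout (.txt, one entry per line,
    UTF-8) is an assumption, not something confirmed in this thread.
    """
    words = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.endswith(".txt"):
                continue
            # zipfile only uses pwd for members that are actually encrypted.
            with archive.open(name, pwd=password) as handle:
                for raw_line in handle:
                    word = raw_line.decode("utf-8").strip()
                    if word:
                        words.append(word)
    return words
```

If the password is wrong, `zipfile` raises a `RuntimeError` when an encrypted member is opened, so a failed read shows up immediately.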

holylovenia commented 1 year ago

Hello, I'm labelling this issue as source-schema-only since it only provides a toxic wordlist (and we don't have a schema specified for wordlists/lexicons). Please implement the source schema for the ind, ace, bjn, bug, and jav languages with this feature structure: id and toxic_word.
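
A rough sketch of what that feature structure could look like when generating examples. The names `LANGUAGES` and `generate_examples` are illustrative only, and storing `id` as a string is an assumption; the thread only fixes the two field names.

```python
# Language codes requested in the comment above.
LANGUAGES = ["ind", "ace", "bjn", "bug", "jav"]


def generate_examples(words):
    """Yield (key, example) pairs with the requested id/toxic_word fields.

    Hypothetical helper: `words` is any iterable of strings already read
    from one language's wordlist. The id is the running index as a string
    (an assumption, since the thread does not state the type).
    """
    for idx, word in enumerate(words):
        yield idx, {"id": str(idx), "toxic_word": word}
```

In a Hugging Face `datasets` loader, the body of this generator would typically live in the builder's `_generate_examples` method, with one config per language code.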

IvanHalimP commented 1 year ago

self-assign