Inappropriate text and personally identifiable information

IndoNLP / nusa-writes

NusaWrites is an in-depth analysis of corpora collection strategy and a comprehensive language modeling benchmark for underrepresented and extremely low-resource Indonesian local languages.

Apache License 2.0

@altilunium: for the nusa-kalimat (a.k.a. NusaTranslation) subset, we didn't make any changes to the original dataset sources (i.e., [EmoT](https://indonlp.github.io/nusa-catalogue/card.html?emot) and IndoLEM Sentiment). We have no plan to clean up the data, as a cleaned version would no longer reflect the analysis in our paper.

However, we encourage future researchers, especially those focusing on NLP applications, to exercise caution and be aware of the potentially toxic, profane, and otherwise inappropriate content that may exist in our dataset.

Hope it helps!

Inappropriate text and personally identifiable information #18