IndoNLP / nusa-writes

NusaWrites is an in-depth analysis of corpora collection strategy and a comprehensive language modeling benchmark for underrepresented and extremely low-resource Indonesian local languages.
Apache License 2.0

Inappropriate text and personally identifiable information #18

Closed altilunium closed 11 months ago

altilunium commented 1 year ago

While reviewing the dataset, I found several instances of inappropriate text, with personally identifiable information still intact.

Do you have any plans to remove this from the dataset?

SamuelCahyawijaya commented 11 months ago

@altilunium: for the nusa-kalimat (a.k.a. NusaTranslation) subset, we didn't make any changes to the data from the original sources (i.e., [EmoT](https://indonlp.github.io/nusa-catalogue/card.html?emot) and IndoLEM Sentiment). We have no plan to clean up the data, as a cleaned version would no longer reflect the analysis reported in our paper.

However, we would like to encourage future researchers, especially those focusing on NLP applications, to exercise caution and be aware of the potentially toxic, profane, or otherwise inappropriate content that may exist in our dataset.

Hope it helps!