Closed: altilunium closed this issue 11 months ago
@altilunium: for the nusa-kalimat (a.k.a. NusaTranslation) subset, we didn't make any changes to the original source datasets (i.e., [EmoT](https://indonlp.github.io/nusa-catalogue/card.html?emot) and IndoLEM Sentiment). We have no plan to clean up the data, as the cleaned data would no longer reflect the analysis reported in our paper.
However, we would like to encourage future researchers, especially those focusing on NLP applications, to exercise caution and be aware of potentially toxic, profane, or otherwise inappropriate content that may exist in our dataset.
Hope it helps!
While reviewing the dataset, I found several instances of inappropriate text in which personally identifiable information is still intact.
Do you have any plans to remove this from the dataset?