AI4Bharat / indicnlp_catalog

A collaborative catalog of NLP resources for Indic languages
https://ai4bharat.github.io/indicnlp_catalog

Multilingual Bias Detection and Mitigation for Indian Languages #242

Open anoopkunchukuttan opened 6 months ago

anoopkunchukuttan commented 6 months ago

Paper: https://arxiv.org/abs/2312.15181
Dataset: https://drive.google.com/file/d/169-yw7fKC-qB_wJ8Cwv8RusN67Uv7R7j/view

Translated from the Wiki Neutrality Corpus (WNC) and WikiBias English corpora into Indian languages using IndicTrans. Contains parallel (biased, unbiased) sentence pairs.

8 languages: English (en) plus 7 Indian languages: Hindi (hi), Marathi (mr), Bengali (bn), Gujarati (gu), Tamil (ta), Telugu (te) and Kannada (kn).

Overall, the total number of samples for classification is 287.6K for mWikiBias and 280.0K for mWNC. To reduce training compute, we took a random sample from the overall bias mitigation data, leading to 39.4K and 39.0K paired samples in mWikiBias and mWNC, respectively.
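
For anyone who wants to build a similar reduced training set from the full mitigation data, the random sampling step described above could look roughly like the sketch below. The function name `subsample_pairs`, the fixed seed, and the toy data are assumptions made for illustration. This is not the authors' code, and the description above does not say how the sample was drawn beyond it being random.

```python
import random


def subsample_pairs(pairs, k, seed=0):
    """Draw k (biased, unbiased) pairs uniformly at random, without replacement.

    Hypothetical helper: the name, signature, and seed are illustrative only.
    """
    rng = random.Random(seed)  # fixed seed so the same subset can be regenerated
    k = min(k, len(pairs))     # never ask for more pairs than are available
    return rng.sample(pairs, k)


# Toy stand-in for the full bias mitigation data. The real file layout is not
# described in this thread, so the sentences here are placeholders.
pairs = [(f"biased sentence {i}", f"neutral sentence {i}") for i in range(1000)]

subset = subsample_pairs(pairs, k=100)
print(len(subset))
```

Sampling whole pairs rather than individual sentences keeps each biased sentence aligned with its neutral counterpart.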

Has the entire dataset been released?