Multilingual Bias Detection and Mitigation for Indian Languages

Paper: https://arxiv.org/abs/2312.15181

Dataset: https://drive.google.com/file/d/169-yw7fKC-qB_wJ8Cwv8RusN67Uv7R7j/view

Translated from the English Wiki Neutrality Corpus (WNC) and WikiBias corpora into Indian languages using IndicTrans. Contains parallel (biased, unbiased) sentence pairs.

8 languages: English (en) and 7 Indian languages: Hindi (hi), Marathi (mr), Bengali (bn), Gujarati (gu), Tamil (ta), Telugu (te) and Kannada (kn).

Overall, the total number of samples for classification is 287.6K for mWikiBias and 280.0K for mWNC. To reduce training compute, we took a random sample from the overall bias mitigation data, leading to 39.4K and 39.0K paired samples in mWikiBias and mWNC, respectively.
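The random subsampling of paired data described above could be reproduced along these lines. This is only a sketch under stated assumptions: the function name `subsample_pairs`, the seed, and the toy sentences are hypothetical, and the paper's actual sampling procedure is not specified here.

```python
import random


def subsample_pairs(pairs, k, seed=0):
    """Draw k (biased, unbiased) pairs uniformly at random, without replacement.

    Hypothetical helper: the seed and the uniform sampling scheme are
    assumptions, not the authors' documented method.
    """
    if k > len(pairs):
        raise ValueError("k exceeds the number of available pairs")
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    return rng.sample(pairs, k)


# Toy stand-in for the parallel data: each item is (biased, unbiased).
pairs = [(f"biased sentence {i}", f"neutral sentence {i}") for i in range(1000)]
subset = subsample_pairs(pairs, k=100, seed=42)
```

Sampling whole pairs, rather than each side separately, keeps every biased sentence aligned with its unbiased counterpart, and a fixed seed makes the subset reproducible.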

Has the entire dataset been released?
