Translated from Wiki Neutrality Corpus and WikiBias English corpora to Indian languages using IndicTrans
Contains parallel (biased, unbiased) sentence pairs
8 languages: English (en) plus 7 Indian languages, namely Hindi (hi), Marathi (mr), Bengali (bn), Gujarati (gu), Tamil (ta), Telugu (te) and Kannada (kn).
Overall, the total number of samples for classification is 287.6K and 280.0K for mWikiBias and mWNC, respectively. To reduce training compute, we took a random sample from the overall bias mitigation data, leading to 39.4K and 39.0K paired samples in mWikiBias and mWNC, respectively.
Paper: https://arxiv.org/abs/2312.15181
Dataset: https://drive.google.com/file/d/169-yw7fKC-qB_wJ8Cwv8RusN67Uv7R7j/view
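For readers wondering what the random subsampling of paired data might look like, here is a minimal sketch. The description above does not say how the sample was drawn, so the function name `sample_pairs`, the fixed seed, and the toy sentences are all hypothetical; this is one plausible way to do it, not the authors' procedure.

```python
import random


def sample_pairs(pairs, k, seed=0):
    """Draw k (biased, unbiased) pairs uniformly at random, without replacement.

    The seed is an assumption added here for reproducibility.
    """
    rng = random.Random(seed)
    # Cap k so a request larger than the data does not raise ValueError.
    return rng.sample(pairs, min(k, len(pairs)))


# Toy stand-in for the parallel data; the real corpus is not loaded here.
pairs = [(f"biased sentence {i}", f"neutral sentence {i}") for i in range(1000)]
subset = sample_pairs(pairs, k=100)
print(len(subset))
```

Sampling whole pairs rather than individual sentences keeps each biased sentence aligned with its unbiased counterpart, which is what the mitigation task needs.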
Has the entire dataset been released?