Closed Prasanta-Hembram closed 2 years ago
I fear that all the datasets have the same script-name error for the Santali language.
Thanks, investigating.
Hello, we have confirmed that this is indeed a mislabeling. Our language identification, machine translation model, training data, evaluation data, and toxicity lists are in the Ol Chiki script of Santali. We will update the labels shortly and also update our writeup to reflect this.
Note that the toxicity list currently available is for Santali in Bengali script. We will be adding a list for Santali in Ol Chiki.
Thank you so much for your feedback!
Hi, I have downloaded FLORES-200 from here. When I reviewed the data for the Santali language, the file was named sat_Beng.devtest (where the script code denotes Santali in Bengali script) in both the dev and devtest folders, but when I viewed the file, all of its contents were in Ol Chiki script (sat_Olck). Hence, I request that the file name be changed from sat_Beng.devtest to sat_Olck.devtest.
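For anyone who wants to check a downloaded file themselves, here is a minimal sketch of how the mismatch between the script code in a file name and the script of its contents could be detected. It counts characters in the Unicode Bengali block (U+0980 to U+09FF) and the Ol Chiki block (U+1C50 to U+1C7F). The helper names `dominant_script` and `corrected_name` are hypothetical and not part of FLORES-200 or any of its tooling, and the sketch assumes file names of the form `lang_Script.split`.

```python
from collections import Counter

# Unicode block ranges keyed by ISO 15924 script code.
SCRIPT_RANGES = {
    "Beng": (0x0980, 0x09FF),  # Bengali block
    "Olck": (0x1C50, 0x1C7F),  # Ol Chiki block
}


def dominant_script(text):
    """Return the script code with the most characters in text, or None."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for code, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[code] += 1
                break
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def corrected_name(filename, text):
    """Suggest a file name whose script code matches the contents.

    Hypothetical helper: assumes names such as 'sat_Beng.devtest'.
    Returns the name unchanged when no known script is detected
    or when the label already matches.
    """
    stem, dot, split = filename.partition(".")
    lang, _, labelled = stem.partition("_")
    detected = dominant_script(text)
    if detected is None or detected == labelled:
        return filename
    return f"{lang}_{detected}{dot}{split}"
```

To use it, one would read the file as UTF-8 and pass its name and contents to `corrected_name`; a file labelled `sat_Beng.devtest` whose text is mostly Ol Chiki would get the suggestion `sat_Olck.devtest`. This only distinguishes the two scripts discussed in this issue and is not a general script detector.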