Request for the correction of Santali script name

facebookresearch / flores

Facebook Low Resource (FLoRes) MT Benchmark

Request for the correction of Santali script name #52

Closed Prasanta-Hembram closed 2 years ago

Prasanta-Hembram commented 2 years ago

Hi, I downloaded FLORES-200 from here. When I reviewed the data for the Santali language, I found that the file is named sat_Beng.devtest (the script code indicates Santali in Bengali script) in both the dev and devtest folders, but the contents of the file are entirely in Ol Chiki script (sat_Olck). I therefore request that the file be renamed from sat_Beng.devtest to sat_Olck.devtest.
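
A mismatch of this kind can be checked mechanically by comparing the script code in the file name with the Unicode block that most of the text falls in. The following is a minimal sketch, not part of the FLORES tooling: the function names `dominant_script` and `label_matches` are hypothetical, and it assumes file names follow the `language_Script.split` pattern seen above. The block ranges are the Unicode ranges for Bengali (U+0980 to U+09FF) and Ol Chiki (U+1C50 to U+1C7F).

```python
from collections import Counter

# Unicode block ranges for the two scripts discussed in this issue.
SCRIPT_RANGES = {
    "Beng": (0x0980, 0x09FF),  # Bengali
    "Olck": (0x1C50, 0x1C7F),  # Ol Chiki
}


def dominant_script(text):
    """Return the script code covering the most characters, or None."""
    counts = Counter()
    for ch in text:
        cp = ord(ch)
        for code, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[code] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def label_matches(filename, text):
    """Compare the script code in a name like 'sat_Beng.devtest' with the text.

    Assumes the name has the form '<language>_<Script>.<split>'.
    """
    script_label = filename.split(".")[0].split("_")[1]
    return script_label == dominant_script(text)
```

Under these assumptions, calling `label_matches("sat_Beng.devtest", text)` on text written in Ol Chiki returns `False`, which is the situation reported here.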

Prasanta-Hembram commented 2 years ago

I fear that all the datasets have the same script-name error for the Santali language.

huihuifan commented 2 years ago

Thanks, investigating.

huihuifan commented 2 years ago

Hello, we have confirmed that this is indeed a mislabeling. Our language identification, machine translation model, training data, evaluation data, and toxicity lists are in the Ol Chiki script of Santali. We will update the labels shortly and also update our writeup to reflect this.

Note that the toxicity list currently available is for Santali in Bengali script. We will be adding a list for Santali in Ol Chiki.

Thank you so much for your feedback!