facebookresearch / flores

Facebook Low Resource (FLoRes) MT Benchmark

Central Kurdish Problems #50

Open Sarchia opened 2 years ago

Sarchia commented 2 years ago

A few months ago I looked at the Central Kurdish file and noticed several problems. Now that the second version of FLORES is available, it seems the same file from the first version has been reused without any improvements.

The problems include non-standard Unicode characters, inconsistent spellings, and wrong translations. To fix the wrong translations and spellings, you would need to send the English and Central Kurdish files to a professional Kurdish translator or linguist to review them and correct the mistakes.

The non-standard Unicode characters affect many sentences, but fortunately there are tools (like this) that fix this easily. I hope you can do this step for now and, if possible, then send the file for review.

huihuifan commented 2 years ago

We worked with two different groups of translators to produce the Central Kurdish FLORES dataset we currently have. Could you provide some examples of these errors, and any volume assessment you have? This can help us understand in more detail the suggested fixes, and we can go back to our translators.

Sarchia commented 2 years ago

Hello Angela @huihuifan

Here are some examples of the issues, picked at random.

  1. Using non-standard Unicode characters
     - No. 13: لەم دواییەدا، بەرامبەر ڕۆنیك دۆڕا لە پاڵەوانێتی تێنسی کراوەی بریسبندا.
     - No. 22: هسی وتیشی ما زیاتر شێواز بووە وەك لە مادە.
     - No. 96: داواکاریەك دانرا بۆ لێکۆڵینەوە.
     - No. 213: هیچ کەسێك لەناو شوقەکەدا نەبوو.
     - No. 331: هیچ کەسێك هێندەی بۆبەك یاری نەکردووە یان گۆڵی تۆمار نەکردووە بۆ یانەکە.

  2. Incorrect and inconsistent spellings
     - No. 56: بەڕێز کۆستێلۆ وتی کاتێك بەرهەمهێنانی وزەی ئەتۆمی لە ڕووی ئابوریەوە گونجاو دەبێت، ئوستورالیا هەوڵی بەکارهێنانی دەدات.
     - No. 199: جگە لە تا و قورگ ئێشە، ئازاری ترم نیە و دەتونم لەڕێی تەلەفونەوە کارەکەم بکەم.
     - No. 246: دڵخۆشم خەلکانك هەن ئامادەن پشتیوانیم لێبکەن.
     - No. 335: ئینجا دەورێشە سەماکەرەکان هاتنە سەر ستەیج.
     - No. 573: کۆمەڵگەی مێرولەی سەرباز بەکۆمەڵ دەرۆن و هێلانە دروست دەکەن بە قۆناغی جیاجیا.

  3. Incorrect and strange translations
     - No. 214: لەو کاتەدا نزیکەی ١٠٠ دانیشتوان چۆڵکران لە ناوچەکەدا.
     - No. 245: nan
     - No. 304: خۆی کردۆتە بەڵگەنامە لە کتێبێکی ساڵی ١٩٩٨ دا.
     - No. 585: مرۆڤ هەزاران ساڵە هاوێنە دروست دەکەن بۆ گەورەکردن.
     - No. 909: کۆمپانیاکانی ڕاگەیاندن بەشێوەیەکی ڕۆتینی لەسەر ئامانجی ئەمە درۆدەکەن، بانگەشەی ئەوە دەکەن کە "ڕێگریلەدزیەکی"یە.

These are just a few examples; many more can be found. It seems that some of the translators of these sentences are not professional translators, and it is clear that some of them do not even know the correct spelling of Central Kurdish words or its grammar.

My suggestion is to have the file translated again by a professional translator, or at least to give it to someone to review and fix the problems in the current file. In both cases a normalization step would be necessary too, and for that you can use this tool.

huihuifan commented 2 years ago

Thank you! We are discussing with our Central Kurdish translators.

Sarchia commented 2 years ago

Thanks @huihuifan. I recommend involving a professional translator from outside too, because if you review it with the same translators who did the job, they may not notice or admit some mistakes. And if they have questions about any of the examples, I can explain what the problems are.

Regarding the non-standard Unicode characters, I think most people are not aware of the problem. What is incorrect here is that the translators used old keyboard layouts that produce the character "ك". This character is no longer used; it has been replaced by "ک". But even if they use the correct keyboards, I still recommend that you do the normalization step.
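
A minimal sketch of that normalization step, assuming the only mapping needed is the one named above: Arabic kaf (U+0643) replaced by the kaf used in Central Kurdish (U+06A9). The function names here are hypothetical and not part of FLORES or of the tool linked earlier; a full normalizer would likely cover more characters than this.

```python
# Assumption: only the single replacement described in this thread is applied
# (U+0643 ARABIC LETTER KAF -> U+06A9 ARABIC LETTER KEHEH).
ARABIC_KAF = "\u0643"
KURDISH_KAF = "\u06A9"

# Translation table used by str.translate for a one-pass replacement.
_KAF_TABLE = str.maketrans({ARABIC_KAF: KURDISH_KAF})


def normalize_line(line: str) -> str:
    """Return the line with every U+0643 replaced by U+06A9."""
    return line.translate(_KAF_TABLE)


def count_affected(lines) -> int:
    """Count how many lines contain at least one U+0643 character."""
    return sum(1 for line in lines if ARABIC_KAF in line)


if __name__ == "__main__":
    # Hypothetical two-line sample; the first line contains U+0643.
    sample = ["\u0648\u06D5\u0643", "\u0648\u06D5\u06A9"]
    print("lines affected:", count_affected(sample))
    print([normalize_line(s) for s in sample])
```

Running `count_affected` over the whole Central Kurdish file would also give a rough volume figure for this one class of problem, though it says nothing about the spelling or translation issues.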

Also, do you have any plans to create a platform or a demo page for the NLLB-200 model? As you are expanding your language coverage, it would be good to have a platform like Google Translate or DeepL, so that speakers of all these languages can translate their texts.