Closed jordimas closed 3 years ago
Hi, Jordi. Thank you for your valuable feedback. We really appreciate the time you're taking to look into the data closely. I'll do my best to address your points.
Does not reflect well how a human translator translates in these language pairs. [..] A human translator will take the Spanish translation (source) and make the minimum changes needed to translate it into Catalan (target); this includes preserving, by default, the same structure and vocabulary when there is no need to change them.
This is a good point. Translation between closely related languages is often not done with English as the source. Note, though, that this could be the result of practical reasons (e.g. there are more human translators for English->Spanish and Spanish->Catalan than for English->Catalan); it doesn't mean that Catalan translations produced from English are invalid.
While developing FLORES, we decided to start from English for consistency, thus all of our data is English-source translated. We established a rigorous process that made sure that native speakers judged these translations to be sure of their validity. As a result, Spanish and Catalan translations might not be 100% monotonically aligned. And that's fine.
Introduces bias and favors non rule-based machine translation (e.g. neural systems)
To be honest, we didn't design FLORES with one particular MT system in mind. Instead of a disadvantage, I would say that FLORES is leveling the playing field for non-rule-based systems. In the past, rule-based systems might have performed better because they assumed that the order between Spanish and Catalan is constant, but in reality it might not always be.
In any case, human evaluation (instead of only BLEU) would be a great way to fairly assess those differences.
Note: I have very little time, so I'm optimizing for sharing raw feedback based on visual observation. I had no time this week to evaluate this properly using data and a more quantitative approach, but I'm happy to help a few days from now.
Hypothesis
For languages from the same family (we will use Spanish to Catalan as the example moving forward), the Flores dataset potentially has two problems:
a) Does not reflect well how a human translator translates in these language pairs
b) Introduces bias and favors non rule-based machine translation (e.g. neural systems)
Let me elaborate on both.
Does not reflect well how a human translator translates in these language pairs.
Take as an example a sentence from the dataset in Spanish and Catalan. Assume that you are evaluating Spanish to Catalan translation:
Spanish > El 7 de octubre, un motor se separó al despegar, sin dejar heridos. Rusia hizo permanecer en tierra los Il-76 por poco tiempo después de ese accidente.
Catalan > Un motor es va separar durant l'enlairament sense provocar ferits el set d'octubre. Rússia va fer aterrar ràpidament els Il-76 després d'aquell accident.
A human translator will take the Spanish translation (source) and make the minimum changes needed to translate it into Catalan (target); this includes preserving, by default, the same structure and vocabulary when there is no need to change them.
For this sample sentence, this is how a human translator would translate from Spanish > Catalan:
Catalan > El 7 d'octubre, un motor es va separar al despegar, sense deixar ferits. Rússia va fer romandre en terra els Il-76 durant poc temps després d'aquest accident.
The core problem is that a human translating from English to Spanish or from English to Catalan (languages from different families) needs to make some hard decisions because the languages are very different. Different translators will make different decisions regarding vocabulary, grammar structure, etc. When you then compare Spanish to Catalan, the result is not how a human would translate directly from Spanish to Catalan: the pair was produced by pivoting over English, whereas a direct translator would not make unnecessary grammar or vocabulary changes.
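To make the difference visible, here is a rough sketch that compares each Catalan sentence above against the Spanish source at the word level. It uses Python's standard `difflib.SequenceMatcher`, which is only a crude surface-similarity measure and not a translation metric; the helper names `words` and `surface_similarity` are made up for this illustration.

```python
import re
from difflib import SequenceMatcher

# Sentences copied from the example above.
spanish = ("El 7 de octubre, un motor se separó al despegar, sin dejar heridos. "
           "Rusia hizo permanecer en tierra los Il-76 por poco tiempo después de ese accidente.")
flores_catalan = ("Un motor es va separar durant l'enlairament sense provocar ferits el set d'octubre. "
                  "Rússia va fer aterrar ràpidament els Il-76 després d'aquell accident.")
direct_catalan = ("El 7 d'octubre, un motor es va separar al despegar, sense deixar ferits. "
                  "Rússia va fer romandre en terra els Il-76 durant poc temps després d'aquest accident.")


def words(text):
    """Lowercase the text and split it into word tokens (hypothetical helper)."""
    return re.findall(r"[\w'-]+", text.lower())


def surface_similarity(source, target):
    """Word-level similarity in [0, 1]; order-sensitive, so reordering lowers it."""
    matcher = SequenceMatcher(None, words(source), words(target), autojunk=False)
    return matcher.ratio()


flores_similarity = surface_similarity(spanish, flores_catalan)
direct_similarity = surface_similarity(spanish, direct_catalan)

print(f"Spanish vs Flores Catalan: {flores_similarity:.2f}")
print(f"Spanish vs direct Catalan: {direct_similarity:.2f}")
```

The direct translation shares more words, in the same order, with the Spanish source than the Flores sentence does. This is one sentence and a very crude measure, so it only illustrates the hypothesis; it does not prove it.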
Introduces bias and favors non rule-based machine translation (e.g. neural systems)
Your evaluation sentences for Spanish and Catalan do not mimic how a human will translate Spanish to Catalan. They have the same meaning, but the structure and vocabulary have changed for no reason.
Rule-based machine translation systems like Apertium are very effective in languages that are from the same family. They apply transformation rules from source to target that mimic what a human will do.
If you use the Flores dataset to evaluate rule-based systems you will in general score them lower, even if the translation is more accurate and closer to what a human would do. This is because the evaluation data set is not a translation from Spanish > Catalan; instead you have done English > Spanish and then English > Catalan, and you end up with a Spanish > Catalan pair with noise introduced by English.
This problem impacts language pairs like Spanish > Galician, Spanish > Catalan, French > Occitan, Spanish > Occitan, etc.
Probing and quantifying this hypothesis
My suggestions
1) Quantify how many language pairs may be affected by this problem
2) One way to probe this hypothesis and quantify the problem is to ask a human translator to translate from Spanish to Catalan directly and then measure (using spBLEU, for example) how much this direct Spanish to Catalan translation differs from the current ones in Flores, which were produced via English -> Catalan and English -> Spanish
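As a minimal sketch of the comparison in suggestion 2, the snippet below scores the direct Catalan translation from my example against the Flores Catalan sentence using clipped n-gram precision. This is a simplified stand-in, not spBLEU: there is no SentencePiece tokenization, no brevity penalty, and no corpus-level aggregation. The function names are hypothetical, and a real experiment would use a proper spBLEU implementation over many sentences.

```python
import re
from collections import Counter

# Flores Catalan reference and the direct Spanish > Catalan translation from the example.
flores_catalan = ("Un motor es va separar durant l'enlairament sense provocar ferits el set d'octubre. "
                  "Rússia va fer aterrar ràpidament els Il-76 després d'aquell accident.")
direct_catalan = ("El 7 d'octubre, un motor es va separar al despegar, sense deixar ferits. "
                  "Rússia va fer romandre en terra els Il-76 durant poc temps després d'aquest accident.")


def tokenize(text):
    """Lowercase word tokenizer (an assumption; spBLEU uses SentencePiece instead)."""
    return re.findall(r"[\w'-]+", text.lower())


def ngram_counts(tokens, n):
    """Count every n-gram of length n in the token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(hypothesis, reference, n):
    """Fraction of hypothesis n-grams also found in the reference, with clipped counts."""
    hyp = ngram_counts(tokenize(hypothesis), n)
    ref = ngram_counts(tokenize(reference), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    overlap = sum((hyp & ref).values())  # Counter intersection clips to the minimum count
    return overlap / total


unigram_precision = ngram_precision(direct_catalan, flores_catalan, 1)
bigram_precision = ngram_precision(direct_catalan, flores_catalan, 2)

print(f"unigram precision: {unigram_precision:.2f}")
print(f"bigram precision:  {bigram_precision:.2f}")
```

If the hypothesis holds, a faithful direct translation will lose n-gram matches against the Flores reference purely because of reordering and vocabulary choices, and the loss should grow with longer n-grams.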