Closed velocityCavalry closed 3 years ago
Yes, I need to document that better. The problem is that I use macrolanguage codes in the Tatoeba data labels, and each of them may cover several individual languages. msa is one of them; it includes zlm and zsm, among others.
For Norwegian, there are two variants as well: nno and nob. You need to use those labels if you want to produce one of those variants.
The confusion is also caused by the conversion from 3-letter codes to 2-letter codes in the model names on Hugging Face. They decided to stick to 2-letter codes even though the models are trained with the 3-letter ISO codes in mind.
I hope this explains the situation at least a bit.
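To make the labelling concrete, here is a minimal sketch of how a target-language prefix could be built from the labels mentioned above. It assumes the `>>label<<` token convention that the multilingual OPUS-MT model cards describe. The helper name `with_target_prefix`, the mapping `MACRO_TO_INDIVIDUAL`, and the decision to reject macrolanguage codes are illustrative choices made for this sketch, not part of any library. The exact spelling of a label for a given model (some cards may list script suffixes) should be checked against that model's own list of supported labels.

```python
# Macrolanguage codes and the individual-language labels named in this thread.
# The lists are illustrative and not exhaustive (msa covers more than these two).
MACRO_TO_INDIVIDUAL = {
    "msa": ["zlm", "zsm"],  # Malay
    "nor": ["nno", "nob"],  # Norwegian: Nynorsk, Bokmål
}


def with_target_prefix(sentence: str, label: str) -> str:
    """Prepend a >>label<< target-language token to a source sentence.

    Assumption of this sketch: a macrolanguage code is rejected so that the
    caller picks one of the individual-language labels instead.
    """
    if label in MACRO_TO_INDIVIDUAL:
        options = ", ".join(MACRO_TO_INDIVIDUAL[label])
        raise ValueError(f"'{label}' is a macrolanguage code; pick one of: {options}")
    return f">>{label}<< {sentence}"


if __name__ == "__main__":
    print(with_target_prefix("How are you?", "nob"))
```

The prefixed string is what would be passed to the tokenizer of a multilingual model. In the Hugging Face `transformers` library, `MarianTokenizer` exposes a `supported_language_codes` attribute that can be used to confirm which labels a particular model actually knows.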
It helps a lot! Thank you so much!
Hi!
It would be nice to have a table explaining the language codes, as some of them are missing. By language code, I am referring to the output of
Moreover, I've noticed that in the opus-mt-en-mul model card, Malay (ms/msa) is not listed as a supported language, and I couldn't find it in the output above. However, in the Benchmark section, there is a test set listed as
Tatoeba-test.eng-msa.eng.msa
and its corresponding scores are listed. So I am a little confused. Is Malay supported in this model? If yes, how am I supposed to test the model on English-to-Malay translation? More specifically, what prefix should I put before each sentence to translate it from English to Malay?
A similar situation can be found in opus-mt-en-gem, where Norwegian (no/nor) is not listed as a supported language, yet
Tatoeba-test.eng-nor.eng.nor
is listed as a test set. The output is listed below:
Thank you so much!