Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

Language code table, and inconsistency between the benchmark test sets & supported languages #55

Closed · velocityCavalry closed this 3 years ago

velocityCavalry commented 3 years ago

Hi!

It would be nice to have a table explaining the language codes, since some of them are undocumented. By language code, I am referring to the output of:

>>> from transformers import MarianMTModel, MarianTokenizer
>>> tokenizer = MarianTokenizer.from_pretrained('pretrained_models/opus-mt-en-mul')  # en-mul as an example here
>>> print(tokenizer.supported_language_codes)
['>>ewe<<', '>>sna<<', '>>lin<<', '>>toi_Latn<<', '>>ceb<<', '>>oss<<', '>>run<<', '>>mfe<<', '>>ilo<<', '>>zlm_Latn<<', '>>pes<<', '>>smo<<', '>>hil<<', '>>niu<<', '>>sag<<', '>>fij<<', '>>cmn_Hans<<', '>>nya<<', '>>tso<<', '>>war<<', '>>gil<<', '>>hau_Latn<<', '>>umb<<', '>>glv<<', '>>tvl<<', '>>ton<<', '>>zul<<', '>>kal<<', '>>pag<<', '>>cmn_Hant<<', '>>pus<<', '>>abk<<', '>>pap<<', '>>hat<<', '>>mkd<<', '>>tuk_Latn<<', '>>yor<<', '>>tuk<<', '>>sqi<<', '>>tir<<', '>>mlg<<', '>>tur<<', '>>ido_Latn<<', '>>mai<<', '>>ibo<<', '>>srp_Cyrl<<', '>>srp_Latn<<', '>>kir_Cyrl<<', '>>heb<<', '>>bos_Latn<<', '>>bak<<', '>>ast<<', '>>som<<', '>>tah<<', '>>chv<<', '>>kek_Latn<<', '>>lug<<', '>>vie<<', '>>wln<<', '>>isl<<', '>>hye<<', '>>mah<<', '>>yue_Hant<<', '>>crh_Latn<<', '>>amh<<', '>>nds<<', '>>pan_Guru<<', '>>xho<<', '>>ukr<<', '>>cat<<', '>>afr<<', '>>tat<<', '>>guj<<', '>>jpn<<', '>>mon<<', '>>eus<<', '>>nob<<', '>>glg<<', '>>ind<<', '>>sin<<', '>>cym<<', '>>zho_Hant<<', '>>zho_Hans<<', '>>tgk_Cyrl<<', '>>aze_Latn<<', '>>ltz<<', '>>bod<<', '>>asm<<', '>>tel<<', '>>urd<<', '>>kaz_Cyrl<<', '>>lat_Latn<<', '>>gla<<', '>>kan<<', '>>bul<<', '>>kin<<', '>>ina_Latn<<', '>>ron<<', '>>spa<<', '>>csb_Latn<<', '>>iba<<', '>>tha<<', '>>nno<<', '>>hrv<<', '>>fry<<', '>>bre<<', '>>mar<<', '>>sme<<', '>>swe<<', '>>deu<<', '>>jav<<', '>>snd_Arab<<', '>>ben<<', '>>cmn<<', '>>ces<<', '>>ita<<', '>>fin<<', '>>por<<', '>>hin<<', '>>hun<<', '>>mal<<', '>>pol<<', '>>fra<<', '>>nld<<', '>>epo<<', '>>slv<<', '>>hsb<<', '>>kur_Latn<<', '>>ori<<', '>>tam<<', '>>bel<<', '>>dan<<', '>>ara<<', '>>mya<<', '>>rus<<', '>>mri<<', '>>est<<', '>>uzb_Latn<<', '>>lao<<', '>>yid<<', '>>uzb_Cyrl<<', '>>uig_Arab<<', '>>lit<<', '>>zho<<', '>>lav<<', '>>ell<<', '>>kat<<', '>>gle<<', '>>mlt<<', '>>khm<<', '>>oci<<', '>>kur_Arab<<', '>>ang_Latn<<', '>>kaz_Latn<<', '>>wol<<', '>>sun<<', '>>chr<<', '>>tat_Latn<<', '>>mhr<<', '>>tyv<<', '>>rom<<', '>>cha<<', '>>kab<<', '>>nav<<', '>>arg<<', '>>khm_Latn<<', '>>bul_Latn<<', '>>udm<<', '>>quc<<', '>>cor<<', '>>san_Deva<<', '>>fao<<', '>>bel_Latn<<', '>>jbo_Latn<<', '>>yue<<', '>>grn<<', '>>sco<<', '>>arq<<', '>>ltg<<', '>>yue_Hans<<', '>>min<<', '>>nan<<', '>>bam_Latn<<', '>>ido<<', '>>ile_Latn<<', '>>wuu<<', '>>crh<<', '>>tlh_Latn<<', '>>lzh<<', '>>jbo<<', '>>lzh_Hans<<', '>>vol_Latn<<', '>>lfn_Latn<<', '>>arz<<']

Moreover, I've noticed that in the opus-mt-en-mul model card, Malay (ms/msa) is not listed as a supported language, and I couldn't find it in the output above. However, the Benchmark section lists a test set Tatoeba-test.eng-msa.eng.msa along with its scores, so I am a little confused. Is Malay supported by this model? If so, how am I supposed to test translating from English to Malay? More specifically, what prefix should I prepend to each sentence to translate it from English to Malay?

A similar situation occurs with opus-mt-en-gem, where Norwegian (no/nor) is not listed as a supported language, yet Tatoeba-test.eng-nor.eng.nor is listed as a test set. The output is shown below:

>>> tokenizer = MarianTokenizer.from_pretrained('pretrained_models/opus-mt-en-gem')
>>> print(tokenizer.supported_language_codes)
['>>isl<<', '>>nob<<', '>>nds<<', '>>afr<<', '>>deu<<', '>>swe<<', '>>nno<<', '>>fry<<', '>>nld<<', '>>ltz<<', '>>dan<<', '>>yid<<', '>>ang_Latn<<', '>>fao<<', '>>sco<<']

Thank you so much!

jorgtied commented 3 years ago

Yes, I need to document that in a better way. The problem is that I use macrolanguages in the Tatoeba data labels, and they may contain various individual languages. msa is one of them, and it includes zlm and zsm, among others.
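For example, to get Malay out of opus-mt-en-mul you prepend one of the individual-language tokens that the tokenizer actually lists, such as >>zlm_Latn<< (Malay in Latin script). A minimal sketch, assuming the model is fetched from the Hugging Face hub and using a made-up example sentence:

>>> from transformers import MarianMTModel, MarianTokenizer
>>> tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-mul')
>>> model = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-en-mul')
>>> # the target-language token goes in front of the source sentence
>>> batch = tokenizer(['>>zlm_Latn<< How are you today?'], return_tensors='pt', padding=True)
>>> translated = model.generate(**batch)
>>> print(tokenizer.batch_decode(translated, skip_special_tokens=True))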

For Norwegian, there are two variants as well: nno (Nynorsk) and nob (Bokmål). You need to use those labels if you want to produce one of those variants.
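Along the same lines, something like this should give you the two written variants from opus-mt-en-gem (again a sketch with an invented sentence; both tokens appear in the supported_language_codes output you posted):

>>> tokenizer = MarianTokenizer.from_pretrained('Helsinki-NLP/opus-mt-en-gem')
>>> model = MarianMTModel.from_pretrained('Helsinki-NLP/opus-mt-en-gem')
>>> # one target-language token per input sentence
>>> texts = ['>>nob<< Where is the train station?', '>>nno<< Where is the train station?']
>>> batch = tokenizer(texts, return_tensors='pt', padding=True)
>>> print(tokenizer.batch_decode(model.generate(**batch), skip_special_tokens=True))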

The confusion is also caused by the conversion from 3-letter codes to 2-letter codes in the model names at Hugging Face. They decided to stick to 2-letter codes even though the models are trained with the 3-letter ISO codes in mind.

I hope this explains the situation at least a bit.

velocityCavalry commented 3 years ago

> Yes, I need to document that in a better way. The problem is that I use macrolanguages in the Tatoeba data labels, and they may contain various individual languages. msa is one of them, and it includes zlm and zsm, among others.
>
> For Norwegian, there are two variants as well: nno (Nynorsk) and nob (Bokmål). You need to use those labels if you want to produce one of those variants.
>
> The confusion is also caused by the conversion from 3-letter codes to 2-letter codes in the model names at Hugging Face. They decided to stick to 2-letter codes even though the models are trained with the 3-letter ISO codes in mind.
>
> I hope this explains the situation at least a bit.

It helps a lot! Thank you so much!