Helsinki-NLP / OPUS-MT-train

Training open neural machine translation models
MIT License

[Language Codes] How are models named? #7

Closed sshleifer closed 4 years ago

sshleifer commented 4 years ago

For example, in cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh-de, is there a table or some other source for what zh_HK, zh_yue, yue, etc. represent? Is zh_yue different from yue? Is zh_cn different from cn somehow?

Thanks in advance!

jorgtied commented 4 years ago

This is a problem with OPUS, which is, unfortunately, not always consistent in its use of language IDs. Some corpora use different regional variants, upper-case letters in the region code, or 3-letter language codes instead of the 2-letter ISO 639-1 codes that I use otherwise. I should fix these inconsistencies in OPUS, but that is a major undertaking. An intermediate solution would be to find a mapping before training the OPUS-MT models. That is also not very easy...
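As a rough illustration of what such a mapping step could look like, here is a minimal sketch that only handles the casing inconsistency (zh_cn vs. zh_CN). The function names are hypothetical and not part of OPUS or OPUS-MT. It assumes that a two-letter suffix is a region code, which is not always true (no_nb, for instance), and it cannot decide whether codes like zh_yue and yue mean the same thing; those cases would still need a hand-written table.

```python
from typing import List


def normalize_lang_id(lang_id: str) -> str:
    """Lower-case the language part and upper-case a two-letter suffix.

    Assumption: a two-letter suffix is a region code (zh_cn -> zh_CN).
    Longer suffixes such as 'yue' in zh_yue are only lower-cased.
    """
    parts = lang_id.split("_")
    lang = parts[0].lower()
    if len(parts) == 1:
        return lang
    suffix = parts[1]
    suffix = suffix.upper() if len(suffix) == 2 else suffix.lower()
    return f"{lang}_{suffix}"


def dedupe_lang_ids(joined: str) -> List[str]:
    """Split a '+'-joined ID list, normalize each ID, drop duplicates in order."""
    result: List[str] = []
    seen = set()
    for raw in joined.split("+"):
        norm = normalize_lang_id(raw)
        if norm not in seen:
            seen.add(norm)
            result.append(norm)
    return result
```

Applied to the source side of the model name from the first comment, this would merge zh_cn with zh_CN and zh_tw with zh_TW, and leave every other ID as a separate entry.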

sshleifer commented 4 years ago

Would you mind if, on the transformers side, we named that model zh-de or zh_group-de and documented the original name in a model card? Do you foresee a naming collision?

jorgtied commented 4 years ago

Difficult to say whether this will be the best solution. There is not always a straightforward group name, and slightly different selections of languages could end up with a similar or identical group name. For example, assume that I train a new multilingual model for Celtic languages to English and I include or exclude a language for various reasons (too little data, new data ...) compared to previous models. That could be a problem if the model is called celtic-en.
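To make the collision concrete, here is a small sketch of a registry that refuses to reuse a group name for a different set of languages. Everything in it (the `register_group` helper, the exception class, the language lists passed in) is hypothetical and only illustrates the concern; it is not how OPUS-MT or transformers stores names.

```python
class GroupNameCollision(ValueError):
    """Raised when a group name is reused for a different language set."""


def register_group(registry, name, lang_ids):
    """Record name -> language set, rejecting a conflicting re-registration."""
    langs = frozenset(lang_ids)
    existing = registry.get(name)
    if existing is not None and existing != langs:
        changed = sorted(existing ^ langs)  # IDs present in only one of the sets
        raise GroupNameCollision(
            f"{name!r} already covers a different selection; differing IDs: {changed}"
        )
    registry[name] = langs
```

Registering "celtic" once with six languages and later with five would raise the error, which is the celtic-en situation described above.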

By the way, how will you handle new versions of the same model? I use the package date to name the models. How will this be done at huggingface?

sshleifer commented 4 years ago

jorgtied commented 4 years ago

I will try to see if I can use some other naming conventions in the future. I could possibly introduce some kind of reasonable group names as well. Having all the language IDs in the path name makes it easier to see explicitly which languages are supported, but I can understand that this becomes a bit tedious as well.

About 2: how do you handle new versions of other models at this point? I assume this happens for other LMs as well, since things improve and change rapidly. Or do people simply create a new model with a completely new name? For example, improved language-specific models might get some updates. At least for the MT models, I will try to release as quickly as possible to make things available, but that may lead to various updates of existing models. Is that a problem?

sshleifer commented 4 years ago

Group names would be lovely. Maybe put the language IDs and a link to the training data in the README.md?

For the already trained models, I propose the following names:

The format is group_name <- list_of_langs:

'zh_group' <- 'cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh'

'romance_group' <- 'fr+fr_BE+fr_CA+fr_FR+wa+frp+oc+ca+rm+lld+fur+lij+lmo+es+es_AR+es_CL+es_CO+es_CR+es_DO+es_EC+es_ES+es_GT+es_HN+es_MX+es_NI+es_PA+es_PE+es_PR+es_SV+es_UY+es_VE+pt+pt_br+pt_BR+pt_PT+gl+lad+an+mwl+it+it_IT+co+nap+scn+vec+sc+ro+la'

'scandinavia_group' <- 'da+fo+is+no+nb+nn+sv-da+fo+is+no+nb+nn+sv'
'north_eu_group' <- 'de+nl+fy+af+da+fo+is+no+nb+nn+sv-de+nl+fy+af+da+fo+is+no+nb+nn+sv'
'en_el_es_fi' <- 'en+el+es+fi'
'sami' <- 'se+sma+smj+smn+sms'
'norway' <- 'nb_NO+nb+nn_NO+nn+nog+no_nb+no'
'celtic' <- 'ga+cy+br+gd+kw+gv' (see https://en.wikipedia.org/wiki/Insular_Celtic_languages)
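A sketch of how this table could be kept as data, so that the original name can always be recovered from the short one for the model card. The `GROUPS` dict copies three entries from the list above; the helper names are hypothetical and this is not existing transformers code.

```python
# Three entries copied from the proposal above; the rest would follow the same shape.
GROUPS = {
    "zh_group": "cmn+cn+yue+ze_zh+zh_cn+zh_CN+zh_HK+zh_tw+zh_TW+zh_yue+zhs+zht+zh",
    "sami": "se+sma+smj+smn+sms",
    "celtic": "ga+cy+br+gd+kw+gv",
}


def expand_group(name):
    """Return the individual language IDs behind a group name."""
    return GROUPS[name].split("+")


def original_name(src, tgt):
    """Rebuild the original '<source>-<target>' name, expanding known groups."""
    return f"{GROUPS.get(src, src)}-{GROUPS.get(tgt, tgt)}"


def groups_containing(lang_id):
    """List every group whose language list includes lang_id."""
    return [name for name in GROUPS if lang_id in expand_group(name)]
```

With this, `original_name("zh_group", "de")` gives back the long name from the first comment, and a plain code such as "de" passes through unchanged.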

Versioning: We don't have a standard way. It's up to the poster. Some people just tag them with v1, e.g. Musixmatch/umberto-commoncrawl-cased-v1, but I could add some aliasing code to make it more like a pip install, where users get the latest by default, but people who want compatibility can pin a version.
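Roughly what I mean by aliasing, as a sketch only: nothing like this exists yet, the `resolve_version` helper is hypothetical, and the dates in `RELEASES` are made up for illustration. It assumes versions are package dates in YYYY-MM-DD form, so the latest one is simply the largest string.

```python
# Made-up release dates, keyed by a proposed short model name.
RELEASES = {
    "zh_group-de": ["2020-01-16", "2020-03-02"],
}


def resolve_version(releases, name, version=None):
    """Return the pinned version if given, otherwise the latest release date."""
    versions = releases[name]
    if version is None:
        # ISO-style dates sort chronologically when compared as strings.
        return max(versions)
    if version not in versions:
        raise KeyError(f"{name} has no release {version}")
    return version
```

Calling it without a version behaves like an unpinned install, while passing an explicit date keeps a user on the release they tested against.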

Updating: Nothing will auto-update at the moment. I can run the update job whenever you'd like (in overwrite mode).