facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

MBART Training: Missing mbart_large model architecture #2024

Closed: shola-lawal closed this issue 4 years ago

shola-lawal commented 4 years ago

Hi, I am trying to follow the mBART training steps in fairseq/examples/mbart/README.md; however, it appears that mbart_large is missing from the list of available installed fairseq model architectures.

Could you advise where I can find the mbart_large model architecture?

Below is the error that is generated:

fairseq-train: error: argument --arch/-a: invalid choice: 'mbart_large' (choose from 'lightconv', 'lightconv_iwslt_de_en', 'lightconv_wmt_en_de', 'lightconv_wmt_en_de_big', 'lightconv_wmt_en_fr_big', 'lightconv_wmt_zh_en_big', 'transformer', 'transformer_iwslt_de_en', 'transformer_wmt_en_de', 'transformer_vaswani_wmt_en_de_big', 'transformer_vaswani_wmt_en_fr_big', 'transformer_wmt_en_de_big', 'transformer_wmt_en_de_big_t2t', 'transformer_align', 'transformer_wmt_en_de_big_align', 'nonautoregressive_transformer', 'nonautoregressive_transformer_wmt_en_de', 'iterative_nonautoregressive_transformer', 'iterative_nonautoregressive_transformer_wmt_en_de', 'levenshtein_transformer', 'levenshtein_transformer_wmt_en_de', 'levenshtein_transformer_vaswani_wmt_en_de_big', 'levenshtein_transformer_wmt_en_de_big', 'insertion_transformer', 'fconv', 'fconv_iwslt_de_en', 'fconv_wmt_en_ro', 'fconv_wmt_en_de', 'fconv_wmt_en_fr', 'lightconv_lm', 'lightconv_lm_gbw', 'bart_large', 'roberta', 'roberta_base', 'roberta_large', 'xlm', 'fconv_lm', 'fconv_lm_dauphin_wikitext103', 'fconv_lm_dauphin_gbw', 'fconv_self_att', 'fconv_self_att_wp', 'masked_lm', 'bert_base', 'bert_large', 'xlm_base', 'wav2vec', 'lstm', 'lstm_wiseman_iwslt_de_en', 'lstm_luong_wmt_en_de', 'cmlm_transformer', 'cmlm_transformer_wmt_en_de', 'multilingual_transformer', 'multilingual_transformer_iwslt_de_en', 'transformer_lm', 'transformer_lm_big', 'transformer_lm_baevski_wiki103', 'transformer_lm_wiki103', 'transformer_lm_baevski_gbw', 'transformer_lm_gbw', 'transformer_lm_gpt', 'transformer_lm_gpt2_small', 'transformer_lm_gpt2_medium', 'transformer_lm_gpt2_big', 'transformer_from_pretrained_xlm')

Thank you in advance.

kalyangvs commented 4 years ago

Please install fairseq from master and check again; the last released version (0.9.0) predates the mBART commit.
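After reinstalling, a minimal sketch like the one below can confirm the architecture is registered. It assumes fairseq's internal ARCH_MODEL_REGISTRY (the registry that fairseq-train draws its --arch choices from) is importable; that is an internal detail and may change between versions.

# Minimal sketch: check that mbart_large is a registered --arch choice.
# Assumes a master install of fairseq; importing fairseq.models
# populates ARCH_MODEL_REGISTRY as a side effect.
from fairseq.models import ARCH_MODEL_REGISTRY

print("mbart_large" in ARCH_MODEL_REGISTRY)  # expect True on master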

shola-lawal commented 4 years ago

Hi @kalyangvs,

Thank you for the feedback. It worked!!

git clone https://github.com/pytorch/fairseq
cd fairseq
pip install .

One last question: I am translating en -> gu, and I am not sure I am interpreting the language tags correctly. Could you kindly explain what the en_XX, gu_IN, and ar_AR values mean in the context of the langs variable below?

langs=ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,ru_RU,si_LK,tr_TR,vi_VN,zh_CN

Thanks again.

kalyangvs commented 4 years ago

I suppose it is because some languages are spoken in many countries across the world. To differentiate a language as spoken in a particular country or region, the language code is appended with a region code. en_XX: XX likely means English in any or every region. gu_IN: Gujarati with the India region code. ar_AR: ar means Arabic, but I am unsure which region AR refers to; it might mean Saudi Arabia. Source.
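To make the tag format concrete, here is an illustrative sketch of how the comma-separated langs string corresponds to the per-language special tokens that mBART appends to its dictionary. The [xx_XX] bracket format matches the released mbart.cc25 checkpoint; treat the snippet as an illustration, not fairseq internals.

# Illustrative only: map each language code in the langs string to the
# bracketed special token mBART adds to its vocabulary.
langs = ("ar_AR,cs_CZ,de_DE,en_XX,es_XX,et_EE,fi_FI,fr_XX,gu_IN,hi_IN,"
         "it_IT,ja_XX,kk_KZ,ko_KR,lt_LT,lv_LV,my_MM,ne_NP,nl_XX,ro_RO,"
         "ru_RU,si_LK,tr_TR,vi_VN,zh_CN")
lang_tokens = ["[{}]".format(code) for code in langs.split(",")]

# For en -> gu translation, the relevant pair is [en_XX] (source side)
# and [gu_IN] (target side).
print(lang_tokens[3], lang_tokens[8])  # [en_XX] [gu_IN]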

SunbowLiu commented 4 years ago

Hi @shola-lawal,

How did you preprocess the data for training and validation?