facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

NLLB inference - short tutorial #5292

Open · gordicaleksa opened this issue 10 months ago

gordicaleksa commented 10 months ago

As a ton of people don't know how to use the NLLB project, here is a short tutorial to get it running directly through fairseq!

  1. Follow the installation instructions here. For the most part they worked for me; I had to manually pip install certain packages (regex, six, openpyxl, translate-toolkit, xxhash, fasttext) depending on the part of the NLLB codebase I was playing with.

  2. Download the SPM-200 model & dictionary. You'll find them in the preparing the data README.

  3. Download the NLLB checkpoints from the modeling README. With a 12 GB VRAM GPU you'll be able to run all dense checkpoints (up to 3.3B parameters) using fp16 - at 2 bytes per parameter, the 3.3B model's weights alone are ~6.6 GB, which leaves headroom for activations. The MoE 54B checkpoint is ~400 GB, so I won't focus on it here as most people can't run it - and if you can, you probably know how to do this anyway. :)

  4. Use the fairseq_cli/interactive.py script for inference.

  5. Pass in these arguments (note: I use VS Code, so you can simply place this under a launch.json config; a plain-Python equivalent follows the block below):

    "args": [
        "/dummypath",
        "--input", "/home/aleksa/Projects/nllb/fairseq/examples/nllb/data/tico/eng-spa/tico19.eng-spa.eng",
        "--source-dict", "/home/aleksa/Projects/nllb/fairseq/dictionary.txt",
        "--target-dict", "/home/aleksa/Projects/nllb/fairseq/dictionary.txt",
        "--path", "/home/aleksa/Projects/nllb/fairseq/model_checkpoints/checkpoint_3.3B.pt",
        "--task", "translation_multi_simple_epoch",
        "--langs", "ace_Arab,ace_Latn,acm_Arab,acq_Arab,aeb_Arab,afr_Latn,ajp_Arab,aka_Latn,amh_Ethi,apc_Arab,arb_Arab,ars_Arab,ary_Arab,arz_Arab,asm_Beng,ast_Latn,awa_Deva,ayr_Latn,azb_Arab,azj_Latn,bak_Cyrl,bam_Latn,ban_Latn,bel_Cyrl,bem_Latn,ben_Beng,bho_Deva,bjn_Arab,bjn_Latn,bod_Tibt,bos_Latn,bug_Latn,bul_Cyrl,cat_Latn,ceb_Latn,ces_Latn,cjk_Latn,ckb_Arab,crh_Latn,cym_Latn,dan_Latn,deu_Latn,dik_Latn,dyu_Latn,dzo_Tibt,ell_Grek,eng_Latn,epo_Latn,est_Latn,eus_Latn,ewe_Latn,fao_Latn,pes_Arab,fij_Latn,fin_Latn,fon_Latn,fra_Latn,fur_Latn,fuv_Latn,gla_Latn,gle_Latn,glg_Latn,grn_Latn,guj_Gujr,hat_Latn,hau_Latn,heb_Hebr,hin_Deva,hne_Deva,hrv_Latn,hun_Latn,hye_Armn,ibo_Latn,ilo_Latn,ind_Latn,isl_Latn,ita_Latn,jav_Latn,jpn_Jpan,kab_Latn,kac_Latn,kam_Latn,kan_Knda,kas_Arab,kas_Deva,kat_Geor,knc_Arab,knc_Latn,kaz_Cyrl,kbp_Latn,kea_Latn,khm_Khmr,kik_Latn,kin_Latn,kir_Cyrl,kmb_Latn,kon_Latn,kor_Hang,kmr_Latn,lao_Laoo,lvs_Latn,lij_Latn,lim_Latn,lin_Latn,lit_Latn,lmo_Latn,ltg_Latn,ltz_Latn,lua_Latn,lug_Latn,luo_Latn,lus_Latn,mag_Deva,mai_Deva,mal_Mlym,mar_Deva,min_Latn,mkd_Cyrl,plt_Latn,mlt_Latn,mni_Beng,khk_Cyrl,mos_Latn,mri_Latn,zsm_Latn,mya_Mymr,nld_Latn,nno_Latn,nob_Latn,npi_Deva,nso_Latn,nus_Latn,nya_Latn,oci_Latn,gaz_Latn,ory_Orya,pag_Latn,pan_Guru,pap_Latn,pol_Latn,por_Latn,prs_Arab,pbt_Arab,quy_Latn,ron_Latn,run_Latn,rus_Cyrl,sag_Latn,san_Deva,sat_Olck,scn_Latn,shn_Mymr,sin_Sinh,slk_Latn,slv_Latn,smo_Latn,sna_Latn,snd_Arab,som_Latn,sot_Latn,spa_Latn,als_Latn,srd_Latn,srp_Cyrl,ssw_Latn,sun_Latn,swe_Latn,swh_Latn,szl_Latn,tam_Taml,tat_Cyrl,tel_Telu,tgk_Cyrl,tgl_Latn,tha_Thai,tir_Ethi,taq_Latn,taq_Tfng,tpi_Latn,tsn_Latn,tso_Latn,tuk_Latn,tum_Latn,tur_Latn,twi_Latn,tzm_Tfng,uig_Arab,ukr_Cyrl,umb_Latn,urd_Arab,uzn_Latn,vec_Latn,vie_Latn,war_Latn,wol_Latn,xho_Latn,ydd_Hebr,yor_Latn,yue_Hant,zho_Hans,zho_Hant,zul_Latn",
        "--lang-pairs", "eng_Latn-spa_Latn",
        "--source-lang", "eng_Latn",
        "--target-lang", "spa_Latn",
        "--encoder-langtok", "src",
        "--decoder-langtok",
        "--beam", "5",
        "--bpe", "sentencepiece",
        "--sentencepiece-model", "/home/aleksa/Projects/nllb/fairseq/flores200sacrebleuspm",
        "--add-data-source-prefix-tags",
        "--fp16",
        "--max-sentences", "1",
        "--buffer-size", "1",
      ]
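
If you don't use VS Code, the same arguments can be passed from plain Python by filling in sys.argv and calling fairseq's interactive entry point. A minimal sketch, assuming the NLLB branch of fairseq is installed - every path below is a placeholder for your own copy of the files from steps 2 and 3:

    # Sketch: the same invocation as the launch.json config above, no VS Code needed.
    import sys
    from fairseq_cli.interactive import cli_main

    LANGS = "..."  # paste the full comma-separated list from --langs above

    sys.argv = [
        "interactive.py",
        "/dummypath",                      # positional arg: required, but unused here
        "--input", "tico19.eng-spa.eng",   # one source sentence per line; "-" = stdin
        "--source-dict", "dictionary.txt",
        "--target-dict", "dictionary.txt",
        "--path", "checkpoint_3.3B.pt",
        "--task", "translation_multi_simple_epoch",
        "--langs", LANGS,
        "--lang-pairs", "eng_Latn-spa_Latn",
        "--source-lang", "eng_Latn",
        "--target-lang", "spa_Latn",
        "--encoder-langtok", "src",
        "--decoder-langtok",
        "--beam", "5",
        "--bpe", "sentencepiece",
        "--sentencepiece-model", "flores200sacrebleuspm",
        "--add-data-source-prefix-tags",
        "--fp16",
        "--max-sentences", "1",            # batch size; see point 2 below
        "--buffer-size", "1",              # must be >= max-sentences
    ]
    cli_main()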

Additional explanations for the above args:

  1. I figured out how to run inference indirectly: I tried following the Generation/Evaluation modeling section, realized it isn't meant for ad-hoc inference, and instead analyzed which arguments it used to call the generate.py script. That was my starting point.

  2. You can experiment with max-sentences (keep buffer-size at least as big, otherwise you'll hit an error). Setting it to 1 makes the model translate sentences one by one - i.e. it's basically the batch size.

  3. Without add-data-source-prefix-tags you'll hit an error; the flag adds 3 additional tokens to the vocabulary (explained in the paper - they tag whether a training example comes from mining, backtranslation, or the primary dataset, hence 3).

  4. Modify the lang params (lang-pairs, source-lang, target-lang) to get the desired direction; I've set it to English -> Spanish above.

  5. You have to pass all of these languages to langs, otherwise you'll again hit an error. I extracted them from the flores200.full.yaml file (under fairseq/examples/nllb/modeling/evaluation/conf/model_config/flores200.full.yaml).

  6. You'll get dictionary.txt by following step 2 above (the SPM download); same for flores200sacrebleuspm. There's a quick sanity check for the SPM model right after this list.

  7. For input you can leave the default value (-) if you want to pass sentences in interactively via stdin. I passed in a file with one English sentence per line.

  8. The first (positional) argument is required by the argument parser but serves no purpose here, hence "/dummypath".
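
As a quick sanity check for step 2 (and point 6), you can load the SPM model directly with the sentencepiece Python package - a sketch, assuming pip-installed sentencepiece and with the path adjusted to wherever you put the file:

    # Load the SPM-200 model and round-trip a sentence through it.
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor(model_file="flores200sacrebleuspm")
    pieces = sp.encode("The weather is nice today.", out_type=str)
    print(pieces)             # the subword pieces the NLLB models consume
    print(sp.decode(pieces))  # decodes back to the original sentence

If this loads and round-trips cleanly, the --sentencepiece-model argument above should work too.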

Notes on alternatives to running NLLB inference:
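
If you just need translations and don't care about fairseq itself, the dense NLLB-200 checkpoints are also published on Hugging Face (e.g. facebook/nllb-200-distilled-600M up to facebook/nllb-200-3.3B), and running them through the transformers library is arguably simpler. A minimal sketch, assuming transformers and torch are installed; the target language is picked by forcing its tag as the first decoded token:

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    model_name = "facebook/nllb-200-distilled-600M"
    tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn")
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    inputs = tokenizer("The weather is nice today.", return_tensors="pt")
    generated = model.generate(
        **inputs,
        # Force the target-language tag as the first generated token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("spa_Latn"),
        max_length=128,
        num_beams=5,  # matches --beam 5 from the fairseq setup above
    )
    print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])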

Hope this helps someone! :))

Questions for the Meta NLLB team:

  1. Are you planning on merging the NLLB branch into the main codebase? If so, are there any ETAs? :)
  2. What are the reasons for the model being non-commercial? Is it mainly due to the data licensing? I'm definitely not pointing fingers - just trying to understand the decision if possible.

Thank you for the amazing work!

Mycatinjuly commented 6 months ago

If I want to use NLLB to fine-tune on an unseen language pair, how can I solve the vocabulary problem?