Since a ton of people don't know how to use the NLLB project, here is a short tutorial to get it running directly through fairseq!
Follow the installation instructions here. For the most part it worked for me, though I had to manually pip install certain packages like regex, six, openpyxl, translate-toolkit, xxhash, and fasttext, depending on the part of the NLLB codebase I was playing with.

Download the SPM-200 model & dictionary. You'll find them in the "Preparing the data" README.
Download the NLLB checkpoints from the modeling README. With a 12 GB VRAM GPU you'll be able to run all dense checkpoints (up to 3.3B) using fp16. The MoE 54B checkpoint is ~400 GB, so I won't focus on it here, as most people can't run it - and if you can, you probably know how to do this either way. :)

Use the fairseq_cli/interactive.py script for inference and pass in the arguments explained below (note: I use VS Code, so you can simply place the whole invocation under a launch.json config).

Additional explanations for the args:
I figured out how to run inference indirectly: I tried following the Generation/Evaluation modeling section and realized it's not meant to be used for inference, so I analyzed which arguments it used to call the generate.py script. That was my starting point.
You can experiment with max-sentences (keep buffer-size at least as big, otherwise you'll hit an error). If you set it to 1, the model will translate sentences one by one - i.e. this is basically the batch size.
Without add-data-source-prefix-tags you'll hit an error; it adds 3 additional tokens to the vocabulary (explained in the paper - data is tagged according to whether it comes from their mining, backtranslation, or primary datasets, hence 3).
Modify the lang params (lang-pairs, source-lang, target-lang) to get the desired direction; I've set it to English -> Spanish.
You have to pass all of these langs to langs, otherwise you'll again hit an error. I extracted them from the flores200.full.yaml file (under fairseq/examples/nllb/modeling/evaluation/conf/model_config/flores200.full.yaml).
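If you're scripting this, the value for langs is just the FLORES-200 codes joined with commas. A minimal sketch - the four codes below are a hand-picked illustration, the real flag needs the full list from that YAML file:

```python
# Build the comma-separated value expected by the langs argument.
# NOTE: this is a tiny illustrative subset - the actual flag needs the
# full language list extracted from flores200.full.yaml.
flores_codes = ["eng_Latn", "spa_Latn", "fra_Latn", "deu_Latn"]
langs_arg = ",".join(flores_codes)
print(langs_arg)  # eng_Latn,spa_Latn,fra_Latn,deu_Latn
```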
You'll find dictionary.txt by following step 2 above (the SPM stuff). Same for flores200sacrebleuspm.
For input you can leave the default value (-) if you want to pass sentences in interactively (stdin). I passed in a file that has one English sentence per line.
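For reference, a quick way to produce such a file (the filename and sentences are my own placeholders):

```python
# Write a couple of English sentences, one per line, to use as the input file.
sentences = ["The weather is nice today.", "Where is the train station?"]
with open("eng_sentences.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sentences) + "\n")
```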
The first argument is needed to run the code but serves no purpose, hence "/dummypath".
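Putting the flags described above together, here's a rough sketch of what goes into launch.json. Everything in it is illustrative: the checkpoint path, the task name, the input filename, and the truncated langs list are placeholders you'll need to fill in from the steps above - only the flag names already mentioned in the explanations are from my actual setup:

```json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "NLLB eng_Latn -> spa_Latn",
            "type": "python",
            "request": "launch",
            "program": "${workspaceFolder}/fairseq_cli/interactive.py",
            "args": [
                "/dummypath",                                // required positional arg, unused
                "--path", "/checkpoints/nllb_dense_3.3B.pt", // placeholder checkpoint path
                "--task", "translation_multi_simple_epoch",  // assumed task name
                "--langs", "eng_Latn,spa_Latn,...",          // full list from flores200.full.yaml
                "--lang-pairs", "eng_Latn-spa_Latn",
                "--source-lang", "eng_Latn",
                "--target-lang", "spa_Latn",
                "--add-data-source-prefix-tags",
                "--max-sentences", "1",
                "--buffer-size", "1",
                "--input", "eng_sentences.txt",              // or "-" for stdin
                "--fp16"
            ]
        }
    ]
}
```

(dictionary.txt and flores200sacrebleuspm also get passed in; I've left those flags out of the sketch.)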
Questions for the Meta NLLB team:

Are you planning on merging the NLLB branch into the main codebase? If so, are there any ETAs? :)

What are the reasons for the model being non-commercial? Is it mainly due to the data licensing? I'm definitely not pointing fingers - just trying to understand the decision, if possible.
Notes on alternatives to running NLLB inference:
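One alternative worth noting (my addition, not part of the original steps): the dense NLLB checkpoints have also been ported to HuggingFace transformers, so if you just want translations you can skip fairseq entirely. A sketch using the distilled 600M checkpoint (downloads the model on first run; languages are FLORES-200 codes):

```python
# Sketch: NLLB-200 inference via HuggingFace transformers instead of fairseq.
from transformers import pipeline

translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",  # English
    tgt_lang="spa_Latn",  # Spanish
)
print(translator("The weather is nice today.")[0]["translation_text"])
```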
Hope this helps someone! :))
Thank you for the amazing work!