facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Is it possible to remove all other languages from NLLB200 except English and German? #4961

Closed anasalmahmud closed 1 year ago

anasalmahmud commented 1 year ago

Greetings Everyone,

I am starting to learn Deep Learning (especially Machine Translation). Recently I found that Facebook released pre-trained models like M2M100 and NLLB200 on Hugging Face.

But I have a few questions about these models. As you all know, NLLB200 can translate in more than 200 x 200 = 40,000 directions because it is designed for multilingual purposes. That's why these pre-trained models are so large, which brings me to my question:

“Is it possible to delete or split this pre-trained model into only two languages?”

What I mean is: can the model drop all other languages and directions except English and German, so that it only translates English to German and German to English?

(I only need 2 directions, not 40,000.)

By doing this, the model would shrink to a much smaller size, which is what I need.

Your expert advice and support will be invaluable to me, and I eagerly await your reply.

gwenzek commented 1 year ago

Hi, the model doesn't have explicit per-language weights, so there is no trivial way of reducing it.
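If you just want to run those two directions with an off-the-shelf checkpoint (without shrinking anything), a minimal sketch with the Hugging Face transformers translation pipeline could look like the following. The choice of the distilled 600M checkpoint and the exact pipeline usage are my assumptions here, not something specific to this issue; the distilled model is still a full 200-language model, just with fewer parameters overall.

```python
# Minimal sketch: use an NLLB checkpoint for only the En<->De directions
# via the Hugging Face transformers translation pipeline.
from transformers import pipeline

# Assumed checkpoint: the smaller distilled 600M variant of NLLB-200.
CHECKPOINT = "facebook/nllb-200-distilled-600M"

en_de = pipeline("translation", model=CHECKPOINT,
                 src_lang="eng_Latn", tgt_lang="deu_Latn")
de_en = pipeline("translation", model=CHECKPOINT,
                 src_lang="deu_Latn", tgt_lang="eng_Latn")

print(en_de("The weather is nice today.")[0]["translation_text"])
print(de_en("Das Wetter ist heute schön.")[0]["translation_text"])
```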

If you want a smaller model, you can take English-German data and train a smaller translation model on it. The training data is a combination of the WMT En-De datasets as well as the CCMatrix dataset.

You can find CCMatrix on statmt: https://data.statmt.org/cc-matrix/ (de-en) or huggingface: https://huggingface.co/datasets/allenai/nllb (deu_Latn-eng_Latn)
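As a rough sketch of pulling the German-English bitext from the Hugging Face dataset linked above, something like the following may work; the config name and field layout are assumed from that dataset page, and recent versions of the datasets library may additionally require trust_remote_code=True.

```python
# Rough sketch: stream the deu_Latn-eng_Latn pairs from the allenai/nllb dataset.
# Streaming avoids downloading the full (very large) corpus up front.
from datasets import load_dataset

bitext = load_dataset(
    "allenai/nllb",
    "deu_Latn-eng_Latn",   # one config per language pair, as on the dataset page
    split="train",
    streaming=True,
)

# Each record is assumed to carry a "translation" dict keyed by language code.
for i, example in enumerate(bitext):
    de = example["translation"]["deu_Latn"]
    en = example["translation"]["eng_Latn"]
    print(f"{de}\t{en}")
    if i >= 4:  # just peek at the first few pairs
        break
```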