facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Fine-tune a pretrained fairseq translation model on a new language pair #5238

Closed molokanov50 closed 1 year ago

molokanov50 commented 1 year ago

Hi team,

Is it possible to fine-tune a pretrained fairseq multilingual translation model on a new language pair, where one language is already seen (say, English) and the other is not seen by the pretrained model? If yes, is the procedure the same as usual, i.e., create a dataset and implement a training/fine-tuning script? What about the transformer-based subset of fairseq models and, in particular, NLLB?

alpha-21 commented 1 year ago

Hello, do you have an answer?

molokanov50 commented 1 year ago

@alpha-21 Hello. My scope is transformer-based translators and, in particular, NLLB-200. See https://stackoverflow.com/questions/65927060/fine-tune-bert-for-a-specific-domain-on-a-different-language. Moreover, in NLLB-200 not only the dictionary but also the language set is fixed, since language names are special tokens included in the pretrained model's dictionary. They are hardcoded in many library files, have their own IDs and, most importantly, they form the pretrained model's embedding matrix together with all the other tokens. As soon as we add a new language to all the library files, we get an initialization error when loading the pretrained model: an inconsistency between the number of languages and the shape of the embedding matrix. That is why my main hypothesis is that it is impossible to fine-tune NLLB-200 on a new language that is not included in the pretrained model.
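
A minimal sketch of that size constraint, using the HuggingFace port of NLLB-200 rather than the fairseq checkpoint (an assumption made for brevity; `xyz_Latn` is a made-up code for the new language):

```python
# Language codes are ordinary vocabulary entries, i.e., rows of the
# embedding matrix; adding a new one forces a resize.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/nllb-200-distilled-600M")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

print(tok.convert_tokens_to_ids("eng_Latn"))      # a fixed vocabulary id
print(model.get_input_embeddings().weight.shape)  # (vocab_size, d_model)

# A new code grows the vocabulary, so the embedding matrix no longer matches
# the checkpoint and must be resized (the new row is randomly initialized):
tok.add_special_tokens({"additional_special_tokens": ["xyz_Latn"]})
model.resize_token_embeddings(len(tok))
```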

qunash commented 1 year ago

There's a workaround: you can repurpose an existing language within the model. Pick a language, for example one that had the least data during the initial training. Then prepare your new dataset labeled as if it were for the chosen language and fine-tune the model on it. For inference, simply use the replaced language's token.
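
A minimal sketch of the data side of this trick, assuming for illustration the HuggingFace port of m2m100 (mentioned below); the language choices and the tiny parallel pair are placeholders:

```python
# Label the new-language data "as if" it were the sacrificed language.
from transformers import M2M100Tokenizer

tok = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
tok.src_lang = "en"  # a language the model already knows
tok.tgt_lang = "zu"  # repurposed slot: this token will mean the new language

pairs = [("Hello", "NEW_LANGUAGE_TRANSLATION_HERE")]  # your real parallel data
batch = tok([src for src, _ in pairs],
            text_target=[tgt for _, tgt in pairs],
            return_tensors="pt", padding=True)
# The labels are prefixed with the "zu" language token, so ordinary seq2seq
# fine-tuning on batches like this teaches the model to produce the new
# language whenever that token is requested.
```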

I did this with m2m100, replacing Zulu with Kabardian, and it worked decently well. The translation quality is not the best, but if you have no other options it's a good starting point.
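
At inference time it's enough to force the repurposed token as the first generated token; a minimal sketch under the same assumptions (HuggingFace m2m100 port, Zulu as the sacrificed slot):

```python
# After fine-tuning on data labeled as Zulu, forcing the "zu" token
# makes the model emit the new language instead.
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

tok = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")

tok.src_lang = "en"
inputs = tok("Hello world", return_tensors="pt")
out = model.generate(**inputs, forced_bos_token_id=tok.get_lang_id("zu"))
print(tok.batch_decode(out, skip_special_tokens=True))
```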

alpha-21 commented 1 year ago

@qunash Thanks. I want to fine-tune for Fula; it already exists, but only for Nigerian Fula. So I want to fine-tune for Guinean Fula. Do you have a document or a Python file to help me? I already have Fula-to-French data. Thank you for your help.