Closed: molokanov50 closed this issue 1 year ago
Hello, do you have an answer?
@alpha-21 Hello. My scope is BERT-based translators and, in particular, NLLB-200; see https://stackoverflow.com/questions/65927060/fine-tune-bert-for-a-specific-domain-on-a-different-language. Moreover, in NLLB-200 not only the dictionary but also the language set is fixed, since language names are special tokens included in the pretrained model's dictionary. They are hard-coded in many library files, have their own IDs and, most importantly, together with the other tokens they form the pretrained model's embedding matrix. As soon as we add a new language to all the library files, we get an initialization error when loading the pretrained model: an inconsistency between the number of languages and the shape of the embedding matrix. That's why my main hypothesis is that it's impossible to fine-tune NLLB-200 on a new language that is not included in the pretrained model.
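To illustrate the point about language codes being ordinary vocabulary entries tied to rows of the embedding matrix, here is a minimal inspection sketch using the Hugging Face port of NLLB-200 (the fairseq checkpoint stores the same information in its dictionary files). The checkpoint name is the public distilled 600M variant; any NLLB-200 checkpoint should behave the same way:

```python
# Inspect how NLLB-200 represents language codes: they are regular tokens
# with fixed IDs, so the embedding matrix shape must match the vocabulary.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

print(tokenizer.convert_tokens_to_ids("eng_Latn"))  # a language code has a fixed ID
print(len(tokenizer))                                # vocabulary size, language codes included
print(model.get_input_embeddings().weight.shape)     # must be consistent, or loading fails
```

This is why naively appending a new language code to the dictionary breaks checkpoint loading: the pretrained embedding matrix has no row for the new token.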
There's a workaround: you can repurpose an existing language within the model. Select a language, for example one that had the least amount of data during the initial training. Then prepare your new dataset labeled as if it were for the chosen language, and fine-tune the model on this dataset. For inference, simply use the repurposed language's token.
I did this with m2m100, replacing Zulu with Kabardian, and it worked decently well. The translation quality is not the best, but if you have no other options, it's a good starting point.
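For illustration, a minimal inference sketch of this trick with m2m100, assuming the model has already been fine-tuned on Kabardian data labeled as Zulu ("zu") and that we translate from the repurposed language into English; the input string is a placeholder:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

# Pretend Kabardian is Zulu: reuse the existing "zu" code that the
# fine-tuned model now associates with Kabardian.
tokenizer.src_lang = "zu"
encoded = tokenizer("a Kabardian sentence goes here", return_tensors="pt")
generated = model.generate(**encoded,
                           forced_bos_token_id=tokenizer.get_lang_id("en"))
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```

The same pattern applies to NLLB-200, with the language codes ("zul_Latn" etc.) in place of m2m100's two-letter codes.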
@qunash Thanks. I want to fine-tune for Fula; it already exists, but only for the Fula of Nigeria. So I want to fine-tune for the Fula of Guinea. Do you have a document or Python file to help me? I already have Fula-to-French data. Thank you for your help.
Hi team,
Is it possible to fine-tune a pretrained fairseq multilingual translation model on a new language pair, where one language is already seen (say, English) and the other is not seen by the pretrained model? If yes, is the procedure the same as usual, i.e., create a dataset and implement a training/fine-tuning script? And what about the transformer-based subset of fairseq models and, in particular, NLLB?
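Putting the thread's workaround together, a minimal fine-tuning sketch using the Hugging Face port of NLLB-200 rather than fairseq itself; the dataset column names ("src", "tgt"), the hyperparameters, and the choice of "zul_Latn" as the repurposed code for the unseen language are all placeholder assumptions:

```python
# Fine-tune NLLB-200 on an English -> unseen-language corpus by reusing
# an existing language code for the unseen language, as discussed above.
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

tokenizer.src_lang = "eng_Latn"  # the language the model already knows
tokenizer.tgt_lang = "zul_Latn"  # existing code repurposed for the new language

def preprocess(batch):
    # "src"/"tgt" are placeholder column names for the parallel corpus.
    return tokenizer(batch["src"], text_target=batch["tgt"],
                     truncation=True, max_length=128)

# train_dataset = raw_dataset.map(preprocess, batched=True)  # your own corpus

args = Seq2SeqTrainingArguments(output_dir="nllb-finetuned",
                                per_device_train_batch_size=8,
                                learning_rate=1e-5,
                                num_train_epochs=3)
trainer = Seq2SeqTrainer(model=model, args=args,
                         # train_dataset=train_dataset,
                         data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))
# trainer.train()
```

This sidesteps the embedding-matrix mismatch entirely, since no new token is added; whether the result is acceptable depends on how much the repurposed language's pretrained representations interfere with the new one.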