facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Finetuning pretrained translation model on new vocabulary #5226

Open molokanov50 opened 1 year ago

molokanov50 commented 1 year ago

Hi. My goal is to finetune a large transformer-based MT model (e.g. NLLB-200-1.3B) on new words that are outside the model's vocabulary. I managed to finetune it only from a technical point of view, without paying attention to dataset construction, and now I need the community's expertise on the theoretical side of this issue. The pretrained model has 250k tokens in its vocabulary, and according to the NLLB paper, it was pretrained on 21B sentences.

I constructed a pivot dataset for finetuning this model on an existing language pair. It consists of 30 entries in total: 10 of them are the new words on their own, and the remaining 20 are example sentences using these words (two examples per word); a short sketch of how I lay these files out is included after the examples below.

When testing my finetuned translator, it became clear that the model tries to choose the most contextually appropriate word according to the data it was pretrained and finetuned on, taken together. As a result, some confusion of meaning occurs, for example (actual vs. expected translation):

- the painter used the passport → the painter used a passe-partout
- you need a special passport for this photo → this photograph needs a special passe-partout
- he was waving at me → he was threatening me
- he likes to scratch a bottle of beer → he likes to drink a bottle of beer
- they like to mess with their friends → they like to drink alcohol with friends
- if you're going to throw away clothes... → if you spoil your clothes...
- this spine bugger is always breaking his toys → this harmful child is always breaking his toys
- this movie is too scary, not for children → this movie is too vulgar, not for children
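For reference, a minimal sketch of how I lay out the 30 entries as parallel text files (file names and the entries themselves are just placeholders):

```python
# Lay out the 30-entry finetuning set (10 new words + 20 example sentences,
# two per word) as plain parallel text files, one entry per line.
# File names and the entries below are placeholders for illustration.
entries = [
    ("<new word in source language>", "<its translation, e.g. passe-partout>"),
    ("<example sentence 1 using the word>", "<its reference translation>"),
    ("<example sentence 2 using the word>", "<its reference translation>"),
    # ... repeated for each of the 10 new words
]

with open("finetune.src", "w") as f_src, open("finetune.tgt", "w") as f_tgt:
    for src, tgt in entries:
        f_src.write(src.strip() + "\n")
        f_tgt.write(tgt.strip() + "\n")
```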

Intuitively, it seems that in order to finetune a translator on a new word, a large number of sentences containing that word are needed. In practice, I need to either pull the model's weights towards these words during finetuning (e.g. by oversampling the sentences that contain them) or adjust the finetuning setup itself, for example by adding more epochs and allowing a little overfitting.
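A minimal sketch of the oversampling idea (file names, the word list, and the factor are arbitrary, just to illustrate):

```python
# Oversample sentence pairs that mention any of the new words so that
# finetuning sees them more often than the rest of the data.
NEW_WORDS = {"passe-partout"}   # the new target-side terms (placeholder)
OVERSAMPLE_FACTOR = 20          # arbitrary repetition factor

with open("train.src") as f_src, open("train.tgt") as f_tgt, \
     open("train.upsampled.src", "w") as out_src, \
     open("train.upsampled.tgt", "w") as out_tgt:
    for src, tgt in zip(f_src, f_tgt):
        # repeat pairs containing a new word, keep all other pairs once
        repeats = OVERSAMPLE_FACTOR if any(w in src or w in tgt for w in NEW_WORDS) else 1
        for _ in range(repeats):
            out_src.write(src)
            out_tgt.write(tgt)
```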

Are there any recommendations on how to construct my finetuning dataset of new words and the phrases containing them? What volume should it have in order to achieve translation quality comparable to what a pretrained (non-finetuned) NLLB model is capable of demonstrating on its "native" test datasets? Any literature?
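For what it's worth, this is roughly how I check whether the finetuned checkpoint has regressed relative to the pretrained one on a held-out set, using sacrebleu (file paths are placeholders; hypotheses are produced beforehand by the two checkpoints):

```python
# Compare pretrained vs. finetuned outputs on the same held-out set.
# All files are plain text, one segment per line, line-aligned.
from sacrebleu.metrics import BLEU, CHRF

with open("heldout.ref") as f:
    refs = [line.strip() for line in f]
with open("heldout.hyp.pretrained") as f:
    hyp_pretrained = [line.strip() for line in f]
with open("heldout.hyp.finetuned") as f:
    hyp_finetuned = [line.strip() for line in f]

bleu, chrf = BLEU(), CHRF()
for name, hyp in [("pretrained", hyp_pretrained), ("finetuned", hyp_finetuned)]:
    print(name, bleu.corpus_score(hyp, [refs]), chrf.corpus_score(hyp, [refs]))
```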

Mycatinjuly commented 11 months ago

Have you solved this problem?

molokanov50 commented 10 months ago

@Mycatinjuly No, there is still too little research in this direction.

Rao2321 commented 7 months ago

@molokanov50 Do you have any documentation for finetuning on a translation task?