Hi. My goal is to finetune a large Transformer-based MT model (e.g. NLLB-200-1.3B) on new words that are out of the model's vocabulary.
So far I have managed to finetune it only from a technical point of view, without paying attention to dataset construction, and now I need the community's expertise on the theoretical foundations of this issue.
The pretrained model has a 250k-token vocabulary, and according to the NLLB paper it was pretrained on 21B sentences.
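For concreteness, this is how I check what "out of vocabulary" means here: the SentencePiece tokenizer never emits <unk> for such a word, it just splits it into known subword pieces. A minimal sketch, assuming the Hugging Face checkpoint facebook/nllb-200-1.3B (the word is just an illustration):

```python
from transformers import AutoTokenizer

# Assumption: the Hugging Face checkpoint "facebook/nllb-200-1.3B" is the model in question.
tokenizer = AutoTokenizer.from_pretrained("facebook/nllb-200-1.3B", src_lang="eng_Latn")

# An "out-of-vocabulary" word is not mapped to <unk>; the SentencePiece model
# simply segments it into several subword pieces that are in the 250k vocab.
print(tokenizer.tokenize("passe-partout"))  # prints the subword split (no <unk>)
```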
I constructed a pivot dataset for finetuning this model in an existing language pair. It consists of 30 entries in total: 10 of them are the new words in isolation, and the remaining 20 are example sentences using these words (two examples per word).
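Roughly, the finetuning file looks like this (the field names and the target side are placeholders I made up for illustration; the real entries are in my language pair):

```python
# Hypothetical layout of the 30-entry finetuning set: 10 isolated new words
# plus 2 example sentences per word. Field names are my own, not a standard.
pairs = [
    {"src": "passe-partout",                                 "tgt": "<new word in the target language>"},
    {"src": "the painter used a passe-partout",              "tgt": "<target sentence>"},
    {"src": "this photograph needs a special passe-partout", "tgt": "<target sentence>"},
    # ... 9 more words, each with its own 2 example sentences
]
```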
When testing my finetuned translator, it became clear that the model picks the word that is most contextually appropriate according to the data it was pretrained and finetuned on (taken together).
As a result, the meaning gets confused, for example (actual vs. expected translation):
the painter used the passport - the painter used a passe-partout
you need a special passport for this photo - this photograph needs a special passe-partout
he was waving at me - he was threatening me
he likes to scratch a bottle of beer - he likes to drink a bottle of beer
they like to mess with their friends - they like to drink alcohol with friends
if you're going to throw away clothes... - if you spoil your clothes...
this spine bugger is always breaking his toys - this harmful child is always breaking his toys
this movie is too scary, not for children - this movie is too vulgar, not for children
Intuitively, it seems that finetuning a translator on a new word requires a large number of sentences containing that word. In effect, I need to either pull the model's weights towards these words with more data, or set up the finetuning accordingly, e.g. run more epochs and allow a little overfitting (a rough sketch of the setup I mean is below).
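For reference, this is roughly the training setup I have been using (a minimal sketch, assuming the facebook/nllb-200-1.3B checkpoint and the Hugging Face Seq2SeqTrainer API; the language codes and hyperparameter values are placeholders, not a recommendation):

```python
from datasets import Dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "facebook/nllb-200-1.3B"
# src_lang / tgt_lang are placeholders for the actual language pair
tokenizer = AutoTokenizer.from_pretrained(model_name,
                                          src_lang="eng_Latn", tgt_lang="deu_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(example):
    # text_target tokenizes the target side and stores it as labels
    return tokenizer(example["src"], text_target=example["tgt"],
                     truncation=True, max_length=128)

# `pairs` is the 30-entry list sketched above
train = Dataset.from_list(pairs).map(preprocess, remove_columns=["src", "tgt"])

args = Seq2SeqTrainingArguments(
    output_dir="nllb-new-words",
    num_train_epochs=20,              # deliberately many epochs for a tiny dataset
    learning_rate=1e-5,               # small LR so pretrained knowledge is not wiped out
    per_device_train_batch_size=4,
    save_strategy="no",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```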
Are there any recommendations on how to construct a finetuning dataset of new words and phrases containing them? What volume is needed to achieve translation quality comparable with what a pretrained (non-finetuned) NLLB model demonstrates on its "native" test datasets?
Any literature?