AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License

Fine-tuning the model on an unseen language pair #12

Closed avibrantsoul closed 1 year ago

avibrantsoul commented 1 year ago

Can someone tell me how to fine-tune the IndicTrans2 model for Bhojpuri to English translation?

PranjalChitale commented 1 year ago

To extend the models to unseen languages, you need to add the tokens specific to the new languages to the existing vocabulary, along with their corresponding language tags.

This entails expanding the model's fairseq dictionaries and embedding matrices so that they remain consistent with each other, enabling subsequent fine-tuning on the unseen languages.

Bhojpuri is written in the Devanagari script and shares linguistic similarities with Hindi and Maithili, so it is reasonable to assume that the current vocabulary already provides satisfactory coverage of its subwords.

Therefore, a potential quick fix is to substitute an unused tag in the fairseq dictionary (dict.src.txt) with the Bhojpuri language tag, bho_Deva, after which fine-tuning is possible.
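A minimal sketch of that quick fix, assuming the unused entries are fairseq's `madeupwordNNNN` padding tokens (adjust the prefix to whatever unused tags your copy of the dict actually contains):

```python
# Rename one unused placeholder entry in the fairseq dictionary to the
# new language tag. Fairseq dicts have one "token count" pair per line.
path = "dict.src.txt"
with open(path, encoding="utf-8") as f:
    lines = f.readlines()

for i, line in enumerate(lines):
    if line.startswith("madeupword"):  # assumed placeholder prefix
        _, count = line.split()
        lines[i] = f"bho_Deva {count}\n"  # keep the frequency column intact
        break

with open(path, "w", encoding="utf-8") as f:
    f.writelines(lines)
```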

To incorporate additional languages while preserving performance on the existing ones, techniques such as adapter-tuning can also be considered.

You can check this repository for adapter-based fine-tuning.
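For context, a bottleneck adapter is a small residual MLP inserted into each transformer layer; during fine-tuning only the adapter parameters are updated, so the base model's weights (and its performance on the original languages) stay untouched. A minimal PyTorch sketch of the idea, not the linked repository's exact implementation:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Residual down-project / activation / up-project block.

    Inserted after a transformer sub-layer; only these parameters are
    trained while the rest of the model stays frozen.
    """

    def __init__(self, d_model: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, d_model)
        self.activation = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.activation(self.down(x)))
```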

avibrantsoul commented 1 year ago

Thank you for the quick reply! I will try this solution. Sorry to bother you again with silly questions. I am just a beginner in MT. Can you please help me with the following questions as well?

  1. For Bhojpuri to English, I just need to add the 'bho_Deva' token to the fairseq dictionary and make no changes to the SP model. Is that correct?
  2. If I want to translate from English to a language that uses the same script as English (Latin script) but has no vocabulary overlap with English, how should I fine-tune the IndicTrans2 model?

PranjalChitale commented 1 year ago

  1. Yes.
  2. Train a new SPM model on a sample of the data for the languages you wish to introduce, and use it to binarize the data (you can use prepare_data_joint_training.sh). Then update the SPM model provided with IndicTrans2 by adding the new tokens, and update the fairseq dicts in the same way: append the tokens unique to the unseen languages, along with the language tags for those languages. In both cases, add only new tokens and ensure no duplicates are introduced. A sketch of the SPM merge follows.
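A minimal sketch of the SPM merge in step 2, assuming the sentencepiece Python package (file names and hyperparameters here are hypothetical); the fairseq dicts can be extended analogously by appending one `token count` line per new token:

```python
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

# 1. Train an SPM model on a sample of the new-language data.
spm.SentencePieceTrainer.train(
    input="new_lang_sample.txt",
    model_prefix="new_lang_spm",
    vocab_size=4000,
    model_type="bpe",
)

# 2. Merge its pieces into the IndicTrans2 SPM model, skipping any
#    piece that is already present so no duplicates are added.
base = sp_pb2.ModelProto()
with open("indictrans2_spm.model", "rb") as f:
    base.ParseFromString(f.read())
existing = {p.piece for p in base.pieces}

new = sp_pb2.ModelProto()
with open("new_lang_spm.model", "rb") as f:
    new.ParseFromString(f.read())

for p in new.pieces:
    if p.piece not in existing:
        piece = sp_pb2.ModelProto.SentencePiece()
        piece.piece = p.piece
        piece.score = 0.0
        base.pieces.append(piece)

with open("indictrans2_spm_extended.model", "wb") as f:
    f.write(base.SerializeToString())
```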

Next, you need to update the embedding matrices of the pretrained checkpoint accordingly so that everything stays consistent, after which you should be able to fine-tune the updated checkpoint.
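A minimal sketch of that embedding update, assuming the standard fairseq transformer key names and that the new tokens were appended at the end of the dictionaries (so existing token indices are unchanged); counts and file names are placeholders:

```python
import torch

ckpt = torch.load("checkpoint_best.pt", map_location="cpu")
model = ckpt["model"]
num_new_src, num_new_tgt = 128, 0  # tokens appended to dict.src/dict.tgt

def expand(weight: torch.Tensor, num_new: int) -> torch.Tensor:
    """Append rows initialized with the mean of the existing embeddings."""
    if num_new == 0:
        return weight
    new_rows = weight.mean(dim=0, keepdim=True).repeat(num_new, 1)
    return torch.cat([weight, new_rows], dim=0)

model["encoder.embed_tokens.weight"] = expand(
    model["encoder.embed_tokens.weight"], num_new_src
)
model["decoder.embed_tokens.weight"] = expand(
    model["decoder.embed_tokens.weight"], num_new_tgt
)
# Checkpoints with a separate decoder output projection need it
# expanded as well; skip if the key is absent (tied embeddings).
key = "decoder.output_projection.weight"
if key in model:
    model[key] = expand(model[key], num_new_tgt)

torch.save(ckpt, "checkpoint_expanded.pt")
```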

We will add details about how to fine-tune the model on an unseen language pair to the README and provide the necessary helper scripts soon.

avibrantsoul commented 1 year ago

Thanks again! I was able to start training the Bhojpuri to English model, and will now fine-tune IndicTrans2 on an unseen language as well.

Bhanu191 commented 11 months ago

I want to fine-tune the IndicTrans2 model for translating business language (some words should not be changed during English-Hindi translation) and to retrain the model on another dataset together with the Samanantar dataset. Can anyone please help me figure this out?