In order to extend the models to unseen languages, the tokens specific to the new languages, along with their corresponding language tags, need to be added to the existing vocabulary.
This entails expanding the fairseq dictionaries and the embedding matrices of the model so that everything remains consistent, after which the model can be fine-tuned on the unseen languages.
Regarding Bhojpuri, which is written in the Devanagari script and shares linguistic similarities with Hindi and Maithili, it is reasonable to assume that the current vocabulary already provides satisfactory coverage of the language's subwords.
Therefore, a potential quick fix is to substitute any unused tag in the fairseq dictionary (dict.src.txt) with the Bhojpuri language tag, bho_Deva, after which fine-tuning would be possible.
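For illustration, a minimal sketch of that substitution, assuming the dictionary follows fairseq's plain-text "token count" format and that the unused entry is literally named <unused_tag> (a hypothetical placeholder; inspect dict.src.txt to find the actual unused entries):

```python
# Recycle one unused placeholder entry in the fairseq dictionary as the new
# language tag. "<unused_tag>" is a hypothetical name; check dict.src.txt for
# the real unused entries before running this.
dict_path = "dict.src.txt"      # fairseq source dictionary: one "token count" per line
unused_tag = "<unused_tag>"     # hypothetical unused entry to be replaced
new_tag = "bho_Deva"            # Bhojpuri (Devanagari) language tag

with open(dict_path, encoding="utf-8") as f:
    lines = f.readlines()

for i, line in enumerate(lines):
    token, _, count = line.rstrip("\n").partition(" ")
    if token == unused_tag:
        lines[i] = f"{new_tag} {count}\n"   # keep the original count column
        break
else:
    raise ValueError(f"{unused_tag} not found in {dict_path}")

with open(dict_path, "w", encoding="utf-8") as f:
    f.writelines(lines)
```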
To incorporate additional languages while preserving the existing performance on the rest of the languages, techniques such as adapter-tuning can also be considered.
You can check this repository for adapter-based fine-tuning.
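As a rough illustration of the adapter idea (not the exact implementation used in that repository), a bottleneck adapter adds a small trainable residual module on top of each frozen Transformer sub-layer, so only a few parameters are updated per language; a minimal PyTorch sketch with arbitrarily chosen sizes:

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Small residual bottleneck inserted after a frozen Transformer sub-layer."""
    def __init__(self, d_model: int = 1024, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)   # project down
        self.up = nn.Linear(bottleneck, d_model)     # project back up
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the pretrained representation intact.
        return x + self.up(self.act(self.down(x)))

# During adapter-tuning, the pretrained weights stay frozen and only the
# adapter parameters receive gradients, e.g.:
#   for p in model.parameters(): p.requires_grad = False
#   for p in adapter.parameters(): p.requires_grad = True
```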
Thank you for the quick reply! I will try this solution. Sorry to bother you again with silly questions. I am just a beginner in MT. Can you please help me with the following questions as well?
First, train a new SPM model on the data of the unseen languages to obtain their subword tokens (the data preparation pipeline in prepare_data_joint_training.sh can be used as a reference).
Now, update the SPM model provided with IndicTrans2 by adding the new tokens (remember to only add new tokens and ensure no duplicate tokens are being added).
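One way to do this programmatically is to edit the SentencePiece model proto and append the new pieces; a rough sketch, where model.SRC and new_tokens.txt are placeholder file names (check the actual name of the released SPM model and collect the new subwords yourself):

```python
from sentencepiece import sentencepiece_model_pb2 as sp_pb2

spm_path = "model.SRC"              # placeholder: path to the IndicTrans2 SPM model
new_tokens_path = "new_tokens.txt"  # placeholder: one new subword per line

# Load the existing SentencePiece model proto.
m = sp_pb2.ModelProto()
with open(spm_path, "rb") as f:
    m.ParseFromString(f.read())

existing = {p.piece for p in m.pieces}
with open(new_tokens_path, encoding="utf-8") as f:
    new_tokens = [t.strip() for t in f if t.strip()]

# Append only genuinely new pieces so no duplicates are introduced.
for tok in new_tokens:
    if tok in existing:
        continue
    piece = sp_pb2.ModelProto.SentencePiece()
    piece.piece = tok
    piece.score = 0.0
    m.pieces.append(piece)
    existing.add(tok)

with open("model_extended.SRC", "wb") as f:
    f.write(m.SerializeToString())
```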
Update the fairseq dicts in a similar way by appending the unique tokens of the unseen languages to the IndicTrans2 fairseq dicts, and also add the language tags corresponding to the unseen languages (again, only add new tokens and ensure no duplicates are introduced). Next, update the embedding matrices of the pretrained checkpoint accordingly to make everything consistent, after which you should be able to fine-tune the updated checkpoint.
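A rough sketch of the checkpoint-side update, assuming a standard fairseq checkpoint whose vocabulary-dependent matrices sit under keys such as encoder.embed_tokens.weight, decoder.embed_tokens.weight and decoder.output_projection.weight (these key names and the row counts are assumptions; verify them against the actual IndicTrans2 checkpoint, and if the embeddings are tied, grow the shared matrix only once). New rows are initialised here with the mean of the existing embeddings:

```python
import torch

ckpt_path = "checkpoint.pt"   # placeholder: pretrained IndicTrans2 fairseq checkpoint
num_new_src = 128             # number of tokens/tags appended to the source dict
num_new_tgt = 128             # number of tokens/tags appended to the target dict

ckpt = torch.load(ckpt_path, map_location="cpu")
state = ckpt["model"]

def grow(key: str, extra_rows: int) -> None:
    """Append rows to a vocabulary-sized matrix, initialised with the mean embedding."""
    old = state[key]
    new_rows = old.mean(dim=0, keepdim=True).repeat(extra_rows, 1)
    state[key] = torch.cat([old, new_rows], dim=0)

# Key names below are assumptions based on common fairseq Transformer checkpoints;
# inspect state.keys() to confirm which matrices depend on the vocabulary size.
grow("encoder.embed_tokens.weight", num_new_src)
grow("decoder.embed_tokens.weight", num_new_tgt)
grow("decoder.output_projection.weight", num_new_tgt)

torch.save(ckpt, "checkpoint_extended.pt")
```

After saving the extended checkpoint, fine-tuning can proceed with the updated dicts and SPM model, since the vocabulary sizes now match.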
We will add details about how to fine-tune the model on an unseen language pair to the README and also provide the necessary helper scripts soon.
Thanks again! I was able to start training the Bhojpuri to English model, and will now fine-tune IndicTrans2 on an unseen language as well.
I want to fine-tune the IndicTrans2 model for business-domain translation (some words should not be changed during English-Hindi translation) and retrain the model on another dataset together with the Samanantar dataset. Can anyone please help me figure this out?
Can someone tell me how to fine-tune the IndicTrans2 model for Bhojpuri to English translation?