AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License

Translation of Proverbs and Idioms #64

Closed by sofia100 6 months ago

sofia100 commented 6 months ago

eng_Latn: It’s raining cats and dogs
hin_Deva: बिल्लियों और कुत्तों की बारिश हो रही है
eng_Latn: His wardrobe was at sixes and sevens
hin_Deva: उनकी अलमारी छक्कों और सातों पर थी।

For examples like the ones above, the model translates the idiom literally instead of conveying its meaning. To rectify this, we have thought of the following approaches:

What should our approach be to translate proverbs correctly? The documentation on training and fine-tuning is hard to follow. Kindly explain in detail how to proceed.

Thanks in advance.

PranjalChitale commented 6 months ago

The IndicTrans2 model is trained on a general-purpose translation corpus (BPCC), so it might not accurately capture idiomatic expressions or proverbs like the ones you provided. To address this, the best option is to fine-tune the model on data that is specifically representative of such expressions.

Both of the data curation approaches you mentioned are reasonable. Combining them could potentially yield the best results.
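Whichever curation approach is used, the curated pairs eventually need to be stored as line-aligned parallel text. The following is a minimal sketch of that step, not the project's own tooling. The helper `write_parallel_corpus`, the `idiom_exp` output directory, and the `<src>-<tgt>/<split>.<lang>` file layout are assumptions made for illustration and should be checked against the fine-tuning README before use. The two Hindi targets are illustrative meaning-based renderings, not data from BPCC.

```python
from pathlib import Path

# Illustrative pairs only (assumption): a real fine-tuning set needs far more
# examples, ideally with each idiom appearing in several sentence contexts.
IDIOM_PAIRS = [
    ("It's raining cats and dogs.", "मूसलाधार बारिश हो रही है।"),
    ("His wardrobe was at sixes and sevens.", "उसकी अलमारी अस्त-व्यस्त थी।"),
]


def write_parallel_corpus(pairs, out_dir, split="train",
                          src_lang="eng_Latn", tgt_lang="hin_Deva"):
    """Write line-aligned source/target files, one sentence per line.

    Hypothetical helper: the directory layout is an assumption, not a
    confirmed requirement of the IndicTrans2 scripts.
    """
    pair_dir = Path(out_dir) / f"{src_lang}-{tgt_lang}"
    pair_dir.mkdir(parents=True, exist_ok=True)

    # Collapse internal whitespace/newlines so each pair stays on one line,
    # and drop pairs where either side is empty to keep the files aligned.
    cleaned = []
    for src, tgt in pairs:
        src, tgt = " ".join(src.split()), " ".join(tgt.split())
        if src and tgt:
            cleaned.append((src, tgt))

    src_path = pair_dir / f"{split}.{src_lang}"
    tgt_path = pair_dir / f"{split}.{tgt_lang}"
    src_path.write_text("".join(s + "\n" for s, _ in cleaned), encoding="utf-8")
    tgt_path.write_text("".join(t + "\n" for _, t in cleaned), encoding="utf-8")
    return src_path, tgt_path


if __name__ == "__main__":
    src_file, tgt_file = write_parallel_corpus(IDIOM_PAIRS, "idiom_exp/train")
    print(src_file, tgt_file)
```

Keeping both files strictly line-aligned matters more than the exact directory names, since a single dropped line on one side silently mispairs every sentence after it.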

However, please note that identifying whether an input is an idiomatic expression or a standard sentence is itself not trivial.
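To illustrate why this is hard, here is a deliberately naive sketch that flags idioms by exact phrase lookup. The lexicon `IDIOM_LEXICON` and the helpers `normalize` and `find_idioms` are hypothetical and are not part of IndicTrans2. The examples show the two typical failure modes: inflected forms are missed, and literal uses are flagged.

```python
import re

# Tiny hypothetical lexicon (assumption): real coverage would need thousands
# of entries plus their inflected and reordered variants.
IDIOM_LEXICON = {
    "raining cats and dogs",
    "at sixes and sevens",
    "kick the bucket",
}


def normalize(text):
    """Lowercase, unify apostrophes, and keep only word tokens."""
    text = text.lower().replace("\u2019", "'")
    return " ".join(re.findall(r"[a-z']+", text))


def find_idioms(sentence, lexicon=IDIOM_LEXICON):
    """Return lexicon entries that occur as whole-word phrases in the sentence."""
    padded = f" {normalize(sentence)} "
    return sorted(idiom for idiom in lexicon if f" {idiom} " in padded)


if __name__ == "__main__":
    # Exact match: detected.
    print(find_idioms("It\u2019s raining cats and dogs"))  # ['raining cats and dogs']
    # Inflected form ("kicked"): missed by exact lookup.
    print(find_idioms("He kicked the bucket last year"))  # []
    # Literal use: wrongly flagged as idiomatic.
    print(find_idioms("Don't kick the bucket of water over"))  # ['kick the bucket']
```

A lookup like this may still be useful for mining candidate sentences to include in a fine-tuning set, where a human can filter out the false positives, but it is not reliable enough to route inputs at inference time.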

For detailed instructions on fine-tuning the Fairseq model, you can refer to the README.

Additionally, for fine-tuning the HF models, you can find the instructions here.

If you have any specific questions, feel free to post them here and we would be happy to help.

We believe that the current documentation should serve as a helpful starting point. However, if you have suggestions for improving its structure for better comprehensibility, please submit a PR, and we'll consider incorporating your feedback.