AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License

transliteration using indictrans2 #15

Closed Surya291 closed 1 year ago

Surya291 commented 1 year ago

Firstly, a huge shout-out to the team at AI4B for releasing such good-quality models.

I am exploring translation from Kannada to English for various information phrases, one category being names (like Surya, Bhagya).

I have tried both xlit (the transliteration model) and indictransv2 (the translation model); the latter seems to do well and is quite reliable.

Here are two examples:

xlit: ಶ್ರೀ. ಬಿ. ಆರ್. ಗೋವಿಂದಯ್ಯ --> shri. bi. aar. govindyya [does not handle initials consistently]

indictrans2: ಶ್ರೀ. ಬಿ. ಆರ್. ಗೋವಿಂದಯ್ಯ --> Mr. B.R. Govindaiah

But indictrans2, being a translation model, goes wrong when the name also has a dictionary meaning. For example:

indictrans2: ಸೂರ್ಯ --> The Sun [wrong], Surya [expected]

Having said this, here are my queries:

Q1. Is there any reliable way (like giving a prompt phrase) you'd suggest for getting the indictransv2 model to also cover the above case?

Q2. I noticed a toggle button on the demo page labelled "transliteration", but I could not see any change whether I switched it on or off. Can you explain what it is for and how we can trigger it programmatically?

PranjalChitale commented 1 year ago
  1. Since IndicTrans2 is meant to be a Neural Machine Translation (NMT) model trained on bitext data, its primary function is to produce translations of the input text rather than transliterations. While it may occasionally generate transliterations as part of the translation process, this outcome is not guaranteed. Therefore, there is no reliable method to directly leverage the pretrained model to obtain transliterations with certainty.
  2. The purpose of that button in the demo is to enable transliteration-based "Indic input," which utilizes the IndicXlit models in the backend to provide phonetic input and generate corresponding transliterations in the native script. This would serve as input text to utilize the demo of the IndicTrans2 model.
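Since the model itself cannot be steered toward transliteration, one possible workaround is to decide per field, before calling any model, whether the text should be translated or transliterated. The following is a minimal sketch of that routing only, not part of IndicTrans2 or IndicXlit: `convert_field`, `is_name`, and the two stand-in backends are hypothetical names, and the caller is assumed to already know which fields hold names (for example from a form schema).

```python
from typing import Callable


def convert_field(
    text: str,
    is_name: bool,
    translate: Callable[[str], str],
    transliterate: Callable[[str], str],
) -> str:
    """Send names to a transliteration backend, everything else to translation."""
    return transliterate(text) if is_name else translate(text)


# Stand-in backends for illustration only (hypothetical, no real model calls).
# In practice these would wrap whatever translation and transliteration
# services are actually available.
def fake_translate(text: str) -> str:
    return {"ಸೂರ್ಯ": "The Sun"}.get(text, text)


def fake_transliterate(text: str) -> str:
    return {"ಸೂರ್ಯ": "Surya"}.get(text, text)


name_output = convert_field("ಸೂರ್ಯ", True, fake_translate, fake_transliterate)
text_output = convert_field("ಸೂರ್ಯ", False, fake_translate, fake_transliterate)
```

With these stand-ins, the same input yields the transliterated form when the field is flagged as a name and the translated form otherwise. The routing decision stays outside the models, so neither one needs a special prompt.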
Surya291 commented 1 year ago

Thanks for your response. I am also trying to use the IndicTrans2 model for translation right after an OCR stage, which is prone to typos. Is there any other service developed by AI4B that tackles this (i.e., spell correction for Indic languages), or do you have any suggestions? If there is a better forum for such questions, please let me know.