AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License

For Odia translations, the model generates ଯ଼ in its output, which is not a letter in the Odia alphabet. #82

Sab8605 closed this issue 2 months ago

Sab8605 commented 2 months ago

For Odia translations, the model generates ଯ଼ in its output, which is not a letter in the Odia alphabet. I have tried all the available models: both Distilled versions as well as both Original versions. How can I solve this?

prajdabre commented 2 months ago

This is not the right way to open an issue.

You should:

  1. Specify your use case.
  2. Share your code snippet.
  3. Share an example input and the desired output.

Sab8605 commented 2 months ago

I understand the need for clarity when opening an issue. Here are the details you requested:

  1. I am using the IndicTrans2 model to translate from English to Odia.
  2. Example: in some translations the model generates ଯ଼ in its output, which does not exist in the Odia script. ଯ with a dot under it is not an Odia character and should be replaced with ୟ.

Input: Local media reports an airport fire vehicle rolled over while responding.
Output: ସ୍ଥାନୀଯ଼ ଗଣମାଧ୍ଯ଼ମ ରିପୋର୍ଟ କରିଛି ଯେ ପ୍ରତିକ୍ରିଯ଼ା ଦେବା ସମଯ଼ରେ ଏକ ବିମାନ ବନ୍ଦର ଅଗ୍ନିଶମ ଗାଡି ଓଲଟି ପଡ଼ିଥିଲା।

Input: British newspaper The Guardian suggested Deutsche Bank controlled roughly a third of the 1200 shell companies used to accomplish this.
Output: ବ୍ରିଟିଶ ଖବରକାଗଜ ଦି ଗାର୍ଡିଆନ୍ ପରାମର୍ଶ ଦେଇଛି ଯେ ଡଏଚ୍ ବ୍ଯ଼ାଙ୍କ 1200 ଟି ନକଲି କମ୍ପାନୀ ମଧ୍ଯ଼ରୁ ପ୍ରାଯ଼ ଏକ ତୃତୀଯ଼ାଂଶକୁ ନିଯ଼ନ୍ତ୍ରଣ କରିଥିଲା।
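For anyone trying to confirm the same symptom, here is a minimal sketch (not part of the original report) that lists the code points in a string and counts the problematic sequence. It assumes the unwanted character is the two-code-point sequence U+0B2F (ଯ) followed by U+0B3C (nukta sign), and that the desired letter is the single code point U+0B5F (ୟ); the helper names `describe` and `count_ya_nukta` are hypothetical.

```python
import unicodedata

# Assumption: the sequence reported in this issue is YA (U+0B2F) + nukta (U+0B3C).
YA_NUKTA = "\u0b2f\u0b3c"


def describe(text):
    """Return (character, code point, Unicode name) for every character in text."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


def count_ya_nukta(text):
    """Count how often the YA + nukta sequence occurs in text."""
    return text.count(YA_NUKTA)


if __name__ == "__main__":
    sample = "\u0b38\u0b2e\u0b2f\u0b3c"  # illustrative fragment containing the sequence
    for row in describe(sample):
        print(row)
    print("occurrences:", count_ya_nukta(sample))
```

Running `count_ya_nukta` over a batch of model outputs should give a quick indication of whether a given pipeline is affected.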

PranjalChitale commented 2 months ago

Please check the following commit; this has been resolved if you use our inference pipeline.

For a short-term fix, please make the necessary changes on your end.

For a permanent solution, the Unicode transliterator for Oriya in the IndicNLP library needs to be debugged, or a similar workaround can be implemented there as well.

Feel free to open a PR.

Sab8605 commented 2 months ago

Thanks for the quick response.

I tried these changes, but they did not solve the issue for me, because the replacement runs before the transliterator. It does work when I apply it after the transliterator. Please find the code below:

    # Excerpt from the post-processing method. lang_code is not defined in this
    # snippet; it is assumed to be set earlier in the method.
    if lang == "eng_Latn":
        for sent in sents:
            postprocessed_sents.append(self.en_detok.detokenize(sent.split(" ")))
    else:
        for sent in sents:
            outstr = indic_detokenize.trivial_detokenize(
                self.xliterator.transliterate(sent, flores_codes[common_lang], flores_codes[lang]),
                flores_codes[lang],
            )
            # Oriya bug: indic-nlp-library produces ଯ଼ instead of ୟ when converting from Devanagari to Odia
            # TODO: Find out what's the issue with unicode transliterator for Oriya and fix it
            if lang_code == "ory":
                outstr = outstr.replace("ଯ଼", "ୟ")
            postprocessed_sents.append(outstr)

    return postprocessed_sents

Thanks once again for the solution.
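
The ordering described in this thread (transliterate first, then replace) can also be sketched as a standalone function, independent of the IndicTrans2 classes. This is only an illustration under stated assumptions: `fix_odia_yya` and `postprocess_sketch` are hypothetical names, the `transliterate` argument is a stand-in callable for whatever transliterator is in use, and the language tag is assumed to follow the `ory_Orya` style of code, with the part before the underscore being the language.

```python
YA_NUKTA = "\u0b2f\u0b3c"  # ଯ followed by the nukta sign (the unwanted sequence)
YYA = "\u0b5f"             # ୟ (the intended letter)


def fix_odia_yya(text):
    """Replace every YA + nukta sequence with the single letter YYA."""
    return text.replace(YA_NUKTA, YYA)


def postprocess_sketch(sents, lang, transliterate):
    """Transliterate each sentence, then apply the Odia fix afterwards.

    `transliterate` is a hypothetical callable taking and returning a string.
    The fix is applied after it because the unwanted sequence only appears
    in the transliterated output.
    """
    lang_code = lang.split("_")[0]  # assumption: tags look like "ory_Orya"
    results = []
    for sent in sents:
        out = transliterate(sent)
        if lang_code == "ory":
            out = fix_odia_yya(out)
        results.append(out)
    return results


if __name__ == "__main__":
    # Fake transliterator that always emits the problematic sequence.
    fake = lambda s: "\u0b38\u0b2e" + YA_NUKTA
    print(postprocess_sketch(["placeholder"], "ory_Orya", fake))
```

Keeping the replacement in a small helper makes it easy to remove again if the transliterator itself is fixed upstream.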