AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License

Numerals Not Translated Correctly in IndicTrans2 #83

Closed TejMakode1523 closed 2 months ago

TejMakode1523 commented 2 months ago

The IndicTrans2 model does not correctly translate numerals from one Indian language to another. When translating text that includes numerals, the numerals remain in the source language rather than being translated into the target language's numeral system.

Steps to Reproduce

  1. Use the IndicTrans2 model to translate a sentence that includes numerals from Hindi to Marathi.
  2. Example sentence: "मुझे 123 सेब चाहिए।" (Hindi) -> Expected: "मला १२३ सफरचंद पाहिजे." (Marathi)
  3. Actual Output: "मला 123 सफरचंद पाहिजे." (Marathi)

Expected Behavior

Numerals should be translated into the target language's numeral system. For example, in the case of Hindi to Marathi translation:

  • Devanagari numerals: 1 -> '१', 2 -> '२', 3 -> '३'
  • Example: "मुझे 123 सेब चाहिए।" should be translated to "मला १२३ सफरचंद पाहिजे."

Actual Behavior

The numerals remain in the source format (123) instead of being converted to the target language format (१२३).

Environment

Additional Context

This issue affects the readability and correctness of translations in documents where numerals play a significant role, such as legal, educational, and technical documents.

Suggested Solution

Implement a numeral translation mapping within the model to handle the conversion of numerals from the source language to the target language's numeral system.

Thank you for your attention to this issue. Please let me know if any additional information or examples are needed.

Best regards, [Tejas Makode]

jaygala24 commented 2 months ago

Hi @TejMakode1523

Thanks for reaching out and describing your issue in detail. Yes, we normalize Indic numerals to English numerals while pre-processing the input text, so the model only ever operates on English numerals. I would like to highlight that this behavior of the IndicTrans2 model outputs is by design.

You can easily modify this behavior during the post-processing of the translation outputs. A simple approach would be to create a dictionary that maps English numerals to their respective Indic numerals and you can easily transform English numerals to Indic numerals by string manipulation operations.
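A minimal sketch of that post-processing step for a Devanagari target such as Marathi. The helper name `to_devanagari_digits` is hypothetical and not part of IndicTrans2; it only relies on Python's built-in `str.maketrans` and `str.translate`.

```python
# Hypothetical post-processing helper; not part of the IndicTrans2 codebase.
# Maps the ten English (ASCII) digits to their Devanagari counterparts.
DEVANAGARI_DIGITS = "०१२३४५६७८९"
EN_TO_DEVANAGARI = str.maketrans("0123456789", DEVANAGARI_DIGITS)


def to_devanagari_digits(text: str) -> str:
    """Replace every ASCII digit in `text` with the matching Devanagari digit."""
    return text.translate(EN_TO_DEVANAGARI)


# Applied to the model output from the report above.
print(to_devanagari_digits("मला 123 सफरचंद पाहिजे."))  # मला १२३ सफरचंद पाहिजे.
```

Note that this converts every ASCII digit in the string, including digits inside URLs, product codes, or identifiers, so it may need to be restricted to the spans where conversion is actually wanted.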

I hope this helps you.

TejMakode1523 commented 2 months ago

Thank you for your prompt response and clarification regarding the behavior of the IndicTrans2 model with numerals. I appreciate your explanation that the normalization of Indic numerals to English numerals during pre-processing is intentional.

Your suggestion to handle numeral translation during post-processing using a dictionary mapping English numerals to Indic numerals sounds like a practical solution. I will implement this approach and test its effectiveness in transforming numeral outputs as needed.

PranjalChitale commented 2 months ago

You can find the mapping which we had used to normalize Indic numerals to English numerals here.

You will need to invert this mapping, divide it into language- or script-specific mappings, and use the appropriate one based on the target language / script during postprocessing.
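The inversion and per-script split could be sketched roughly as follows. Everything here is an assumption made for illustration: `INDIC_TO_EN` is a stand-in built from Unicode digit blocks and not the project's actual mapping, `TARGET_SCRIPT` assumes language tags of the form `mar_Deva`, and the function names are hypothetical.

```python
import unicodedata
from collections import defaultdict

# Stand-in for the normalization mapping (Indic digit -> English digit).
# Assumption: built from Unicode digit blocks; the real mapping may differ.
DIGIT_ZERO = {"DEVANAGARI": 0x0966, "BENGALI": 0x09E6, "TAMIL": 0x0BE6}
INDIC_TO_EN = {
    chr(zero + d): str(d) for zero in DIGIT_ZERO.values() for d in range(10)
}

# Assumption: target languages are identified by tags such as "mar_Deva".
TARGET_SCRIPT = {
    "hin_Deva": "DEVANAGARI",
    "mar_Deva": "DEVANAGARI",
    "ben_Beng": "BENGALI",
    "tam_Taml": "TAMIL",
}


def split_and_invert(indic_to_en):
    """Group a flat Indic->English digit mapping by script and invert each group."""
    per_script = defaultdict(dict)
    for indic, en in indic_to_en.items():
        # Unicode names look like "DEVANAGARI DIGIT ONE"; keep the script part.
        script = unicodedata.name(indic).split(" DIGIT ")[0]
        per_script[script][en] = indic
    # One translation table (English digit -> Indic digit) per script.
    return {script: str.maketrans(m) for script, m in per_script.items()}


TABLES = split_and_invert(INDIC_TO_EN)


def localize_digits(text, tgt_lang):
    """Convert English digits to the target script's digits, if a table exists."""
    table = TABLES.get(TARGET_SCRIPT.get(tgt_lang))
    return text.translate(table) if table else text


print(localize_digits("मला 123 सफरचंद पाहिजे.", "mar_Deva"))
```

Languages without an entry in `TARGET_SCRIPT` pass through unchanged, which keeps the default behavior of English numerals for any target that is not explicitly configured.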