AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License

Translations are incorrect when the source contains different number formats. #87

Closed Sab8605 closed 1 month ago

Sab8605 commented 1 month ago

I have set up the models and am using the En-to-Indic model for translation by following the README file. I observed some issues with numbers.

Issues with Numerical Handling in English-to-Indic Translations:

Missing Numbers in Translation: The model fails to correctly output numerical values in translations. For instance, in the sentence "A reduction of 20% from the existing liability of Rs. 1,87,500," the Hindi translation is: "रुपये की मौजूदा देनदारी से 20 प्रतिशत की कमी।" The numerical value is missing in the output. Upon debugging, it was found that the model itself does not return the token, resulting in a translated output: ['▁रुपये ▁की ▁मौजूदा ▁देन दारी ▁से ▁20 ▁प्रतिशत ▁की ▁कमी ▁।'].
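One way to catch dropped numbers like this automatically is to compare the numbers in the source with those in the output. The following is a minimal sketch, not part of IndicTrans2; the `missing_numbers` helper and its regex are assumptions for illustration, and it will report false positives if the model legitimately reformats a number or writes it in Devanagari digits.

```python
import re

# Digit runs with optional comma/period groupings, e.g. 20, 1,87,500, 910.74.
# This pattern is an illustrative assumption, not the pipeline's own regex.
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def missing_numbers(source: str, translation: str) -> list[str]:
    """Return numbers found in the source that do not appear verbatim in the translation."""
    translated = set(NUMBER_RE.findall(translation))
    return [n for n in NUMBER_RE.findall(source) if n not in translated]


source = "A reduction of 20% from the existing liability of Rs. 1,87,500,"
output = "रुपये की मौजूदा देनदारी से 20 प्रतिशत की कमी।"
print(missing_numbers(source, output))
```

A non-empty result could be used to flag a sentence for review or to retry it without the preprocessing step.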

Incorrect Sentence Splitting with Numerical Values: The sentence splitter breaks sentences incorrectly when "Rs." is followed by a space and a number. For example, in the sentence "The company reported a revenue increase from Rs. 12,34,567 in 2019-20 to Rs. 56,78,910 in 2020-21," preprocessing results in: ['The company reported a revenue increase from Rs.', '12,34,567 in 2019-20 to Rs.', '56,78,910 in 2020-21.']. This results in the Hindi output: 'कंपनी ने रुपये से राजस्व वृद्धि की सूचना दी। 12,34,567 से 2019-20 में रु। 2020-21 में 56,78,910।' This incorrect splitting affects the translation quality.
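A possible workaround is to re-join fragments after splitting, whenever a fragment ends in a known abbreviation and the next one starts with a digit. This is only a sketch under that assumption; `merge_abbreviation_splits` and the `ABBREVIATIONS` tuple are hypothetical names, not part of the inference pipeline or of any sentence-splitting library.

```python
# Abbreviations that should not end a sentence when a number follows.
# The list is an assumption for illustration; extend it as needed.
ABBREVIATIONS = ("Rs.",)


def merge_abbreviation_splits(sentences: list[str]) -> list[str]:
    """Re-join fragments that were split right after an abbreviation such as 'Rs.'."""
    merged: list[str] = []
    for sentence in sentences:
        # Merge only when the previous fragment ends in an abbreviation
        # and the current fragment begins with a digit.
        if merged and merged[-1].endswith(ABBREVIATIONS) and sentence[:1].isdigit():
            merged[-1] = merged[-1] + " " + sentence
        else:
            merged.append(sentence)
    return merged


fragments = [
    "The company reported a revenue increase from Rs.",
    "12,34,567 in 2019-20 to Rs.",
    "56,78,910 in 2020-21.",
]
print(merge_abbreviation_splits(fragments))
```

Applied to the fragments above, this yields the original sentence as a single item, which can then be passed to the model unsplit.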

Regex Pattern Limitations: The regex pattern defined for handling numbers does not correctly process certain number formats. For example, in the sentence "The company reported a revenue increase from 12,34,567.74 in 2019-20 to Rs. 56,78,910.74 in 2020-21," preprocessing yields: ['The company reported a revenue increase from 12,34 , in to Rs.', '56,78 , in .']. The resulting translation is: 'कंपनी ने राजस्व में वृद्धि दर्ज की जो 12,34,567.74 से 2019-20 में रु। 56, 78, 2020-21 में 910.74.'
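For reference, a pattern along the following lines keeps Indian-style grouped numbers with decimals together as one match. It is a sketch, not the regex used in the repository; the `<NUM1>`-style placeholder tokens and the `protect_numbers` / `restore_numbers` helpers are hypothetical and only illustrate the placeholder round trip.

```python
import re

# First alternative: comma-grouped numbers (12,34,567.74 or 1,234,567) with optional decimals.
# Second alternative: plain integers or decimals (2019, 910.74).
INDIAN_NUMBER_RE = re.compile(r"\d{1,3}(?:,\d{2,3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def protect_numbers(text: str) -> tuple[str, list[str]]:
    """Replace each number with a placeholder and return the masked text plus the numbers."""
    numbers: list[str] = []

    def _swap(match: re.Match) -> str:
        numbers.append(match.group(0))
        return f"<NUM{len(numbers)}>"

    return INDIAN_NUMBER_RE.sub(_swap, text), numbers


def restore_numbers(text: str, numbers: list[str]) -> str:
    """Put the original numbers back in place of their placeholders."""
    for i, number in enumerate(numbers, start=1):
        text = text.replace(f"<NUM{i}>", number)
    return text


sentence = ("The company reported a revenue increase from 12,34,567.74 "
            "in 2019-20 to Rs. 56,78,910.74 in 2020-21,")
masked, found = protect_numbers(sentence)
print(masked)
print(found)
```

Note that a range such as 2019-20 is treated here as two separate numbers joined by a hyphen; whether the model preserves such placeholders is a separate question.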

Extra Spaces in Numerical Values: The model is generating extra spaces in numerical values. For instance, in the sentence "when the highest basic pay in the government was only Rs. 30,000 per month," the translation is: "जब सरकार में सबसे अधिक मूल वेतन केवल रु। 30, 000 प्रति माह।" The inclusion of an extra space in the number "30, 000" affects the translation quality.

Thank you.

PranjalChitale commented 1 month ago

The inference pipeline is designed to be broad-spectrum, handling texts from a wide array of domains. However, it is not foolproof.

This regex-placeholder method is applied post-hoc as we found it effective in most cases through empirical testing.

Note that the models weren't specifically trained to retain these placeholders and you can go ahead and fine-tune the models to do so.

Sentence splitting is performed using the best open-source libraries available.

The regex pattern was developed by analyzing encountered cases and covers most general-purpose use cases.

If you have any recommendations for other libraries or improved regex patterns, please let us know.

Additionally, you can choose to bypass the inference pipeline when sentence splitting is not necessary (if you are confident about the sequence length).

Below are the results when using the Fairseq model without the inference pipeline.

मौजूदा 1,87,500 रुपये की देनदारी से 20% की कमी।
कंपनी का राजस्व 2019-20 में 12,34,567 रुपये से बढ़कर 2020-21 में 56,78,910 रुपये हो गया।
जब सरकार में सबसे अधिक मूल वेतन केवल 30,000 रुपये प्रति माह था।

Sab8605 commented 1 month ago

Thank you so much for your detailed response and for explaining the current approach and its limitations.

Could you kindly guide me on how these results were generated? Were they produced using the model.batch_translate() function?

Thank you once again for your support and assistance.

PranjalChitale commented 1 month ago

These were generated using joint_translate.sh, but batch_translate can also be modified to disable the regex-based preprocessing and sentence splitting.

Sab8605 commented 1 month ago

Thank you for the prompt response.

I have also tested joint_translate.sh on several examples and noticed that it occasionally inserts extra spaces within numbers generated by the model. The issue is intermittent.

For example, in the translation from English to Hindi:

with an investment of 2,560 crores. --> 2, 560 करोड़ के निवेश के साथ।
an amount of 15,000 crores will be made available. --> 15, 000 करोड़ रुपये की राशि उपलब्ध कराई जाएगी।
Central assistance of 5,300 crore will be given. --> 5, 300 करोड़ रुपये की केंद्रीय सहायता दी जाएगी।
79,000 crores. --> 79, 000 करोड़ रु.
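As a stopgap, the stray space could be removed in a post-processing step. The sketch below is an assumption on my side, not something the repository provides: the hypothetical `collapse_number_spaces` helper removes a space after a comma only when what follows looks like the rest of a grouped number, that is, optional two-digit groups ending in a three-digit group. It can still misfire on a genuine list such as "2, 560", so it should be used with care.

```python
import re

# Match ", " between digits only when the remainder looks like a grouped number:
# zero or more two-digit groups, then a final three-digit group.
NUMBER_SPACE_RE = re.compile(r"(?<=\d), (?=(?:\d{2}, ?)*\d{3}(?!\d))")


def collapse_number_spaces(text: str) -> str:
    """Turn '15, 000' into '15,000' and '1, 87, 500' into '1,87,500'."""
    return NUMBER_SPACE_RE.sub(",", text)


print(collapse_number_spaces("15, 000 करोड़ रुपये की राशि उपलब्ध कराई जाएगी।"))
print(collapse_number_spaces("12, 14 and 16"))  # left unchanged: no three-digit group follows
```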