AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License
217 stars 59 forks

Numbers getting changed after translation #24

Open AM-ash-OR-AM-I opened 11 months ago

AM-ash-OR-AM-I commented 11 months ago

I've deployed the model, and during inference I get:

{
"text":"*Apply Euclid's division algorithm to determine the Highest Common Factor (HCF) of $231$ and $396$.\n\n",
"translated_text":" * ಯುಕ್ಲಿಡ್ನ ಡಿವಿಷನ್ ಅಲ್ಗಾರಿದಮ್ಅನ್ನು ಅನ್ವಯಿಸಿ, ಅತಿ ಹೆಚ್ಚು ಸಾಮಾನ್ಯ ಅಂಶವನ್ನು (ಎಚ್ಸಿಎಫ್) ನಿರ್ಧರಿಸಲು $239 ಮತ್ತು $396."
}

231 -> 239. The issue seems to occur only when the number is wrapped in $; otherwise the numbers come through fine. What's the reason for this, and is there a possible solution?
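As a user-side workaround (this is only a sketch of the idea, not part of IndicTrans2), one could pull the LaTeX-style `$...$` spans out of the text before translation and restore them afterwards, so the model never sees (and cannot corrupt) the enclosed numbers. The `<MATHi>` token names here are an assumption, and this still relies on the model retaining those tokens:

```python
import re

# Hypothetical workaround: mask $...$ math spans, translate, then unmask.
MATH_SPAN = re.compile(r"\$[^$]*\$")

def extract_math(text):
    """Replace each $...$ span with a <MATHi> token; return masked text and spans."""
    spans = MATH_SPAN.findall(text)
    masked = text
    for i, span in enumerate(spans):
        masked = masked.replace(span, f"<MATH{i}>", 1)
    return masked, spans

def restore_math(translated, spans):
    """Put the original $...$ spans back in place of the <MATHi> tokens."""
    for i, span in enumerate(spans):
        translated = translated.replace(f"<MATH{i}>", span, 1)
    return translated

masked, spans = extract_math("Determine the HCF of $231$ and $396$.")
# masked: "Determine the HCF of <MATH0> and <MATH1>."
# After translating `masked`, call restore_math() on the model output.
```

Note the caveat discussed later in this thread: the model is not guaranteed to retain opaque placeholder tokens either, so the output should be checked before unmasking.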

PranjalChitale commented 11 months ago

You can use our inference pipeline which should handle these cases. You can follow the steps described here.

We tried the same example on our demo and it worked fine, the numbers were preserved.

GokulNC commented 11 months ago

I just tried for the following sentence on the demo page:

India's foreign exchange reserves increased by USD $1.153 billion to USD $585.895 billion for the week ending October 13, reversing a trend of multiple weeks of decline.

It translated to Hindi as:

13 अक्टूबर को समाप्त सप्ताह के लिए भारत का विदेशी मुद्रा भंडार अमेरिकी डॉलर 1 बिलियन से बढ़कर अमेरिकी डॉलर 2 बिलियन हो गया, जो कई हफ्तों की गिरावट की प्रवृत्ति को उलट देता है।

(Back-translation: "... increased from USD 1 billion to USD 2 billion ..." — both floating-point values, 1.153 and 585.895, were mangled.)

Is it handled for floating point cases as well? Thanks!

jsk1808 commented 10 months ago

I'm facing the same problem. The model is hallucinating numbers. Any updates on how to fix that?

PranjalChitale commented 5 months ago

General comment about the numeral issue.

In some cases, we do observe that the placeholder-based approach in the inference engine can produce suboptimal results for inputs involving numerals: the model hallucinates a number in place of the placeholder identifier instead of retaining the placeholder, as seen in the example in this comment.

You can consider removing the numeral pattern and letting the model handle the numerals on its own, to avoid these placeholder-induced hallucinations.
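The suggestion above can be sketched as follows. This is illustrative only — the pattern names and regexes are simplified assumptions, not the actual ones in `inference/normalize_regex_inference.py`: non-numeric entities (URLs, emails) still get `<IDk>` placeholders, while numerals are deliberately left in the text for the model to handle:

```python
import re

# Simplified stand-ins for the inference-side entity patterns (assumptions).
URL = re.compile(r"https?://\S+")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
NUMERAL = re.compile(r"\b\d[\d,.]*\b")  # defined but intentionally NOT applied

def wrap_placeholders(text):
    """Wrap URLs/emails in <IDk> placeholders; skip the numeral pattern."""
    mapping = {}
    idx = 1
    for pattern in (URL, EMAIL):  # NUMERAL is deliberately excluded here
        for match in pattern.findall(text):
            tag = f"<ID{idx}>"
            text = text.replace(match, tag, 1)
            mapping[tag] = match
            idx += 1
    return text, mapping

masked, mapping = wrap_placeholders(
    "See https://ai4bharat.iitm.ac.in/indic-trans2 for 231 examples."
)
# "231" stays in the text; only the URL becomes <ID1>.
```

The trade-off is that the model must then translate or transliterate digits itself, which is usually safer than risking a hallucinated placeholder identifier.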

GokulNC commented 5 months ago

I see the following difference between training code and inference code:

During training / finetuning, the placeholder being used is <dnt> do_not_translate_this </dnt>.
Ref: https://github.com/AI4Bharat/IndicTrans2/blob/main/scripts/normalize_regex.py

But during inference, a different tag is being used altogether: <ID1>, <ID2>, etc.
Ref: https://github.com/AI4Bharat/IndicTrans2/blob/main/inference/normalize_regex_inference.py

Why is this the case? Doesn't this mean that the model isn't primed explicitly to retain the <ID> placeholders, and hence the root cause of the above issue?

Shouldn't we be using <dnt> during inference as well?

Please correct me if I am wrong somewhere. Thanks!
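For concreteness, the mismatch between the two scripts can be sketched like this (regexes and helper names are simplified assumptions; the real patterns live in the two files linked above). Training marks a span do-not-translate but keeps its content, while inference replaces the span with an opaque numbered tag:

```python
def train_style(text, span):
    # normalize_regex.py convention: keep the span, mark it do-not-translate
    return text.replace(span, f"<dnt> {span} </dnt>")

def infer_style(text, span, k=1):
    # normalize_regex_inference.py convention: replace the span with an ID tag
    return text.replace(span, f"<ID{k}>")

s = "HCF of 231 and 396"
print(train_style(s, "231"))  # HCF of <dnt> 231 </dnt> and 396
print(infer_style(s, "231"))  # HCF of <ID1> and 396
```

In the training form the model still sees the number; in the inference form it sees only `<ID1>` and must echo it back verbatim, which is exactly the ability the question above suggests it was never explicitly trained for.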

PranjalChitale commented 5 months ago

Yes, we used the dnt-based approach during training. However, we apply a final stage of fine-tuning on BPCC-seed data, which does not contain much representation of such cases, so the model loses some of its ability to work with tags. In the broader scheme of things, we chose improved translation quality over preserving this ability. Since the dnt approach doesn't work well with the final models, we switched to the placeholder-based approach, which is observed to be very effective in most cases; apart from numbers, we don't observe hallucinations in any other case.

Doing away with the numeral pattern might be a fix, but this needs to be extensively tested.

PranjalChitale commented 5 months ago

Why is this the case? Doesn't this mean that the model isn't primed explicitly to retain the placeholders, and hence the root cause of the above issue?

Yes, you are correct.

We don't explicitly use these ID tags during training; this was based on the empirical observation that the ID tags are preserved by the model in most cases, though this cannot be 100% guaranteed.
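Since retention "cannot be 100% guaranteed", a defensive post-check can catch the failure mode before unwrapping. This is a sketch of our own (not part of the official pipeline): verify that every `<IDk>` tag in the masked input also appears in the model output, and fall back (e.g., re-translate without placeholders) when it doesn't:

```python
import re

ID_TAG = re.compile(r"<ID\d+>")

def placeholders_retained(masked_input, model_output):
    """True iff every <IDk> tag present in the input also appears in the output."""
    return set(ID_TAG.findall(masked_input)) <= set(ID_TAG.findall(model_output))

# Retained: safe to substitute the original entities back in.
assert placeholders_retained("HCF of <ID1> and <ID2>", "... <ID1> ... <ID2> ...")

# Hallucinated: <ID2> was replaced by a number, so trigger a fallback path.
assert not placeholders_retained("HCF of <ID1> and <ID2>", "... <ID1> ... 239 ...")
```

This won't catch a hallucinated digit that replaces raw (unmasked) numerals, but it reliably detects dropped or mangled placeholder tags.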

GokulNC commented 5 months ago

Cool, thanks! Will finetune with IDs instead of dnt tags.


Also, just FYI: as you said above, the model does not seem to work well with dnt tags during inference either:

Input: Movie fans were much more positive, according to ratings on <dnt> Amazon.com </dnt>.

Output from IndicTrans: अमेज़न. कॉम पर रेटिंग के अनुसार, फिल्म के प्रशंसक बहुत अधिक सकारात्मक थे। (Back-translation: "According to ratings on Amazon. com, movie fans were much more positive.")

Although it ignores the <dnt> tags, it does not retain the phrase inside them as-is; "Amazon.com" was transliterated into Devanagari rather than kept verbatim.