AI4Bharat / IndicTrans2

Translation models for 22 scheduled languages of India
https://ai4bharat.iitm.ac.in/indic-trans2
MIT License
214 stars 59 forks

Regarding do-not-translate tags #57

Closed: pr509 closed this issue 5 months ago

pr509 commented 5 months ago

Can you please explain how I can use the do-not-translate tags while translating?

PranjalChitale commented 5 months ago

If there is text you don't want translated, wrap it in DNT tags like this: <dnt> text </dnt>
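As a minimal sketch of the wrapping step, the helper below tags every occurrence of a few protected terms before the sentence is sent for translation. The function name `wrap_dnt` and the term list are hypothetical and not part of the IndicTrans2 code base; only the `<dnt> ... </dnt>` format comes from the description above.

```python
import re

DNT_OPEN, DNT_CLOSE = "<dnt>", "</dnt>"


def wrap_dnt(sentence: str, protected_terms: list[str]) -> str:
    """Wrap each occurrence of the given terms in <dnt> ... </dnt> tags."""
    # Drop empty strings and try longer terms first, so that a longer term
    # wins over a shorter term that is a prefix of it.
    terms = sorted({t for t in protected_terms if t}, key=len, reverse=True)
    if not terms:
        return sentence
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    return pattern.sub(lambda m: f"{DNT_OPEN} {m.group(0)} {DNT_CLOSE}", sentence)


print(wrap_dnt("Install IndicTrans2 from GitHub.", ["IndicTrans2", "GitHub"]))
```

The tagged sentence is then passed to the model in place of the raw sentence.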

Please note that this may not always work as expected. We observed that after the final stage of fine-tuning on BPCC-seed data, the model partly loses its ability to preserve the wrapped text, because DNT cases are under-represented in the BPCC-seed data. On balance, we chose improved translation quality over preserving this ability.
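Since preservation is not guaranteed, it may help to check each output. The sketch below assumes you have the tagged source sentence and the model's translation as plain strings; `lost_dnt_spans` is a hypothetical helper, not something the repository provides.

```python
import re

# Captures the text between <dnt> and </dnt>, ignoring the padding spaces.
DNT_SPAN = re.compile(r"<dnt>\s*(.*?)\s*</dnt>", re.DOTALL)


def lost_dnt_spans(tagged_source: str, translation: str) -> list[str]:
    """Return the protected source spans that do not appear verbatim in the translation."""
    protected = DNT_SPAN.findall(tagged_source)
    return [span for span in protected if span not in translation]


source = "Visit <dnt> https://example.com </dnt> today"
print(lost_dnt_spans(source, "translated text without the link"))
```

An empty list means every wrapped span survived. Any entries returned are spans the model altered or dropped.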

However, if you include data with such tags in your fine-tuning data mix, it should be possible to improve on this front and preserve text wrapped in DNT tags. We observed that this approach works in our experiments.

pr509 commented 5 months ago

For this to work, do we need to make any changes to engine.py? Do we need to add the "preprocess_line" function from preprocess_translate.py to engine.py, or is it already included?

PranjalChitale commented 5 months ago

As I mentioned, the final IndicTrans2 models do not work well with this DNT-based approach, which is why we switched to an alternative approach in the current implementation of engine.py.

In general, if you want to use the DNT approach, please make sure to demarcate such cases in your training data. The current scripts wrap URLs, emails, and numbers with DNT tags. You can add your own regex patterns to cover any other cases you feel should not be translated. Please see the implementation here.
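A rough sketch of regex-based wrapping with room for custom patterns is shown below. The patterns and the `wrap_with_dnt` function are illustrative assumptions. They are not the expressions used in the repository's scripts, which should be consulted for the actual rules.

```python
import re

# Illustrative patterns only; the repository's scripts define their own.
DEFAULT_PATTERNS = [
    r"https?://\S+",                   # URLs
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",   # email addresses
    r"\d+(?:[.,]\d+)*",                # numbers, optionally with separators
]


def wrap_with_dnt(text: str, extra_patterns: tuple[str, ...] = ()) -> str:
    """Wrap every match of the default and custom patterns in <dnt> tags."""
    # Custom patterns go first so they take priority over the defaults
    # when several patterns could match at the same position.
    patterns = (*extra_patterns, *DEFAULT_PATTERNS)
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    return combined.sub(lambda m: f"<dnt> {m.group(0)} </dnt>", text)


print(wrap_with_dnt("Mail admin@example.com or call 12345"))
print(wrap_with_dnt("Use the --verbose flag", extra_patterns=(r"--\w+",)))
```

The same function would need to be applied to both the fine-tuning data and the inference inputs, so that the model sees the tags in the form it was trained on.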

Accordingly, if you are using this approach and plan to run the inference engine, you need to make the corresponding change in inference/normalize_regex_inference.py: add the modified function there and use it.