Closed pr509 closed 5 months ago
If there is text you do not want translated, wrap it in these tags:
<dnt> text </dnt>
Please note that this may not always work as expected. We observed that after the final stage of fine-tuning on BPCC-seed data, the model partly loses its ability to preserve such text, because DNT cases are under-represented in the BPCC-seed data. On balance, we chose improved translation quality over preserving this ability.
However, if you include data with such tags in your fine-tuning data mix, it should be possible to improve on this front and preserve text wrapped in DNT tags; we observed that this approach works in our experiments.
As I mentioned, the final IndicTrans2 models do not work well with the DNT-based approach, which is why we switched to an alternate approach in the current implementation of engine.py.
In general, if you want to use the DNT approach, please make sure to demarcate such cases in your training data. The current scripts wrap URLs, emails, and numbers with DNT tags; you can add custom regex patterns to cover any other cases you feel should not be translated. Please see the implementation here.
Accordingly, if you use this approach and plan to run the inference engine, you need to make the corresponding change in inference/normalize_regex_inference.py: add this modified function there and use it.
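To illustrate the idea of wrapping URLs, emails, numbers, and custom cases with DNT tags, here is a minimal sketch. It is not the code from the IndicTrans2 scripts: the function name wrap_dnt, the extra_patterns parameter, and all regex patterns are assumptions made up for this example, and the real patterns in the repository may differ.

```python
import re

# Hypothetical default patterns (assumptions, not the repository's own).
# Order matters: URLs and emails are tried before bare numbers so that
# digits inside a URL are not wrapped separately.
DEFAULT_PATTERNS = [
    r"https?://\S+|www\.\S+",       # URLs
    r"[\w.+-]+@[\w-]+\.[\w.-]+",    # email addresses
    r"\b\d+(?:[.,]\d+)*\b",         # numbers such as 2024 or 1,000.5
]


def wrap_dnt(text, extra_patterns=()):
    """Wrap every match of the patterns in <dnt> ... </dnt> tags.

    extra_patterns are custom regex strings; they are tried first so
    they take priority over the defaults.
    """
    patterns = list(extra_patterns) + DEFAULT_PATTERNS
    # One combined alternation, applied in a single pass, so text that
    # has already been wrapped is never matched a second time.
    combined = re.compile("|".join(f"(?:{p})" for p in patterns))
    return combined.sub(lambda m: f"<dnt> {m.group(0)} </dnt>", text)


if __name__ == "__main__":
    print(wrap_dnt("Mail a.b@example.org or see https://example.com/docs before 2024."))
    # A custom pattern for ticket IDs such as AB-123 (illustrative only).
    print(wrap_dnt("Ticket AB-123 is open", extra_patterns=[r"\b[A-Z]{2,}-\d+\b"]))
```

The same wrapping would need to be applied consistently to the fine-tuning data and at inference time, since the model can only learn to preserve tagged spans it has seen during training.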
Can you please explain how I can use tags while translating?