Helsinki-NLP / OPUS-CAT

OPUS-CAT is a collection of software that makes it possible to use OPUS-MT neural machine translation models in professional translation workflows. OPUS-CAT includes a local offline MT engine and a collection of CAT tool plugins.
MIT License

Upper v lowercase MT quality issue #47

Open SafeTex opened 2 years ago

SafeTex commented 2 years ago

Hello Tommi and all

A translator on the memoQ IO groups forum mentioned that segments in upper case are not translated nearly as well by OPUS as those in lower case. I decided to check this out for myself, and he is completely right. I took a couple of Swedish sentences from an easy website, pasted them all twice into Word, converted one copy of the identical text into upper case, and ran everything through the OPUS-CAT tool. The results are in the attached file "upper v lower case". What do you make of this, Tommi? Thanks in advance

Dave Neve

TommiNieminen commented 2 years ago

The reason for the quality drop with UPPER CASE TEXT is that OPUS-MT and Tatoeba models consider upper and lower case characters to be entirely different symbols, so they are translated in completely different ways. That might seem weird, but the motivation is that words and texts written in upper case often have a different function than the same words and texts written in lower case, e.g. upper case can communicate headings, warnings etc., which are translated differently than when the same words occur in lower case. So that's the theory, but in practice it seems that there's not enough upper case text in the training data for the NMT model to learn how to deal with it. I'll make a note of this problem, in case there's an opportunity to change the current preprocessing to handle cases in a more rational way (but that change would only affect future models).
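To see why the model treats the two variants as unrelated, a toy case-sensitive vocabulary can stand in for the real subword vocabulary (this is an illustrative sketch, not the actual OPUS-MT tokenizer): differently cased forms of the same word get completely unrelated IDs, so the model sees two different sentences.

```python
# Toy case-sensitive vocabulary, analogous to how a case-sensitive NMT
# subword vocabulary assigns unrelated IDs to differently cased variants
# of the same word.
vocab = {}

def token_id(token: str) -> int:
    """Assign each distinct (case-sensitive) token its own integer ID."""
    if token not in vocab:
        vocab[token] = len(vocab)
    return vocab[token]

ids_lower = [token_id(t) for t in "danger high voltage".split()]
ids_upper = [token_id(t) for t in "DANGER HIGH VOLTAGE".split()]

# The two ID sequences share no elements: to the model, the upper-case
# sentence is an entirely different input, and if ALL CAPS text is rare
# in training, the model never learns how to translate it.
print(ids_lower, ids_upper)
```

In a real model the vocabulary is built from subwords rather than whole words, but the effect is the same: symbols differing only in case carry no built-in relationship.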

For the current models, you can circumvent the problem by using the edit rules, since they support changing character case. There's a pre-edit rule in the documentation which relates directly to this case (i.e. it lower-cases input to the MT engine): https://helsinki-nlp.github.io/OPUS-CAT/editrules#case_conversion. If you want to revert to upper case in the machine translation, you would need a post-edit rule that would do the reverse operation, i.e. upper-case everything, like this:

[screenshot: a post-edit rule that converts the MT output back to upper case]
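The pre-edit/post-edit pair described above can be sketched in a few lines of Python (a minimal illustration of the idea, not the OPUS-CAT edit-rule syntax itself; `mt_out` stands in for the actual MT call):

```python
def pre_edit(segment: str) -> str:
    """Lowercase segments that are entirely upper case before sending them to MT."""
    if segment.isupper():
        return segment.lower()
    return segment

def post_edit(original: str, translation: str) -> str:
    """If the source segment was all caps, restore upper case on the MT output."""
    if original.isupper():
        return translation.upper()
    return translation

src = "VARNING: HALT UNDERLAG"          # hypothetical Swedish source segment
lowered = pre_edit(src)                  # what the MT engine actually sees
mt_out = "warning: slippery surface"     # stand-in for the engine's output
final = post_edit(src, mt_out)           # -> "WARNING: SLIPPERY SURFACE"
print(final)
```

Mixed-case segments pass through both functions unchanged, so the rules only fire on the problematic all-caps input.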

SafeTex commented 2 years ago

Hello Tommi

Thanks for the detailed explanation. I've entered the regex you gave us for converting upper case to lower case in the pre-editing phase. Who would ever have thought that upper case would be translated differently from lower case (not me, in any case)? Thanks for all your help.

Dave Neve

SafeTex commented 2 years ago

Hello Tommi

I've been keeping an eye on this issue since I was made aware of it. All-uppercase text does often occur in titles and headings, where we try to be more concise. But having now seen how poor OPUS's translations of it were in my last job, I would personally be in favour of OPUS translating all upper-case letters as if they were lower case.

Just a bit of input for you (not a criticism of any sort)

Regards

TommiNieminen commented 2 years ago

That's been my experience with upper-case text, as well, and I agree that upper-case text should be handled differently. The current models can't be changed, but the case handling could be modified in future models, so I'll copy this thread to Jörg Tiedemann who runs the OPUS model training.

@jorgtied Currently the OPUS models don't handle ALL CAPS text well, probably because there isn't enough of it in the training corpora. The motivation for training models on text with its original casing is that it avoids the problem of truecasing/recasing, and that casing occasionally has semantic significance (e.g. ALL CAPS text being mostly headings or non-translatables etc.). However, the scarcity of ALL CAPS training data means that in practice the models will not learn to handle ALL CAPS text properly. So it would probably be best to start using truecasing or recasing (or possibly even casing factors) in the models.
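For readers unfamiliar with the jargon: truecasing normalizes each word to its most frequent "natural" casing before training and translation, removing positional capitalization (e.g. at sentence starts) while keeping lexical capitalization (e.g. proper nouns). A heavily simplified sketch of the idea, loosely in the style of the Moses truecaser (my own toy illustration, not OPUS tooling):

```python
from collections import Counter

def train_truecaser(corpus):
    """Learn each word's most frequent casing from non-sentence-initial positions."""
    counts = {}
    for sentence in corpus:
        # Skip the first token: its capitalization is positional, not lexical.
        for word in sentence.split()[1:]:
            counts.setdefault(word.lower(), Counter())[word] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def truecase(sentence, model):
    """Replace the sentence-initial word with its learned casing (default: lower case)."""
    words = sentence.split()
    if not words:
        return sentence
    words[0] = model.get(words[0].lower(), words[0].lower())
    return " ".join(words)

corpus = [
    "I visited Helsinki last summer",
    "the weather in Helsinki was mild",
]
model = train_truecaser(corpus)
print(truecase("Helsinki is the capital of Finland", model))
# "Helsinki" keeps its capital (it is capitalized mid-sentence in the corpus),
# while an ordinary sentence-initial word like "The" would be lowered.
```

Recasing works in the opposite direction: the model is trained on lowercased text and a separate step restores casing in the output. Casing factors instead attach case as a separate feature to each (lowercased) token, so "DANGER" and "danger" share one embedding.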

SafeTex commented 2 years ago

That's exactly what I would have said but without the technical jargon of course (as I don't know it)

Thanks for this