Helsinki-NLP / OPUS-CAT

OPUS-CAT is a collection of software that makes it possible to use OPUS-MT neural machine translation models in professional translation. OPUS-CAT includes a local offline MT engine and a collection of CAT tool plugins.
MIT License

Add non-translatable functionality #70

Open SafeTex opened 1 year ago

SafeTex commented 1 year ago

Hello Tommi and all

My present job involves a lot of Swedish proper nouns for organizations, associations etc. Fiskmö changes words that it can't understand but does not actually translate them, as in:

MT

I can understand that if the MT engine could translate, say, 90% of such proper nouns, it might be programmed or tempted to do so, but it's much more debatable here, as Fiskmö has not translated any part of the word(s).

Would it not be better for Fiskmö to leave the word then? On what basis does it change a word without ever translating it? It seems strange, especially in the second example, "Guldsmedsbranschens Leverantörsförening" > "Goldsmedsbrakensförening," for reasons evident to you as a Swedish speaker.

What do you make of this Tommi and others please?

Thanks

TommiNieminen commented 1 year ago

These are proper nouns that have probably never occurred in the training material, so the NMT system has no clear examples of how to handle them. Ideally the system still learns to identify unseen proper nouns (probably based on features such as capitalization and certain trigger words) and also learns to copy them into the translation in the same form. But the process is fuzzy (by necessity, since proper noun translation is pretty fuzzy; consider e.g. organization names that ARE translated, like the UN). Here the model has learnt a weird mixed behavior, where it corrupts the proper noun while still keeping it in Swedish.

Some kind of named entity recognition, combined with an option where you could specify whether entities need to be translated or copied into the translation, might be a good idea. I'll mark this as a potential improvement (it also has some synergies with the terminology support).
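
To make the "copy into the translation" option concrete, here is a rough sketch of one common way to do it: swap each protected term for a placeholder before MT and swap it back afterwards. This is only an illustration, not how OPUS-CAT works. The function names, the `__NT0__` placeholder format and the `translate` callable are all made up for the example, and a real NMT model may still alter or drop placeholders it has not been trained to copy.

```python
from typing import Callable, Dict, List, Tuple


def mask_terms(text: str, terms: List[str]) -> Tuple[str, Dict[str, str]]:
    """Replace each non-translatable term with a placeholder token.

    The "__NT<n>__" token format is a hypothetical choice for this sketch.
    """
    mapping: Dict[str, str] = {}
    # Longest terms first, so a short term cannot split a longer one.
    for term in sorted(terms, key=len, reverse=True):
        if term not in text:
            continue
        token = f"__NT{len(mapping)}__"
        mapping[token] = term
        text = text.replace(term, token)
    return text, mapping


def unmask_terms(text: str, mapping: Dict[str, str]) -> str:
    """Put the original terms back in place of their placeholders."""
    for token, term in mapping.items():
        text = text.replace(token, term)
    return text


def translate_protected(
    text: str, terms: List[str], translate: Callable[[str], str]
) -> str:
    """Translate text while copying the listed terms through unchanged.

    `translate` stands in for whatever MT engine is used (an assumption).
    """
    masked, mapping = mask_terms(text, terms)
    return unmask_terms(translate(masked), mapping)
```

With a list such as `["Guldsmedsbranschens Leverantörsförening"]`, the engine would only ever see the placeholder, so it has no chance to corrupt the name. Whether the placeholder itself survives translation depends on the model.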

SafeTex commented 1 year ago

Hello Tommi and all

Just in case you don't know, memoQ also has a "non translatable" feature that is separate from its TB (termbase). I'm going to send you a non translatable file so you can see its structure. Ideally, it would be great if Opus could handle such files rather than translators adding "non translatable" terms to Opus one by one.

I know that's asking a lot (again) but if I don't mention it and send you such a file, then there's even less chance of Opus being able to handle such a file.

But as it's a text file, I guess that translators could remove the header and tags, if that is what it takes to load such a file into Opus in one go.

Regards Dave
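
For what it's worth, the "remove the header and tags" step described above could probably be automated. The sketch below is only a guess at the general shape: the actual layout of a memoQ non-translatable file is not shown in this thread, so the assumptions here (one term per line, angle-bracket tags, a fixed number of header lines to skip) are hypothetical, and the `read_terms` name is made up for the example.

```python
import re
from typing import Iterable, List

# Matches anything that looks like an angle-bracket tag (assumed format).
TAG = re.compile(r"<[^>]+>")


def read_terms(lines: Iterable[str], header_lines: int = 0) -> List[str]:
    """Return the unique terms from a list file, in their original order.

    Skips the first `header_lines` lines, strips tag-like markup and
    surrounding whitespace, and ignores blank lines and duplicates.
    """
    terms: List[str] = []
    seen = set()
    for index, line in enumerate(lines):
        if index < header_lines:
            continue
        term = TAG.sub("", line).strip()
        if term and term not in seen:
            seen.add(term)
            terms.append(term)
    return terms
```

Something like `read_terms(open(path, encoding="utf-8"), header_lines=1)` would then give a plain list of terms, with `path` and the header length depending on the real file.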