Helsinki-NLP / OPUS-CAT

OPUS-CAT is a collection of software that makes it possible to use OPUS-MT neural machine translation models in professional translation. OPUS-CAT includes a local offline MT engine and a collection of CAT tool plugins.
MIT License

Tag positions in Opus and "\tag" in NET Regular Expressions #72

Open SafeTex opened 1 year ago

SafeTex commented 1 year ago

Hello Tommi and all

Opus seems to put all the tags in a source segment at the end of the target segment.

I can understand that when words or phrases are tagged, it must be very hard for any MT engine to reposition the tags correctly in the target.

But I'd like to look at the case where a source segment opens and closes with a tag, and Opus puts both of these tags at the end of the target segment. Can this be improved?

Also, I could not do anything today about this in Phrase (formerly MemSource) except to move the tags manually.

However, in memoQ, I can deal with these simpler cases, as memoQ has added "\tag" to its .NET regular expressions engine. So:

Find in target: ^(.+)(\tag)(\tag)$
Replace with: $2$1$3

worked, and in semi-automatic mode I was able to deal with the majority of cases.
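For readers without memoQ, the same find-and-replace can be approximated in Python. This is only a sketch: Python's `re` module has no `\tag` token, so it assumes inline tags appear in the exported text as numbered placeholders such as `{1}` and `{2}`. The placeholder format and the function name `wrap_with_trailing_tags` are illustrative assumptions, not anything memoQ or OPUS-CAT defines.

```python
import re

# Assumption: inline tags are rendered as numbered placeholders like {1}, {2}.
# TAG stands in for memoQ's \tag token, which Python's re module does not have.
TAG = r"\{\d+\}"

# Same shape as ^(.+)(\tag)(\tag)$ : some text followed by two trailing tags.
PATTERN = re.compile(rf"^(.+)({TAG})({TAG})$")


def wrap_with_trailing_tags(target: str) -> str:
    """Move the first of two trailing tags to the start of the segment.

    Mirrors the replacement $2$1$3. Segments that do not end in two tags
    are returned unchanged.
    """
    return PATTERN.sub(r"\2\1\3", target)


print(wrap_with_trailing_tags("This is a heading{1}{2}"))
```

If a segment ends in more than two tags, the greedy `(.+)` swallows the earlier ones and only the last two are treated as the wrapping pair, which is one reason to review the replacements rather than apply them blindly.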

All this is to ask whether Opus could protect tags at the very start and end of segments in the future, and to let you know, in case you did not already, about "\tag" in memoQ, which you might find useful for Opus.
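One way to picture "protecting" the boundary tags is to strip them before translation and reattach them afterwards. The sketch below is not how OPUS-CAT works. It assumes tags appear as numbered placeholders such as `{1}`, and `translate` is a hypothetical callable standing in for whatever MT engine is used.

```python
import re
from typing import Callable

# Assumption: inline tags are rendered as numbered placeholders like {1}, {2}.
TAG = r"\{\d+\}"

# A segment that starts with one tag and ends with one tag.
BOUNDARY = re.compile(rf"^({TAG})(.*?)({TAG})$", re.DOTALL)


def translate_protecting_boundary_tags(
    source: str, translate: Callable[[str], str]
) -> str:
    """Translate a segment while keeping its first and last tag in place.

    `translate` is a hypothetical stand-in for the MT call. Segments that do
    not both start and end with a tag are passed to it unchanged.
    """
    match = BOUNDARY.match(source)
    if match is None:
        return translate(source)
    opening, inner, closing = match.groups()
    # Only the text between the boundary tags is sent to the MT engine.
    return opening + translate(inner) + closing


# Usage with a dummy "engine" that just upper-cases the text.
print(translate_protecting_boundary_tags("{1}Rubrik{2}", str.upper))
```

Any tags inside the segment are still handed to the engine, so this only addresses the simple case of a segment wrapped in a single pair of tags.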

Regards

SafeTex

TommiNieminen commented 1 year ago

Thanks, I'll keep the \tag convention in mind; it seems pretty useful. The tag functionality in OPUS-CAT should currently position tags according to the word alignments it generates. I haven't checked, but adding the tags at the end is probably the fallback behavior, so something seems to be interfering with the tag restoration. What model are you using when this happens?
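To illustrate the general idea of alignment-based tag placement with an append-at-end fallback, here is a minimal sketch. It is not OPUS-CAT's actual code. The function name `place_tags`, the token lists, and the alignment format (a mapping from source token index to target token index) are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def place_tags(
    target_tokens: List[str],
    tags: List[Tuple[int, str]],
    alignment: Dict[int, int],
) -> List[str]:
    """Reinsert tags into a translation using word alignments.

    tags: (source_token_index, tag_text) pairs; each tag sits directly
          before that source token.
    alignment: hypothetical source-index -> target-index mapping.
    Tags whose source token has no usable alignment are appended at the end,
    which is the fallback behavior described above.
    """
    before: Dict[int, List[str]] = {}
    trailing: List[str] = []
    for src_idx, tag in tags:
        tgt_idx = alignment.get(src_idx)
        if tgt_idx is None or not 0 <= tgt_idx < len(target_tokens):
            trailing.append(tag)  # fallback: no alignment available
        else:
            before.setdefault(tgt_idx, []).append(tag)

    result: List[str] = []
    for i, token in enumerate(target_tokens):
        result.extend(before.get(i, []))
        result.append(token)
    result.extend(trailing)
    return result


target = ["This", "is", "important"]
tags = [(2, "{1}")]  # tag sits before the third source token
print(place_tags(target, tags, {0: 0, 1: 1, 2: 2}))  # alignment available
print(place_tags(target, tags, {}))                  # alignment missing
```

In this sketch, an empty alignment sends every tag to the end of the segment, which matches the symptom reported in this thread.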

SafeTex commented 1 year ago

Hello Tommi

I'm using a trained Swedish-to-English model, and all tags are always put at the end. I even had a job where commas and full stops were tagged, because an OCR scan perceived a difference in font size, and even these tags ended up at the end of segments. How can I overcome this?

Thanks