Helsinki-NLP / OPUS-CAT

OPUS-CAT is a collection of software that makes it possible to use OPUS-MT neural machine translation models in professional translation. OPUS-CAT includes a local offline MT engine and a collection of CAT tool plugins.
MIT License

long translation segments truncated #93

Open mrpi007 opened 2 months ago

mrpi007 commented 2 months ago

Hi, I am a professional freelance translator (en>de) and have mainly translated patents for many years. I am also an amateur Python coder and have built my own MT tools to help with my translation work. So far, no other MT system I tried has come close enough to my own search-and-replace/rule-based system to consider switching. Until I recently came across OPUS-CAT, that is. After fine-tuning one of the basic OPUS models with my massive patent TMX file, the results are stunning, even scary for someone making a living with translation.

I decided to try OPUS-CAT more thoroughly. I tested it with the SDL Trados plugin and found a way to successfully convert a fine-tuned OPUS-CAT model for CTranslate2 in order to use it with my own Python code. There were, of course, a number of issues along these routes, and I could solve all but one: long English input segments (>60 words, very common in patents) consistently seem to produce translations that appear to be truncated after a certain, but still varying, length/number of words. Although I didn't really know what I was doing, I tried a few different configuration parameter settings, both for fine-tuning the model and for using the model with CTranslate2, with no success at all.

Hopefully you have some good ideas or pointers on how to make OPUS-CAT digest English segments of up to 400 words and produce a German result of that full length.

TommiNieminen commented 2 months ago

Hi,

Good to hear OPUS-CAT is working for you. The problems you are having with it are probably because there is an input truncation setting in the Marian NMT decoder that is used in OPUS-CAT. This is normal in NMT: translating very long sentences is slow, and the model has not been trained to handle them, since very long sentences are usually removed from the training data. This has been an issue before, and there is a version of OPUS-CAT with functionality for working around the problem of really long input. You can get it here: https://github.com/Helsinki-NLP/OPUS-CAT/releases/tag/engine_v1.2.4

In this version, long input sentences are split into smaller chunks, and the translations are then merged (this can cause some grammaticality problems in the merging points). Here are instructions for enabling the functionality:

The splitting feature is not on by default. You can enable it in the OpusCatMTEngine.exe.config file (in the same folder as the OPUS-CAT executable). The relevant configuration parameters are the following:

MaxLength: Default value is 200 (this refers to subword units, so it's less than 200 real words).

FixUnbalancedLongTranslations: Default value is False. Changing the value to True will enable the splitting feature.

UnbalancedSplitPatterns: This is a list of patterns that will be used to split the source sentence when the translation is significantly shorter than the source text. The splitting algorithm iterates the list from the top, looking for instances of each pattern in the source sentence. When it finds a match or multiple matches for a pattern, it will split the source sentence into two at the location of the centermost match. The two parts are then translated separately, and they can also be recursively split into smaller parts if the translations continue to be significantly shorter than the source text.

UnbalancedSplitMinLength: Default value is 100. This is the minimum source sentence length in subword units for the splitting function to be applied. The motivation for this limit is that for relatively short source sentences the translation might legitimately be much shorter, but in longer sentences the lengths tend to even out.

UnbalancedSplitLengthRatio: Default value is 1.5. This is the ratio of source text length to translation length that determines whether the translation is considered to be too short. So if the source text length is 150, the translation is considered to be too short if its length is 100 or less.
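
To make the interplay of these parameters concrete, here is a rough Python sketch of the splitting logic as described above. It is not the actual OPUS-CAT code, only an illustration under some assumptions: the function names, the `translate` callable and the example patterns are hypothetical, length is counted in whitespace-separated words as a stand-in for subword units, and the split is made right after the matched pattern.

```python
import re
from typing import Callable, List, Optional, Tuple


def is_too_short(src_len: int, tgt_len: int, ratio: float = 1.5) -> bool:
    """True when source length / translation length reaches the ratio.

    With ratio 1.5 and a source length of 150, a translation of
    length 100 or less counts as too short.
    """
    return tgt_len == 0 or src_len / tgt_len >= ratio


def split_at_centermost(source: str, patterns: List[str]) -> Optional[Tuple[str, str]]:
    """Try the patterns in list order; split at the match nearest the middle."""
    middle = len(source) / 2
    for pattern in patterns:
        # Keep only matches that leave text on both sides of the split.
        matches = [
            m for m in re.finditer(pattern, source)
            if source[:m.end()].strip() and source[m.end():].strip()
        ]
        if matches:
            best = min(matches, key=lambda m: abs(m.end() - middle))
            # Assumption: the matched text stays with the left part.
            return source[:best.end()], source[best.end():]
    return None


def translate_long(
    source: str,
    translate: Callable[[str], str],
    patterns: List[str],
    min_length: int = 100,
    ratio: float = 1.5,
) -> str:
    """Translate, and split recursively while the output looks truncated."""
    def length(text: str) -> int:
        # Stand-in for subword units: whitespace-separated words.
        return len(text.split())

    translation = translate(source)
    if length(source) < min_length:
        return translation  # short input: a short translation may be legitimate
    if not is_too_short(length(source), length(translation), ratio):
        return translation
    parts = split_at_centermost(source, patterns)
    if parts is None:
        return translation  # no split point found, keep what we have
    return " ".join(
        translate_long(part.strip(), translate, patterns, min_length, ratio)
        for part in parts
    )
```

With a pattern list such as `[r";\s", r",\s"]`, semicolons are tried before commas, so the coarser boundary is preferred when it exists. Each half is strictly shorter than its parent, so the recursion ends once the halves fall below the minimum length or no pattern matches.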

mrpi007 commented 2 months ago

Hi Tommi, thank you for your detailed reply. I upgraded the OpusCatMTEngine from v1.2.0 to v1.2.4, set FixUnbalancedLongTranslations to True, increased MaxLength and ran the fine-tuning with my patent TMX file again. I used the smallest OPUS model available for EN > DE, since I had noticed before that the bigger the OPUS corpus is, the shorter the non-truncated segments are. My guess is that the difference in the fine-tuned model comes from the larger share of long and very long segments in my patent TMX compared with the much shorter segments in the OPUS model. Then I put this fine-tuned model to work on a bunch of long segments of 150 to 400 words. While, at first glance, they were translated without truncation, the results were still underwhelming. Here are some observations:

TommiNieminen commented 1 month ago

Thanks for your feedback, looks like the long sentence handling should be improved, so I'll mark this as an enhancement for the future.