long translation segments truncated

mrpi007 commented 2 months ago

Hi, I am a professional freelance translator (en>de) and have mainly translated patents for many years. I am also an amateur Python coder and have built my own MT tools to help with my translation work. So far, no other MT system I tried came close enough to my own search-and-replace/rule-based system for me to consider switching. Until I recently came across OPUS-CAT, that is. After fine-tuning one of the basic OPUS models with my massive patent TMX file, the results are stunning, even scary for someone making a living from translation. I decided to try OPUS-CAT more thoroughly. I tested it with the SDL Trados plugin and found a way to convert a fine-tuned OPUS-CAT model for CTranslate2 so that I can use it with my own Python code. There were, of course, a number of issues along the way, and I could solve all but one: long English input segments (>60 words, very common in patents) consistently produce translations that appear to be truncated after a certain, though varying, number of words. Although I didn't really know what I was doing, I tried a few different configuration parameter settings, both for fine-tuning the model and for using the model with CTranslate2, with no success at all. Hopefully you have some good ideas or pointers on how to make OPUS-CAT digest English segments of up to 400 words and produce a German result of the full length.

TommiNieminen commented 2 months ago

Hi,

Good to hear OPUS-CAT is working for you. The problems you are having are probably caused by an input truncation setting in the Marian NMT decoder that OPUS-CAT uses. This is normal in NMT, since translating very long sentences is slow, and the model has not been trained to handle them, because very long sentences are usually removed from the training data. This has been an issue before, and there is a version of OPUS-CAT with functionality for working around the problem of really long input. You can get it from here: https://github.com/Helsinki-NLP/OPUS-CAT/releases/tag/engine_v1.2.4

In this version, long input sentences are split into smaller chunks, and the translations are then merged (this can cause some grammaticality problems in the merging points). Here are instructions for enabling the functionality:

The splitting feature is not on by default; you can enable it in the OpusCatMTEngine.exe.config file (in the same folder as the OPUS-CAT executable). The relevant configuration parameters are the following:

MaxLength: Default value is 200 (this refers to subword units, so it's less than 200 real words).

FixUnbalancedLongTranslations: Default value is False. Changing the value to True will enable the splitting feature.

UnbalancedSplitPatterns: This is a list of patterns that will be used to split the source sentence when the translation is significantly shorter than the source text. The splitting algorithm iterates the list from the top, looking for instances of each pattern in the source sentence. When it finds a match or multiple matches for a pattern, it will split the source sentence into two at the location of the centermost match. The two parts are then translated separately, and they can also be recursively split into smaller parts if the translations continue to be significantly shorter than the source text.

UnbalancedSplitMinLength: Default value is 100. This is the minimum source sentence length in subword units for the splitting function to be applied. The motivation for this limit is that for relatively short source sentences the translation might legitimately be much shorter, but in longer sentences the lengths tend to even out.

UnbalancedSplitLengthRatio: Default value is 1.5. This is the ratio of source text length to translation length that determines whether the translation is considered to be too short. So if the source text length is 150, the translation is considered to be too short if its length is 100 or less.
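
To make the interplay of these parameters easier to follow, here is a minimal Python sketch of the logic as described above. It is not the OPUS-CAT implementation: the function names are hypothetical, `translate` stands for any callable that maps a source string to a translation, lengths are counted in whitespace-separated tokens instead of subword units, and the choice to cut directly after the matched pattern is an assumption, since the description does not say which part the pattern ends up in.

```python
import re
from typing import Callable, List, Optional, Tuple


def token_len(text: str) -> int:
    # Assumption: whitespace tokens stand in for the subword units
    # that the real engine counts.
    return len(text.split())


def is_too_short(source: str, translation: str,
                 min_length: int = 100, ratio: float = 1.5) -> bool:
    """Mirror UnbalancedSplitMinLength and UnbalancedSplitLengthRatio."""
    src_len = token_len(source)
    if src_len < min_length:
        # Short sources may legitimately have much shorter translations.
        return False
    # Too short when source length / translation length >= ratio.
    return src_len >= ratio * token_len(translation)


def split_at_centermost(source: str,
                        patterns: List[str]) -> Optional[Tuple[str, str]]:
    """Try the patterns in order; cut at the match closest to the middle."""
    middle = len(source) / 2
    for pattern in patterns:
        matches = list(re.finditer(pattern, source))
        if not matches:
            continue
        best = min(matches, key=lambda m: abs(m.start() - middle))
        # Assumption: the matched text stays with the first part.
        left, right = source[:best.end()], source[best.end():]
        if left.strip() and right.strip():
            return left, right
    return None


def translate_with_splitting(source: str,
                             translate: Callable[[str], str],
                             patterns: List[str],
                             min_length: int = 100,
                             ratio: float = 1.5) -> str:
    translation = translate(source)
    if not is_too_short(source, translation, min_length, ratio):
        return translation
    parts = split_at_centermost(source, patterns)
    if parts is None:
        # No usable split point: keep the (possibly truncated) result.
        return translation
    # Translate both halves separately; each may be split again.
    return " ".join(
        translate_with_splitting(part.strip(), translate, patterns,
                                 min_length, ratio)
        for part in parts
    )
```

With the defaults, a 150-token source paired with a 100-token translation is flagged as too short, while a 101-token translation passes, which matches the example given for UnbalancedSplitLengthRatio.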

mrpi007 commented 2 months ago

Hi Tommi, thank you for your detailed reply. I upgraded the OpusCatMTEngine from v1.2.0 to v1.2.4, set FixUnbalancedLongTranslations to True, increased MaxLength and ran the fine-tuning with my patent TMX file again. I used the smallest OPUS model available for EN > DE, since I had noticed before that the bigger the OPUS corpus is, the shorter the non-truncated segments are. My guess is that the difference in the fine-tuned model comes from the larger share of long and very long segments in my patent TMX compared with the much shorter segments in the OPUS model. Then I put this fine-tuned model to work on a bunch of long segments of 150 to 400 words. While, at first glance, they were translated without truncation, the results were still underwhelming. Here are some observations:

some of the split shorter segment parts were still truncated
naturally there were those grammaticality problems around the merging points
the post-edit rules were applied correctly to some of the segment parts but not to all of them. I tested this with the rule editor, and it seems there is also a problem with longer texts.
using UnbalancedSplitLengthRatio to determine whether a translation got truncated causes trouble. Sometimes (partial) segments get split although they are not truncated; other times they are truncated (a bit) but pass because their ratio of source text length to translation length is still OK. Combined, this results in a toxic MT post-editing mess. There must be better ways. For instance, checking whether a certain punctuation mark at the end of the source text is also present at the end of the translated text could give a better clue. Such a mark could also simply be added to the source as an indicator and removed from the translation before merging.
UnbalancedSplitPatterns are a good way to minimize those grammaticality problems at the merging points. To be really useful, some details must be clarified. In patents, for instance, "," alone is not a good split pattern, because patent texts are usually littered with enumerations. ", wherein", on the other hand, would be a perfect split pattern, because it marks a hard grammatical break. So this pattern "will split the source sentence into two at the location of the centermost match. The two parts are then translated separately". To which part will the split pattern itself belong? In this example, the "," should ideally go to the first part, whereas "wherein" should ideally go to the second part. I am sure that I could easily identify a sufficient number of such split patterns in my daily workload of patent texts to break long source segments down into parts small enough to be translated without truncation and without generating grammaticality problems. However, questions like the one about ", wherein" would have to be answered first.
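
To illustrate the last two points, here is a minimal Python sketch of what I have in mind. It is only an illustration under my own assumptions, not how OPUS-CAT works: the function names are hypothetical, `translate` stands for any callable that maps a source string to a translation, and whether the MT model reliably reproduces a final mark is something that would have to be tested. The split uses a lookbehind and a lookahead so that the "," stays with the first part and "wherein" starts the second part.

```python
import re
from typing import Callable, List, Tuple


def split_before_wherein(source: str) -> List[str]:
    # Split on the whitespace between "," and "wherein": the lookbehind
    # keeps the comma in the preceding part, the lookahead keeps
    # "wherein" at the start of the following part.
    return re.split(r"(?<=,)\s+(?=wherein\b)", source)


def translate_with_sentinel(part: str,
                            translate: Callable[[str], str],
                            mark: str = ".") -> Tuple[str, bool]:
    """Translate one part and report whether the end mark survived."""
    text = part.rstrip()
    added = not text.endswith(mark)
    if added:
        # Add the mark as a truncation indicator.
        text += mark
    output = translate(text).rstrip()
    # Heuristic: a missing end mark suggests a truncated translation.
    complete = output.endswith(mark)
    if added and complete:
        # Remove the indicator again before merging.
        output = output[:-len(mark)].rstrip()
    return output, complete


claim = ("A device comprising a housing, wherein the housing is sealed, "
         "wherein the seal is made of rubber.")
for part in split_before_wherein(claim):
    # str.upper is a stand-in for a real MT call.
    print(translate_with_sentinel(part, str.upper))
```

A part that ends in "," gets the mark appended directly after the comma, which may look odd to the model, so a real version would probably need a more careful choice of indicator.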

TommiNieminen commented 1 month ago

Thanks for your feedback. It looks like the long sentence handling should be improved, so I'll mark this as an enhancement for the future.

Helsinki-NLP / OPUS-CAT

long translation segments truncated #93