Helsinki-NLP / OPUS-CAT

OPUS-CAT is a collection of software which makes it possible to use OPUS-MT neural machine translation models in professional translation. OPUS-CAT includes a local offline MT engine and a collection of CAT tool plugins.
MIT License

Translation processing problem #52

Open Khalid-kamal opened 2 years ago

Khalid-kamal commented 2 years ago

When a sentence contains a run of dots in the middle, the translation is not completed: only the part before the dots is translated, and the portion after the dots is ignored. For example:

The officers and employees of the Bank, who are not local nationals of the Kingdom of ................... shall be exempt from customs duties and other levies, prohibitions and restrictions on the importation of motor vehicles and spare parts thereof, and household effects, equipment and furniture.

The result covers only the first part, up to "Kingdom of".

SafeTex commented 2 years ago

Hello Khalid

Are you translating into Arabic by any chance?

I wouldn't be surprised if this has something to do with right-to-left languages, but I'm only guessing, of course.

The thing is that when I tested what you described in one of my language pairs (Swedish to English), OPUS-CAT translated everything (see the attached file, "dot translation").

Khalid-kamal commented 2 years ago

So it seems that the problem lies in the target language, but this should not happen, since the tool counts the source words and compares them to the target words. It may be a bug that needs to be fixed. Thanks for your guess.

TommiNieminen commented 2 years ago

I don't seem to be able to reproduce this issue, at least with the opus+bt-2021-04-13 English to Arabic model. Do you have more information about the contexts in which this issue occurs?

Khalid-kamal commented 2 years ago

Tommi, would you try this sentence and see the result:

1996 ................... among certain African states and international organizations;

image

Khalid-kamal commented 2 years ago

Here is the database image

TommiNieminen commented 2 years ago

That looks like a fine-tuned model, so it's possible that this is caused by the fine-tuning process. Since the data used for fine-tuning is generally very domain-specific, it may cause performance to degrade with source texts that don't belong to the fine-tuning domain (such as these kinds of texts, where a series of periods is used as a placeholder).

How much data did you use to fine-tune the model, and what sort of data was it? Another complicating factor is that the Arabic models are multilingual models, i.e. they support multiple variants of Arabic, which might affect fine-tuning.

Khalid-kamal commented 2 years ago

Over one million segments

Khalid-kamal commented 2 years ago

Almost all of the data is in the main domain

TommiNieminen commented 2 years ago

OK, that's a lot of data. It does sound like the problem with the repeated periods is caused by the fine-tuning. If there are other errors in the translations besides the problem with the repeated periods, I would advise fine-tuning with a smaller, more targeted set of segments.

If the model translates OK otherwise, it's also possible to use a pre-edit rule to edit those problematic sentences automatically before they are translated. For instance, you could use a rule like this:

image

This rule would truncate every series of repeated periods to five periods, which might be easier for an MT model to handle.
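Since the screenshot of the rule is not reproduced here, the following is only a rough sketch of what such a truncation does, written in Python rather than as an actual OPUS-CAT rule. The function name `truncate_periods`, the `keep` parameter, and the exact regular expression are assumptions made for illustration; the pattern in the real pre-edit rule may differ.

```python
import re


def truncate_periods(segment: str, keep: int = 5) -> str:
    """Shorten every run of more than `keep` periods to exactly `keep` periods.

    Hypothetical helper: it mimics the effect described for the pre-edit
    rule and is not taken from OPUS-CAT itself.
    """
    # Match runs of at least keep + 1 consecutive literal periods.
    pattern = r"\.{%d,}" % (keep + 1)
    return re.sub(pattern, "." * keep, segment)


source = (
    "who are not local nationals of the Kingdom of "
    "................... shall be exempt from customs duties"
)
print(truncate_periods(source))
```

Runs of five or fewer periods, such as an ordinary ellipsis, are left untouched, so normal punctuation in the source text is not affected.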