Open Khalid-kamal opened 2 years ago
Hello Khalid
Are you translating into Arabic by any chance?
I wouldn't be surprised if this has something to do with right-to-left languages, but I'm only guessing, of course.
The thing is that when I tested what you said in one of my language pairs (Swedish to English), Opus CAT translated everything (see attached file)
So it seems that the problem lies in the language you are translating into, but this should not happen, since the tool counts the source words and compares them to the target words. It may be a bug that needs to be fixed. Thanks for your guess.
I don't seem to be able to reproduce this issue, at least with the opus+bt-2021-04-13 English to Arabic model. Do you have more information about the contexts in which this issue occurs?
Tommi, would you try this sentence and see the result: 1996 ................... among certain African states and international organizations;
Here is the database
That looks like a fine-tuned model, so it's possible that this is caused by the fine-tuning process. Since the data used for fine-tuning is generally very domain-specific, it may cause performance to degrade on source texts that don't belong to the fine-tuning domain (such as these kinds of texts, where a series of periods is used as a placeholder).
How much data did you use to fine-tune the model, and what sort of data was it? Another complicating factor is that the Arabic models are multilingual models, i.e. they support multiple variants of Arabic, which might affect fine-tuning.
Over one million segments
Most of the data is in, or close to, the main domain
Ok, that's a lot of data. It does sound like the problem with the repeated periods is caused by the fine-tuning. If there are other errors in the translations besides the problem with the repeated periods, I would advise fine-tuning with a smaller, more targeted set of segments.
If the model translates OK otherwise, it's also possible to use a pre-edit rule to edit those problematic sentences automatically before they are translated. For instance, you could use a rule like this:
This rule would truncate every series of repeated periods to five periods, which might be easier for an MT model to handle.
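The idea of truncating long runs of periods can be sketched outside the tool as a plain regular-expression replacement. The snippet below is only an illustration in Python, assuming the rule amounts to a regex search and replace; the function name `truncate_periods` and the `max_len` parameter are hypothetical and are not part of Opus CAT.

```python
import re


def truncate_periods(text: str, max_len: int = 5) -> str:
    """Shorten any run of more than max_len periods to exactly max_len.

    Hypothetical helper for illustration; not the tool's own rule syntax.
    """
    # Match max_len + 1 or more consecutive periods.
    pattern = re.compile(r"\.{%d,}" % (max_len + 1))
    return pattern.sub("." * max_len, text)


if __name__ == "__main__":
    source = "1996 ................... among certain African states"
    print(truncate_periods(source))
```

Runs of five or fewer periods, such as an ordinary ellipsis, are left untouched, so only the placeholder-style dot leaders are affected.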
When a sentence has dots in the middle, the translation is not completed: only the first part is translated, and the portion after the dots is ignored. For example: The officers and employees of the Bank, who are not local nationals of the Kingdom of ................... shall be exempt from customs duties and other levies, prohibitions and restrictions on the importation of motor vehicles and spare parts thereof, and household effects, equipment and furniture. The result covers only the first part, up to "Kingdom of".