Helsinki-NLP / OPUS-CAT

OPUS-CAT is a collection of software that makes it possible to use OPUS-MT neural machine translation models in professional translation. OPUS-CAT includes a local offline MT engine and a collection of CAT tool plugins.
MIT License
64 stars 9 forks

Finetuning says: "not enough parallel segments in the tmx" #88

Status: Closed (HMueller007 closed 10 months ago)

HMueller007 commented 10 months ago

Hi,

When I try to fine-tune the model with a TMX from a (Wordfast) project, it says: "not enough parallel segments in the TMX".

It has more than 600 bilingual segments (so about 1300 segments in total if you count source and target language segments separately) from a finished project. Is this really not enough? How many do you need?

Khalid-kamal commented 10 months ago

It needs at least 1000 TUs

TommiNieminen commented 10 months ago

Hi, the finetuning needs a bit of data to work on, so there's a minimum requirement of 1000 translation units (pairs of source and target language segments). This is an arbitrary number, and you probably need more than 1000 to have a noticeable effect. If you still want to try it with 600 translation units, you can change the FinetuningSetMinSize setting in the OpusCatMTEngine.exe.config file.
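To check in advance whether a TMX is likely to clear the threshold, the translation units can be counted outside OPUS-CAT. The following is a minimal sketch, not OPUS-CAT's own counting logic: it assumes a standard TMX layout (`<tu>` elements containing `<tuv>` variants with a `<seg>` each) and assumes that a unit only counts when at least two variants have non-empty text. The function name and the sample data are made up for illustration; the limit of 1000 is the value mentioned above.

```python
import xml.etree.ElementTree as ET

# Minimum mentioned in this thread; the real value comes from the
# FinetuningSetMinSize setting.
MIN_UNITS = 1000


def count_translation_units(root):
    """Count <tu> elements that have at least two non-empty <seg> variants.

    Assumption: this approximates what counts as a usable translation unit;
    OPUS-CAT may apply additional filtering.
    """
    count = 0
    for tu in root.iter("tu"):
        filled = 0
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            # itertext() also collects text around inline tags inside <seg>.
            if seg is not None and "".join(seg.itertext()).strip():
                filled += 1
        if filled >= 2:
            count += 1
    return count


# Illustrative sample: two complete units and one with an empty target.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en-US"/>
  <body>
    <tu>
      <tuv xml:lang="en-US"><seg>Hello world</seg></tuv>
      <tuv xml:lang="de-DE"><seg>Hallo Welt</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en-US"><seg>Untranslated</seg></tuv>
      <tuv xml:lang="de-DE"><seg></seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en-US"><seg>Save file</seg></tuv>
      <tuv xml:lang="de-DE"><seg>Datei speichern</seg></tuv>
    </tu>
  </body>
</tmx>"""

sample_root = ET.fromstring(SAMPLE_TMX)
sample_count = count_translation_units(sample_root)
print(f"{sample_count} translation units (minimum: {MIN_UNITS})")
```

For a real file, `ET.parse("project.tmx").getroot()` (with a hypothetical file name) can be passed in instead of the sample root.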

HMueller007 commented 10 months ago

Hi, thanks for the answers @all. Instead, I tried the function for uploading a source file and a corresponding target file derived from the same TM, and it worked: it improved the translations even with this small amount of data. But I might also try this other setting, thank you.
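For anyone who wants to derive such a source/target file pair from a TMX, here is a rough sketch. It assumes the files should be plain text with one segment per line and the same line order in both files; the exact format OPUS-CAT expects is not confirmed in this thread. The function names, language codes, and sample data are illustrative only.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under this qualified name.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def seg_text(tuv):
    """Return the segment text of a <tuv>, collapsed onto a single line."""
    seg = tuv.find("seg")
    if seg is None:
        return ""
    # Collapsing whitespace removes line breaks that would break alignment.
    return " ".join("".join(seg.itertext()).split())


def extract_pairs(root, src_lang, tgt_lang):
    """Collect (source, target) pairs for two-letter language codes.

    Assumption: region suffixes are ignored, so "en-US" matches "en".
    """
    pairs = []
    for tu in root.iter("tu"):
        texts = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            texts[lang.split("-")[0]] = seg_text(tuv)
        src = texts.get(src_lang.lower())
        tgt = texts.get(tgt_lang.lower())
        if src and tgt:  # skip units with a missing or empty side
            pairs.append((src, tgt))
    return pairs


def write_parallel_files(pairs, src_path, tgt_path):
    """Write the pairs to two line-aligned UTF-8 text files."""
    with open(src_path, "w", encoding="utf-8") as src_file, \
            open(tgt_path, "w", encoding="utf-8") as tgt_file:
        for src, tgt in pairs:
            src_file.write(src + "\n")
            tgt_file.write(tgt + "\n")


# Illustrative sample with one incomplete unit that gets skipped.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en-US"/>
  <body>
    <tu>
      <tuv xml:lang="en-US"><seg>Hello world</seg></tuv>
      <tuv xml:lang="de-DE"><seg>Hallo Welt</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en-US"><seg>Untranslated</seg></tuv>
    </tu>
    <tu>
      <tuv xml:lang="en-US"><seg>Save file</seg></tuv>
      <tuv xml:lang="de-DE"><seg>Datei speichern</seg></tuv>
    </tu>
  </body>
</tmx>"""

sample_pairs = extract_pairs(ET.fromstring(SAMPLE_TMX), "en", "de")
print(len(sample_pairs), "aligned pairs")
```

Calling `write_parallel_files(sample_pairs, "source.txt", "target.txt")` (hypothetical file names) would then produce the two files to upload.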

SafeTex commented 10 months ago

Hello HMueller007 and all

What I sometimes do to get around this is to import a simple two-column TB (glossary) into memoQ for the same job as the translation job I'm doing and then export all that to the TB for the same job. The segments are small, of course, but they are very relevant to the job. As Opus does not have any TB function at present to instruct the MT engine, this feels like an intuitive way to proceed. This often gets the TB to exceed the minimum number of segments setting.

HMueller007 commented 10 months ago

@SafeTex That's a good tip, will try this, thanks.