Helsinki-NLP / OPUS-CAT

OPUS-CAT is a collection of software that makes it possible to use OPUS-MT neural machine translation models in professional translation. OPUS-CAT includes a local offline MT engine and a collection of CAT tool plugins.
MIT License

Regex rules and rule collections (including global rules), imports and exports #69

Open SafeTex opened 1 year ago

SafeTex commented 1 year ago

Hello Tommi and all

I have some questions about good practice for managing regex rules. Below are some observations, phrased as questions. Please tell me whether I'm right or wrong, whether there is a better method, or whether I have misunderstood something.

1) I should NOT create them with a trained MT engine, as I have to retrain such engines after every job and I risk losing the rules when I delete the old trained engine?

2) So I should perhaps save them with an installed model that I then use for training, but would they be automatically transferred to the new trained MT engine?

3) I started to play around with all this and exported a few rules and collections to see if I could reimport them into another trained MT engine. Here I got a real shock: when I looked at the collections/rules that I had saved to my special folder, they had names like "63e8e851-3936-49f8-9c3d-125476b5033e.yml". Notepad++ does not seem to like opening them either. It takes a long time, but once a file is open I can see what it contains. On reimport, though, I have to remember or note down what all these files contain and then hunt for the right one(s). This was mind-blowing, and I only had a few collections/rules on my desktop. Furthermore, there was no clear sign in the file as to which installed model the rule/collection had come from (French or Swedish to English). So I guess that I should always put this in the regex rule/collection name, and perhaps save them to different sub-folders in future.
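As a stopgap for the opaque file names described in point 3, a small script could build a plain-text index of an export folder so the right file can be found without opening each one. This is only a sketch and not part of OPUS-CAT: the function names `preview_exports` and `write_index` and the `INDEX.txt` file name are made up for illustration, and because the layout of the exported `.yml` files is not documented in this thread, the script assumes nothing about it and simply copies the first few non-empty lines of each file.

```python
from pathlib import Path


def preview_exports(folder, max_lines=5):
    """Map each exported .yml file name to its first few non-empty lines.

    No YAML schema is assumed: the files are read as plain text.
    """
    previews = {}
    for path in sorted(Path(folder).glob("*.yml")):
        lines = []
        with path.open(encoding="utf-8", errors="replace") as handle:
            for raw in handle:
                text = raw.strip()
                if text:
                    lines.append(text)
                if len(lines) >= max_lines:
                    break
        previews[path.name] = lines
    return previews


def write_index(folder, index_name="INDEX.txt", max_lines=5):
    """Write the previews to a text file next to the exports (hypothetical name)."""
    folder = Path(folder)
    parts = []
    for name, lines in preview_exports(folder, max_lines).items():
        parts.append(name)
        parts.extend("    " + line for line in lines)
        parts.append("")  # blank line between entries
    index_path = folder / index_name
    index_path.write_text("\n".join(parts), encoding="utf-8")
    return index_path
```

Running `write_index` on one sub-folder per language pair (for example one for French and one for Swedish) would also record which model a rule belongs to, which the exported files themselves apparently do not.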

To sum up, this was just a first test run and I hope to get better, but I can see that creating rules and collections might need careful planning. I'm particularly concerned about the long, opaque names that individual rules and collections are given, which make reimport very confusing.

Any advice please?

Thanks in advance

Dave Neve

TommiNieminen commented 1 year ago

Those are good points, I'll have to make the system more friendly with descriptive names.

SafeTex commented 1 year ago

Thanks Tommi

I think I can get around the other problems too if we can add to or create the names. Then we can add stuff like FR (French) and SE (Swedish), as some rules have exactly the same description because they do the same thing, but the regexes themselves are different due to the layout of the languages (such as with numbers).

Do you think this enhancement will be ready by next Monday? (English humour 😂😂😂)

Have a nice weekend

Dave Neve

martinengelke commented 1 month ago

Any progress?