nstjhp opened 1 week ago

Hi - I am excited to try your program, as it was recommended to me by a university professor for a fine-tuning project I want to work on.

I am using Linux v1.3.1-beta and I want to translate Chinese to English. Hopefully I installed everything correctly, but after I put in some text to test it, I immediately get the error in red:

[screenshot: error message]

Here should be the relevant bit of the log file in ~/.local/share/opuscat/logs:

[log excerpt]

Thank you for any ideas on a fix!
Hi,
The problem seems to be this particular model, opusTCv20210807+nopar+ft95-sepvoc_transformer-small-align2023-03-16. It appears to be a transformer-tiny model, and I haven't tested those with OPUS-CAT (it looks like they don't work). Normally those models are hidden in the model downloader, but this model is for some reason named transformer-small instead of transformer-tiny, so it's not picked up by the hiding function. The performance of the tiny models is usually pretty bad, so adding support for them is not a priority.
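(For context, a minimal sketch of the kind of name-based filter described above; the model list and filtering logic here are hypothetical illustrations, not OPUS-CAT's actual code. It shows how a tiny model labeled "transformer-small" slips past a substring check:)

```python
# Hypothetical sketch of a substring-based model filter, not OPUS-CAT's code.
# Models are hidden if "transformer-tiny" appears in their name, so a tiny
# model mislabeled "transformer-small" is never filtered out.

model_names = [
    "opus+bt-2021-04-30",                       # a regular model
    "opusTCv20210807_transformer-tiny-align",   # hypothetical tiny model: hidden
    # Actually a tiny model, but the name says "small", so it slips through:
    "opusTCv20210807+nopar+ft95-sepvoc_transformer-small-align2023-03-16",
]

visible = [name for name in model_names if "transformer-tiny" not in name]
print(visible)  # the mislabeled tiny model is still listed
```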
Could you test with another model to see if it works? For instance, the one selected below (opus+bt-2021-04-30):

[screenshot: model downloader with opus+bt-2021-04-30 selected]
Thanks for the swift reply - yes, no error with the model you recommended! Unfortunately the translation quality looks pretty poor, but I'll see if fine-tuning helps. Cheers
(I'm reopening this as an issue about the low quality of the Chinese models.)
Would you mind adding a few comments on the types of errors you see with the Chinese model? We don't have the resources to properly validate the quality of all the models, especially for non-European languages and for languages for which there are no widely available public test sets that can be used for comparison.
Currently all the Chinese models are multilingual, and this might have some effect on the quality. I wonder if training a bilingual Mandarin-to-English model would help (although maybe not, since the data in the multilingual models is already overwhelmingly Mandarin). Also, the bulk of the data we have for training Chinese models (https://opus.nlpl.eu/results/zh&en/corpus-result-table) is from the CCMatrix corpus, which is crawled data and has lots of problems. There's a decent amount of UN data, which might be of better quality, so oversampling that might help.
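(As a rough illustration of what oversampling would mean here: when concatenating the corpora into one training file, the cleaner corpus is simply repeated several times so the trainer sees it more often. A minimal sketch with hypothetical file names and an arbitrary repeat factor:)

```python
# Minimal sketch of corpus oversampling when assembling MT training data.
# File names and the oversampling factor are hypothetical.

corpora = [
    ("ccmatrix.zh-en.tsv", 1),  # large but noisy crawled data: include once
    ("unpc.zh-en.tsv", 4),      # smaller but likely cleaner UN data: repeat 4x
]

with open("train.zh-en.tsv", "w", encoding="utf-8") as out:
    for path, factor in corpora:
        for _ in range(factor):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    out.write(line)
```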
If you see any improvement with fine-tuning, I'd appreciate it if you could mention it in this thread.
Sure, I can try to help. I tried fine-tuning with a dataset I made from a side-by-side ZH-EN fan translation of a Chinese book that was never fully translated (unfortunately for me). It's not exactly sentence-for-sentence; it's more dynamic equivalence. But the translator was thorough and, I think, copied the paragraph structure of the original, so I was able to do the matching, if not per sentence, then per paragraph. This dataset is around 7,700 lines, or 290,000 English words / 400,000 Chinese characters, which is quite small compared with the others in your resources.
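(A minimal sketch of this kind of paragraph-level alignment, assuming both files use blank lines between paragraphs and keep the same paragraph count; the file names are hypothetical, and this is not the exact script used:)

```python
# Minimal sketch of paragraph-level alignment between a Chinese original
# and its English fan translation. Assumes blank-line-separated paragraphs
# and a one-to-one paragraph correspondence; file names are hypothetical.

def read_paragraphs(path):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    return [p.strip().replace("\n", " ") for p in text.split("\n\n") if p.strip()]

zh = read_paragraphs("book.zh.txt")
en = read_paragraphs("book.en.txt")
assert len(zh) == len(en), "paragraph counts differ - fix the source files first"

# One tab-separated source/target pair per line, ready for fine-tuning tools
with open("finetune.zh-en.tsv", "w", encoding="utf-8") as out:
    for src, tgt in zip(zh, en):
        out.write(f"{src}\t{tgt}\n")
```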
Now I have tested on new text from another fan translation of a different book (one can find the ZH of that text here and that fan translation here). I regard the fan translations mentioned as the gold standard.
Here are the first few sentences with the default model:

[screenshot: default model output]

As said, this is pretty poor; even whole passages are missing! Now the fine-tuned version:

[screenshot: fine-tuned model output]

At least there are no missing sections!
Here is another example, from the 5th/6th paragraphs:

[screenshot: default model output]

compared with

[screenshot: fine-tuned model output]
Again the fine-tuned one is complete, although the understandability is clearly worse than even a Google Translate of the Chinese page. I am surprised at the improvement from what I think is a rather limited fine-tuning dataset, although it is very much in the same domain (wuxia novels). If there are formal quality metrics I should report, or if you need more info, let me know.
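(On formal metrics: the usual corpus-level scores for this kind of comparison are BLEU and chrF, which the sacrebleu Python package computes from a file of system outputs and a line-aligned file of reference translations. A minimal sketch, with hypothetical file names:)

```python
# Minimal sketch of corpus-level MT evaluation with sacrebleu
# (pip install sacrebleu). File names are hypothetical: one segment
# per line, system outputs and references line-aligned.

import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]

hyps = read_lines("system_output.en.txt")    # model translations
refs = read_lines("fan_translation.en.txt")  # gold-standard reference

bleu = sacrebleu.corpus_bleu(hyps, [refs])
chrf = sacrebleu.corpus_chrf(hyps, [refs])
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```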
Thanks for pointing this out. Something must be wrong with that base model; those omissions should not happen. This is probably due to some corruption in the training data, which is then overridden by the fine-tuning, and that would explain the big quality change. It also looks like the OpenSubtitles corpus, which has a lot of Chinese in it, might not have been used in training the models. I will try to investigate this further when I have time; having decent Chinese models would be nice.
Mozilla is also training models that could potentially be deployed with OPUS-CAT, and they have a Chinese-English model under development: https://github.com/mozilla/firefox-translations-models/tree/main/models/dev/zhen. So rather than retraining the zh-en model, it might make more sense to add the Mozilla models to OPUS-CAT, if the Mozilla model turns out to be clearly better.
How easy is it to add the Mozilla model, or any other, to OPUS-CAT? I am happy to try the same testing examples with their dev model, or with the production one when that happens. I don't know enough about formally evaluating the models, but I imagine that if the OPUS zh-en model is corrupted somehow, and maybe missing a large dataset in its training, then it could still be worth retraining it and comparing it (but how?) with the Mozilla model. Cheers
The Mozilla models are Marian models like the other ones in OPUS-CAT, so it's not that difficult, but it still requires a bunch of changes to the downloader etc. I don't expect to have time to work on it until next year, when I will be changing the model downloader in any case.
If you want to test the Mozilla model, it is supposed to be available in the Firefox Nightly build (https://www.mozilla.org/en-US/firefox/133.0a1/releasenotes/), but I couldn't get it working myself.
Understood. If you get time next year to work on it and update this issue, I will try to compare them.

Thanks for the tip - I don't suppose it will work for me if it didn't for you, but I'll give it a go!