digitalfabrik / integreat-cms

Simplified content management back end for the Integreat App - a multilingual information platform for newcomers
https://digitalfabrik.github.io/integreat-cms/
Apache License 2.0

Test whether DeepL/Google translate known examples in expected manner daily #3158

Open PeterNerlich opened 2 weeks ago

PeterNerlich commented 2 weeks ago

This should not be implemented as part of integreat_cms, but rather in our server infrastructure. However, developing the proposed tests requires insight into typical translation content, so I am borrowing this issue tracker.

Motivation

Both DeepL and Google currently serve corrupted translations for various source/target language pairs, for example by moving phrases marked with translate="no" to the start of the sentence rather than leaving them where they make sense. This is not something we can influence.
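To make the failure mode concrete, here is a minimal sketch of how a displaced translate="no" phrase could be detected. It compares the relative position of each protected phrase in the source and in the translation. The names `NoTranslateLocator`, `relative_positions`, `displaced_phrases` and the `tolerance` value are hypothetical and not part of integreat_cms. Word order legitimately differs between languages, so this is only a heuristic and the threshold would need tuning per language pair.

```python
from html.parser import HTMLParser


class NoTranslateLocator(HTMLParser):
    """Collects the plain text and records where translate="no" spans start.

    Assumption: protected spans contain no void elements such as <br>,
    which would confuse the simple depth counter used here.
    """

    def __init__(self):
        super().__init__()
        self.text = ""
        self.spans = []   # (phrase, start offset within self.text)
        self._depth = 0   # nesting depth inside a translate="no" element
        self._start = 0

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1
        elif dict(attrs).get("translate") == "no":
            self._depth = 1
            self._start = len(self.text)

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.spans.append((self.text[self._start:], self._start))

    def handle_data(self, data):
        self.text += data


def relative_positions(html):
    """Map each protected phrase to its start position as a fraction of the text."""
    parser = NoTranslateLocator()
    parser.feed(html)
    parser.close()
    length = max(len(parser.text), 1)
    return {phrase: start / length for phrase, start in parser.spans}


def displaced_phrases(source_html, translated_html, tolerance=0.3):
    """Report protected phrases that are missing or moved further than `tolerance`."""
    src = relative_positions(source_html)
    dst = relative_positions(translated_html)
    problems = []
    for phrase, src_pos in src.items():
        if phrase not in dst:
            problems.append((phrase, "missing"))
        elif abs(dst[phrase] - src_pos) > tolerance:
            problems.append((phrase, "moved"))
    return problems
```

A daily job could run `displaced_phrases` on every known example and its fresh translation, and report any non-empty result.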

Proposed Solution

Add a daily cron job to

In order to accomplish that, these known examples have to be defined. They should include all important features that we expect in content translations, in enough variations and redundancy that we can be reasonably confident in the result of the reports. I imagine something like:

This assumes that we can decide reasonably well whether a deviation has occurred. We will likely need some sort of fuzzy matching, as we might not be able to enumerate every string the API could return for a known example that we would still regard as valid. If no such automated decision algorithm can be found, it might be good enough to post each example together with its translation in Mattermost, perhaps along with the translation sent through the system again and translated back into the source language, and have a human check for deviations every day. This can be sped up by saving results that have already been reviewed and no longer reporting translations previously marked as good whenever the API produces them again.
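One possible shape for this decision step, as a rough sketch only: exact matches against previously reviewed results are settled immediately, near matches are accepted via fuzzy matching, and everything else goes to a human. It uses `difflib.SequenceMatcher` from the Python standard library. The function names, the `accepted`/`rejected` collections and the `threshold` of 0.9 are assumptions for illustration. In practice the reviewed results would have to be persisted somewhere between runs.

```python
from difflib import SequenceMatcher


def best_similarity(candidate, accepted):
    """Return the highest similarity ratio (0.0 to 1.0) against the accepted translations."""
    return max(
        (SequenceMatcher(None, candidate, ref).ratio() for ref in accepted),
        default=0.0,
    )


def classify(candidate, accepted, rejected=(), threshold=0.9):
    """Sort an API result into 'ok', 'bad' or 'needs_review'."""
    if candidate in accepted:
        return "ok"            # a human already approved exactly this string
    if candidate in rejected:
        return "bad"           # a human already flagged exactly this string
    if best_similarity(candidate, accepted) >= threshold:
        return "ok"            # close enough to an approved variant
    return "needs_review"      # unknown result: post it for a human to check
```

Only results classified as `needs_review` would need to be posted for manual checking. Once reviewed, they would be added to the accepted or rejected collection so that the same string is not reported again. A character-based ratio can still miss small but meaningful changes, so the threshold is a trade-off between noise and missed regressions.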

Alternatives

User Story

As a service provider I want to know about quirks and problems of my upstream translation services rather quickly so that I can give suggestions to my clients, or at least not have to find out about quirks at a press conference.

Additional Context

https://github.com/digitalfabrik/integreat-cms/pull/3135#issuecomment-2424968026

#3157

jarlhengstmengel commented 1 week ago

I think this is a very interesting topic with the potential to blow up a bit. As you write, LLMs are not as deterministic in their results/predictions as would be practical for testing their consistency with regard to our use cases. I'd love to discuss this in more depth.