WeblateOrg / weblate

Web based localization tool with tight version control integration.
https://weblate.org/
GNU General Public License v3.0

Improve translation memory management #7346

Open rhofer opened 2 years ago

rhofer commented 2 years ago

Situation

In our self-hosted Weblate setup, translation components are onboarded permanently. To benefit from cross-component and cross-project suggestions, e.g. under "Automatic suggestions", both default machineries are activated (Weblate: live component look-up; Weblate Translation Memory: TM look-up).

When translating, one essential aspect is harmonizing the terminology used across components or even across projects. For example, we may start with a first translation and later need to revise it for the sake of harmonized terminology. As a result, the TM accumulates the old, obsolete strings alongside the latest, harmonized and approved string.

Over time, this leads to a polluted TM, where the Weblate Translation Memory machinery returns outdated, obsolete or (by now) even forbidden strings. In the "Automatic suggestions" tab, for example, the suggestions have become a mix of old TM results, latest TM results and live results from active components.

This heavily confuses translators and even leads to mistakes, because translators pick outdated or even forbidden terms from the TM.

Goal

As a translator, I don't want to see the history of the translation memory. More specifically, I only want to see automatic suggestions based on texts that are currently valid and approved translations.

Problem

Today, Weblate provides no means to clean up TMs, manually or automatically, to get rid of "old" entries and so avoid translation mistakes when translators rely on TM results. With translations happening continuously and today's automatic enrichment of the TM, the pollution of the TM keeps growing.

This issue collects options for improving TM management. To make a specific option implementable, it is or will be carved out into its own individual issue.

Option 1: improve global TM management - delete and recreate

This option is described in https://github.com/WeblateOrg/weblate/issues/7347

This would be very helpful for counteracting TM pollution manually, as a kind of "mass operation" where the clean-up happens on whole TM scopes rather than on individual strings.

Option 2: individual deletion of TM entries

This option is described in https://github.com/WeblateOrg/weblate/issues/6440

This is helpful in selected situations. For my current situation, Option 1 would be sufficient.

Option 3: automatic TM maintenance based on review state

This option has not yet been put into an individual issue, since it requires discussion first.

Excluded:

Affects:

Pros

Cons

Option 4: provide option to switch off automatic TM enrichment

This option is described in https://github.com/WeblateOrg/weblate/issues/7348

In our use case, we primarily build on the Weblate machinery, which provides a live look-up across all connected components. Once this option is available, we would switch off automatic TM enrichment.

Remark on all options

All of this only affects the automatic enrichment of TMs. What shall still be possible (as it is today):

Preferred solution approach

For our use case, we need the following options as the best fit for a next improvement step:

github-actions[bot] commented 2 years ago

This issue has been put aside. It is currently unclear if it will ever be implemented as it seems to cover too narrow of a use case or doesn't seem to fit into Weblate.

Please try to clarify the use case or consider proposing something more generic to make it useful to more users.

rhofer commented 2 years ago

/cc @KadAnna

ilocit commented 2 years ago

@nijel I can completely relate to what @rhofer is describing here. We really love Weblate, but the TM data "pollution" is just very high and there is no feasible way to maintain the TMs. It would be important to have such means so that Weblate can be used in a more reliable way. We are also struggling with this.

luebbe commented 2 years ago

Issue #6050 also indirectly requests improvements to TM handling

orangesunny commented 2 years ago

Option 4: provide option to switch off automatic TM enrichment

Is halfway done. @rhofer would like to keep manually uploaded TMX files as a source of automatic suggestions, even when the TM is off for the project.

It would be nice not only to turn off the Weblate translation memory completely, but also to make it configurable which TMs of other projects I want to use. Also, an option to turn off only the enrichment by the project, while keeping the TM available as a source, would be great.

Let’s talk about how it should work.

ilocit commented 2 years ago

General remark:

In professional CAT tools, there are usually various options to manage the population of TMs.

  1. You only store translations in a temporary TM (mostly called a Project TM) until the component and/or project is final.
  2. Only when it is completed does a coordinator decide what to do with the final translations (write to customer TM, Global TM, ...).
  3. And within (2.) she has the option to select Overwrite, Merge, or Create New entries in the "Master TM".

In 2018 there was a nice "open-sourced" initiative in which some companies sat down to draw up TM management best practices. See https://github.com/GILT-Forum/TM-Mgmt-Best-Practices/blob/master/best-practices.md. Maybe it is a source of ideas on how to enhance TM management in Weblate.

Keeping the TM clean is key to good translation output (and also to training MT!)

rhofer commented 2 years ago

@ilocit many thanks for your input and the link. I wasn't aware of this. The .../best-practices.md is definitely worth reading.

Regarding your points 1) to 3): keeping such things in mind as a vision of where to arrive professionally in the area of TM management is a great thing. Nevertheless, every small step that provides more TM management capabilities is heavily appreciated. ... I don't think going for the vision directly, in one shot, is feasible.

ilocit commented 2 years ago

Yes, small steps are much appreciated, but with a vision in mind as the direction in which the journey is heading. I think it might be worthwhile to think about the vision first and get things straight. Maybe we, Neil, others don't want our / my vision to be their vision. ;-)

As you already suggested during our last 1:1, @rhofer, maybe a Weblate UG meet-up would be a nice idea! :-)

YannZeRookie commented 1 year ago

I completely concur with @rhofer's description of the problem. You just cannot work with "left-overs" that pollute your knowledge base. I just encountered the problem on our self-hosted Weblate server: it kept suggesting entries that belonged to an obsolete Component that had been removed, even though I had specifically deleted the related TM entry in the TM Manager (/memory/ page), which seems like a bug IMO.

Here is how I was able to resolve the problem: using the GET /api/memory/ endpoint, I downloaded the list of all TM entries and collected the ids of the obsolete ones using a filtering criterion. Then I used the DELETE /api/memory/(int:memory_object_id)/ endpoint to remove them one by one.
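For anyone who wants to script the same workaround, here is a minimal sketch in Python, not the exact script used above. It assumes the list endpoint returns standard paginated JSON (`results` plus a `next` link) and that the API accepts an `Authorization: Token ...` header. The server URL, the token, and the `origin` field used for filtering in the usage note are placeholders or assumptions, so check what your own `/api/memory/` response contains before deleting anything.

```python
import json
import urllib.request

BASE_URL = "https://weblate.example.com"  # hypothetical server URL
API_TOKEN = "replace-me"                  # hypothetical API token


def http_json(method, url, token=API_TOKEN):
    """Send one authenticated request; return parsed JSON, or None for an empty body."""
    request = urllib.request.Request(
        url,
        method=method,
        headers={"Authorization": f"Token {token}", "Accept": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = response.read()
    return json.loads(body) if body else None


def iter_memory(send=http_json, base_url=BASE_URL):
    """Yield every TM entry from GET /api/memory/, following the `next` links."""
    url = f"{base_url}/api/memory/"
    while url:
        page = send("GET", url)
        yield from page["results"]
        url = page.get("next")


def obsolete_ids(entries, is_obsolete):
    """Collect the ids of all entries matching the caller's filtering criterion."""
    return [entry["id"] for entry in entries if is_obsolete(entry)]


def delete_entries(ids, send=http_json, base_url=BASE_URL, dry_run=True):
    """Delete entries one by one; with dry_run=True nothing is sent to the server."""
    handled = []
    for entry_id in ids:
        if not dry_run:
            send("DELETE", f"{base_url}/api/memory/{entry_id}/")
        handled.append(entry_id)
    return handled
```

A run could look like `ids = obsolete_ids(iter_memory(), lambda e: e.get("origin") == "myproject/old-component")` followed by `delete_entries(ids, dry_run=False)`, where the origin value is made up. Keeping `dry_run=True` as the default lets you review the id list first, since deletions cannot be undone from the script.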

This was battlefield medicine, but it worked great. I hope this approach helps someone until we have a working Delete button in the TM Manager.

nijel commented 1 year ago

The per-component or per-project delete and re-create is there (see https://github.com/WeblateOrg/weblate/issues/7347). Individual entries can also be deleted (see https://github.com/WeblateOrg/weblate/issues/6440). If anything is broken in these, please open a separate issue so that we can take a look.