EticaAI / hxltm

HXLTM - Multilingual Terminology in Humanitarian Language Exchange. TBX, TMX, XLIFF, UTX, XML, CSV, Excel XLSX, Google Sheets, (...)
https://hxltm.etica.ai
The Unlicense

Potential guidelines to deal with source terms for new translations in case of proofreading/terminology review based on new evidence when compiling multilingual terminology. #12

Open fititnt opened 2 years ago

fititnt commented 2 years ago

To keep the documentation simple, I think we could create a convention for language attributes that marks source terms as proofread (or terminologically reviewed based on translators' feedback). It would be optimized for cases where the source terms can no longer be changed, even if the creators of the initial terms would accept the complaints as valid. Reference [2], for example, mentions that the UN has a strong system in place to review texts (even before they are passed to translators), which is very different from the situation in [3].

I'm not sure how many types of attributes we would create, but at least two: one for ordinary proofreading (for example, work done by software developers, or by people copying and pasting from other references that may already be wrong), and another for cases where the source material has enough context to explain what the concept means, but the exact term in that language is likely to produce wrong translations.

Considering that we now encode English with an approach similar to BCP 47 (but with ISO 639-3 as the baseline and a non-optional ISO 15924 script), the source terms could still be eng-Latn, with a variant "eng-Latn-x-term1234", where term1234 identifies the variant. When exporting translation jobs, a person could then export "eng-Latn-x-term1234" and, for terms where it is not found, fall back to the base eng-Latn. Alternatively, the XLIFF files (or spreadsheets) given to translators could already differentiate the official term from the reviewed term.
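The fallback from the reviewed variant to the base tag could look roughly like the sketch below. This is only an illustration: the in-memory layout (concept code, then language tag, then term), the function names, the concept codes and the sample terms are all assumptions, not something HXLTM defines.

```python
from typing import Dict, Optional

# Hypothetical layout: concept code -> {language tag -> term}.
Glossary = Dict[str, Dict[str, str]]


def pick_source_term(terms: Dict[str, str], base: str, variant: str) -> Optional[str]:
    """Return the reviewed variant if present, else the base term, else None."""
    reviewed_tag = f"{base}-x-{variant}"
    if reviewed_tag in terms:
        return terms[reviewed_tag]
    return terms.get(base)


def export_source_terms(glossary: Glossary, base: str, variant: str) -> Dict[str, str]:
    """Build the source column of a translation job, one term per concept."""
    job = {}
    for concept, terms in glossary.items():
        term = pick_source_term(terms, base, variant)
        if term is not None:
            job[concept] = term
    return job


# Illustrative data only: C1 has a reviewed variant, C2 does not.
glossary = {
    "C1": {"eng-Latn": "gender", "eng-Latn-x-term1234": "sex or gender"},
    "C2": {"eng-Latn": "given name"},
}
job = export_source_terms(glossary, "eng-Latn", "term1234")
print(job)  # {'C1': 'sex or gender', 'C2': 'given name'}
```

Concepts without a reviewed variant keep the official term, so the job handed to translators stays complete.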

Examples of use case

Core-Person-Vocab head term "gender" (with definition that mix two concepts)

See also comment https://github.com/SEMICeu/CPOV/issues/12#issuecomment-858667304.

(Screenshot from 2021-11-29 20-06-22)

Note: from the point of view of terminology, the problem is that the term is already defined so generically (especially considering that Core-Person-Vocabulary was supposed to be a planned controlled vocabulary). Even if it is intentionally ambiguous, the way to design the head term in English would be a composed term with "OR", as in "Sex or Gender" or "Biological Sex or Gender Identity". Taking the head term from one of the "sub concepts" and attaching a definition of both concepts causes even more confusion.

For example, this table by HL7, https://confluence.hl7.org/display/VOC/Gender+Coding+with+International+Data+Exchange+Standards (https://archive.ph/VQR42), already uses the term "gender" in English while in German the term is "Biologisches Geschlecht".

(Screenshot from 2021-11-29 21-00-31)

Note that the old version 1.00 of CPV also uses the tables from ISO/IEC 5218:2004, so if the compilers of the HL7 table tried to find the best English term, following CPV 1.00 would lead them to use "gender". Add to this that [1] (2017, Interaction of law and language in the EU: Challenges of translating in multilingual environment, 17 pages) already mentions that English as a working language is more ambiguous than German or French, so this is not an isolated case.

Examples of conflicting issues with the head term "gender"

(Screenshot from 2021-11-29 20-31-30)

"Gender interacts with but is different from sex, which refers to the different biological and physiological characteristics of females, males and intersex persons, such as chromosomes, hormones and reproductive organs. Gender and sex are related to but different from gender identity." -- WHO

Already in English, most references on the definition used in the preview of Person-Specification, and not just WHO, would disagree with putting the same short head word on both concepts. Moreover, most de facto values used for these fields are strongly related to "biological sex" (which already has its own terminology).

Quick comments on potential strategies to document changes to source terms before preparing translation jobs

Based on another job we're doing to compile HXLTM, TICO-19 (see https://tico-19.github.io/ and https://github.com/EticaAI/tico-19-hxltm): in this podcast, https://www.stitcher.com/show/the-global-podcast-2/episode/episode-14-twb-and-tico-19-project-80576088, the TICO-19 members mention that using translated versions as the source, instead of going directly from English, is already relevant. Considering both their Spanish and French versions, I somewhat agree that the translations seem to be less literal than the English source.

But here is one thing: even for TICO-19 (which is very different from CPV, since it was a project driven by urgency), one alternative could also be a proofread version of the English source.

I think that for urgent projects like TICO-19, if we add some feature to label alternative versions of a source term, whoever is preparing the work to distribute for new translations could have more freedom and optimize for speed (dozens of terms, and days, if not hours, to take action). In the case of Core-Person-Vocab, both because there are fewer terms and because more planning is involved, if we document an additional attribute to justify the change to a new variant of a source term, we may also document that this needs more metadata (for example, organizations like WHO that could back up feedback from translators that the source terms are not aligned with the definitions).

fititnt commented 2 years ago

RFC 6497, https://datatracker.ietf.org/doc/html/rfc6497 BCP 47 Extension T - Transformed Content

This document specifies an Extension to BCP 47 that provides subtags for specifying the source language or script of transformed content, including content that has been transliterated, transcribed, or translated, or in some other way influenced by the source. It also provides for additional information used for identification.

Both for initiatives like TICO-19 (which, based on what they explained about using more than one source language, would need an expressive way to store which source language, or exactly which translation, was used as the source for new versions) and for ad hoc variants of source terms, something like RFC 6497 can be relevant. It may not be easy to make transparent to users, but an expressive way to describe the source of a translation does exist.

  For example:
   +---------------------+---------------------------------------------+
   | Language Tag        | Description                                 |
   +---------------------+---------------------------------------------+
   | ja-t-it             | The content is Japanese, transformed from   |
   |                     | Italian.                                    |
   | ja-Kana-t-it        | The content is Japanese Katakana,           |
   |                     | transformed from Italian.                   |
   | und-Latn-t-und-cyrl | The content is in the Latin script,         |
   |                     | transformed from the Cyrillic script.       |
   +---------------------+---------------------------------------------+

In our case, a translation into Spanish from English (when we really want to record what the source was) would be spa-Latn-t-eng-Latn, but if the source was eng-Latn-x-term1234, the tag would be spa-Latn-t-eng-Latn-x-term1234. There is also the possibility of translations of translations and, depending on the route, this could mean a different target translation.
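As a rough sketch of how such tags could be composed and split, assuming the convention from this thread (three-letter language plus script, optional "-t-" source, optional "-x-" variant label). The `TermTag` type and both functions are hypothetical names for illustration; this is plain string handling, not a full BCP 47 / RFC 6497 parser, and it does not model translation-of-translation routes.

```python
from typing import NamedTuple, Optional


class TermTag(NamedTuple):
    language: str                  # e.g. "spa-Latn"
    source: Optional[str] = None   # e.g. "eng-Latn", the tag it was translated from
    variant: Optional[str] = None  # e.g. "term1234", the private-use variant label


def format_tag(tag: TermTag) -> str:
    """Join the parts in the order language, -t- source, -x- variant."""
    parts = [tag.language]
    if tag.source:
        parts += ["t", tag.source]
    if tag.variant:
        parts += ["x", tag.variant]
    return "-".join(parts)


def parse_tag(text: str) -> TermTag:
    """Split a tag on the first "-x-" and then the first "-t-"."""
    variant = None
    if "-x-" in text:
        text, variant = text.split("-x-", 1)
    source = None
    if "-t-" in text:
        text, source = text.split("-t-", 1)
    return TermTag(text, source, variant)


tag = format_tag(TermTag("spa-Latn", "eng-Latn", "term1234"))
print(tag)             # spa-Latn-t-eng-Latn-x-term1234
print(parse_tag(tag))  # TermTag(language='spa-Latn', source='eng-Latn', variant='term1234')
```

Keeping the three parts separate in memory and only joining them on export means the spreadsheet column header stays compact while tools can still ask "what was the source?".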

There are some corner cases, but since proofreading/review is likely to cover only part of the terms, some way to track what the source terms were could be relevant. For example, if the proposed term was later found not to be good, only that part of the translations could be invalidated. Another reason is to potentially allow exporting both versions for translation, not as a text annotation (an XLIFF comment) but as two different human translations. This situation is so specific, though, that instead of creating a new column, whoever is preparing the translations could manually copy and paste the entire concept and use a different concept code.

I understand that this level of tracking how terms are generated is overly technical. But since HXLTM (in its spreadsheet-like form, not TBX, which could keep track of more data but is less supported by tools) already needs to be self-sufficient without complex frontends, careful labeling of the codes could make it much better than how most tools encode terms today.

Also, I'm almost sure that even the TBX validator (https://www.tbxinfo.net/tbx-dialects/?id=4) would consider the BCP 47 Extension T style an invalid language, so if even the validator lacks this feature, it is very unlikely that other tools would support this compact notation.
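The partial invalidation mentioned above could be sketched like this. It assumes that translations record their source variant in the tag (as in spa-Latn-t-eng-Latn-x-term1234); the function name, the data layout and the placeholder terms are hypothetical.

```python
from typing import Dict, List, Tuple


def translations_to_invalidate(
    glossary: Dict[str, Dict[str, str]], rejected_variant: str
) -> List[Tuple[str, str]]:
    """List (concept code, language tag) pairs derived from the rejected variant."""
    suffix = f"-x-{rejected_variant}"
    stale = []
    for concept, terms in glossary.items():
        for tag in terms:
            # Only tags carrying "-t-" are translations; the source variant
            # itself (e.g. "eng-Latn-x-term1234") is left for a human to decide.
            if "-t-" in tag and tag.endswith(suffix):
                stale.append((concept, tag))
    return stale


# Illustrative data only; the Spanish entries are placeholders, not real terms.
tracked = {
    "C1": {
        "eng-Latn": "gender",
        "eng-Latn-x-term1234": "sex or gender",
        "spa-Latn-t-eng-Latn": "(Spanish term from official source)",
        "spa-Latn-t-eng-Latn-x-term1234": "(Spanish term from reviewed source)",
    },
}
stale = translations_to_invalidate(tracked, "term1234")
print(stale)  # [('C1', 'spa-Latn-t-eng-Latn-x-term1234')]
```

Translations made from the official eng-Latn term are untouched, which is the point of tracking the source per term rather than per file.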