UNMigration / HTCDS

Human Trafficking Case Data Standard

Concerns about authoritative versions in Arabic (macrolanguage), Chinese (macrolanguage), French, German, Spanish and Russian from the English version of HTCDS 1.0 #23

Open fititnt opened 2 years ago

fititnt commented 2 years ago

TL;DR:

  1. The end of this document has some strong suggestions. In short:
    1. Have a neutral person/group already inside the UN to oversee interactions between the HTCDS team and the translators (this does not need to be open).
    2. The file with the result of the translations, plus additional metadata, should be exported. This means both information that allows immediate reuse of HTCDS terminology in glossaries and extra information explaining how it was created. (This end result needs to be open.)
  2. If HTCDS has both terms and definitions exported to the UN working languages, this is a very, very big thing.
    1. Even if the result is provisional, it is still a big thing. Most content related to software tends to be done either in English or, at best, in French. The software experts at the UN don't talk with the UN translators.
    2. The other big thing is an open license (but I will not repeat that here), because most projects started at the UN, unless they are a core feature for the sponsor, tend to never be updated again and become "orphan works" within 5, at most 10, years.
  3. The proposed suggestions of an internal person/group to oversee the process, plus the final file, are both somewhat based on:
    1. The fact that the current copyright holder, UN IOM (because it is essential to have the immunity of an IGO), can be complicated both for external collaborators (who can be ignored) and for internal employees/contractors (who could be punished). The proposed suggestions, in our opinion, at this moment seem to be one way that still meets acceptable open standards while allowing UN IOM to keep the final decision, protecting its collaborators' independence, in particular where there is no strong technical motivation (like linguistic viability).
    2. Naming conventions about human trafficking are much less likely to be disputed by Member States (unlike names of territories in UN M49, endorsed by UN Member States), while still being a linguistic issue (likely a mix of expert opinion plus language regulators). This extra metadata could be used to receive feedback from these external organizations (including, when necessary, introducing new terms to languages and ensuring consistency).
    3. The need for multilingual controlled vocabularies, usable both for technology (like data forms and spreadsheets) and for glossaries, is a constant urgent need, including for the humanitarian sector and other UN agencies. The HTCDS is just one example. But we know they often work with a higher variety of languages, which means the final shared file must already allow substantial external help, assuming the original publishers will be overloaded.
    4. The de facto recommendations start at "4.1". If there is a terminologist working to merge the translations, only 4.1 and 4.2.1.3 (about how to label languages) are actually worth reading from here.

1. The big picture

We from HXL-CPLP, aware that the first translations of the HTCDS 1.0 are planned to be released by October (which would mean a bit too short a time), are concerned because there are some issues we know are hard. This post is actually not a criticism of the current copyright holder, the UN IOM, but of the potentially existing workflows for how UN translation is documented to be done, and of potentially too-strict time requirements that deliver less-than-ideal final results.

Why we care. One reason for us from HXL-CPLP to want IGOs like UN agencies to be able to have standards with authoritative translations (even if only in the UN working languages, which excludes for example Portuguese and Hindi) is that the alternatives are worse, as they don't tolerate translations at all. For example, ISO actively DMCAs down any serious translation initiative, and not even the pandemic allowed an exemption; see for example the COVID-19 response (with its "freely available in read-only format" only for English/French). I could cite other bad examples, but even for vocabularies/taxonomies not created inside UN agencies that could be started elsewhere, it is quite complicated to find standards that allow Portuguese versions.

Trivia: the world was prepared to exchange data on how to create vaccines (case study: GISAID https://www.gisaid.org/), but had no convention for how information managers could understand how to deploy them efficiently. A lot of vaccines are wasted, including by rich countries, since they are actually not easy to manage. Emergency translation optimized to be used by machines could have a bigger impact on implementers, since it also allows reuse of software, or at least speeds up implementing ideas already working in other world regions.

Why it is important to optimize for speed (when necessary). Maybe the HTCDS is not as critical in a matter of hours/days, but in humanitarian areas the endorsement of taxonomies/vocabularies needs to be, when necessary, optimized to be fast. One example is minimal conventions for something like fields to share public data related to COVID; there are many others. The implication is to optimize creation (or at least updates for new terms) when endorsed by IGOs, without the need to wait for lawyers.

2. References about existing translation processes with accredited, equally authoritative translations

Both the UN and the European Union are known to publish translations with equal authoritative status. These are just quick comments, mostly to summarize challenges that even the best translators would encounter (so it is not just about problems community translations would face).

This can help build empathy, so future translators will not refuse this type of work. I'm very sure attempts to translate standards have been made in the past, but HTCDS would be the first ever attempt inside the UN to translate a standard meant to be used in software (not law or prose documents).

2.1 1980, Evaluation of the Translation Process in the United Nations System (50 pages)

"Because they feel that they are viewed (when they are noticed) as non-creative appendages performing a costly but mechanical report processing function, they can come to view their work as a high-pressure but rather thankless and tedious task."

In the case of trying to translate HTCDS using the existing UN translation workflow: except for the additional guidance (like the README.md, Governance and Contributions.md, Guidance.md, etc.), the core part that matters is actually very complex. The reason I include this quote from that document is that the better we break down what can be translated (and prepare the document very well), the less likely it is to become a complication for the first translators.

2.2. 2008, Translation at the United Nations as Specialized Translation (16 pages)

2.2.1 The UN already has a very strict workflow to translate documents (HTCDS is more specialized; better to break it into concepts)

This document is closer to what is public knowledge, that is, the internal translation process inside the UN. It documents the following steps:

This process is optimized for the type of document translated at the UN (which is mostly prose, not concepts or terminology plus descriptions). In other words: the important point for HTCDS is that not even the UN has a translation workflow for this kind of content. I'm not saying that something like HTCDS would need all of these steps, but note that empathy with translators is a need.

One coping strategy is not to treat HTCDS as average prose text. This means that if the relevant part of HTCDS (the fields and the definition of what every concept means) can be broken down, this could both enable automation pipelines and, as new concepts are needed, allow faster retranslations; but very likely early attempts will need a lot of copy-pasting of results from translators. The point of citing the translation step as one part of a big workflow is that it is important to mitigate translator burnout. For example, if even UN translators complain about something, it is relevant to take note for future work.
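As a minimal sketch of what "breaking down" the standard into translatable units could look like, the snippet below exports concepts into a translation-ready table. The field names, definitions, and concept IDs are illustrative placeholders, not actual HTCDS content:

```python
import csv
import io

# Hypothetical subset of core fields: concept id -> (term, definition), in English.
# These are invented examples, NOT the real HTCDS fields.
SOURCE_CONCEPTS = {
    "C001": ("Given name", "The first or personal name of the individual."),
    "C002": ("Family name", "The surname or last name of the individual."),
}

def to_translation_table(concepts, source_lang="eng-Latn"):
    """Emit a CSV with one row per concept; translators fill in new columns."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["concept_id", f"term ({source_lang})", f"definition ({source_lang})"])
    for concept_id, (term, definition) in sorted(concepts.items()):
        writer.writerow([concept_id, term, definition])
    return out.getvalue()

print(to_translation_table(SOURCE_CONCEPTS))
```

A pipeline like this lets new concepts be appended and retranslated without re-sending the whole document, which is exactly what prose-oriented workflows cannot do.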

2.2.2 Word equivalence between languages is a myth

One quote from this document:

“Difficulties due to the multi-racial and multilingual characteristics of UN work are regularly encountered by translators. The occasions when one is unable to find equivalents for a word or concept in another language are frequent. For instance, the English words ‘liability’ and ‘responsibility’ have to be translated by the single French word ‘responsabilité’.(...)”

This document also admits that the idea that full equivalence between terms is viable is a myth. Actually, even major spoken languages may lack terms, or the existing ones are vague. So, in the context of HTCDS (or potential other works), the best we can do is be prepared for the fact that one or more languages may need to use provisional terms (which often turn one word into one sentence) and, when necessary, take the following long-term view: by working with language regulators and providing machine-readable glossaries (for software, search engines, etc.) to explain what a new term means, it is possible to introduce a new, more specific term into a language.

This short explanation about "preferred" vs "provisional" vs "proposed" can be seen as just another row in a table, and the request at the end of this topic about the need to add definitions for the fields of HTCDS may seem strange, but if the copyright holder allows it for those working with language regulators, it enables this type of long-term planning.

2.3. 2017, Interaction of law and language in the EU: Challenges of translating in multilingual environment (17 pages)

Quotes:

"English language used in the EU context (...) It is a novel version of the language, often called "EU English" that is different from the English spoken in the UK or Ireland"

"EU legal texts in English very often contain imprecise terms, which is not something one would associate with traditional UK legal language"

The reason for citing this article is that, even if the HTCDS standard (I suppose) was written by native speakers, some fields based on the Salesforce software are too vague. They could work, for example, in a marketing tool, where using more formal names like "Given name" may seem too formal. Also, considering the steps cited in the UN translation workflow, these fields would need a terminological review.

3. Practical examples

Obviously the HTCDS itself is a common need already in the field of human trafficking. Here I will cite parts of it that are relevant both for it and for other humanitarian / human rights usage.

3.1 Generic multilingual vocabulary about person data

FACT: there is nowhere a software developer could get authoritative translations good enough to be used to collect a person's data, with some minimal assurance that information managers (note: not even end users, but those who manage data) would put the same data in the same place the majority of the time.

Trust me. I really looked everywhere.

What often happens is that someone drafts a piece of software, and then translations are made again and again only at later stages. But this is very prone to errors, in particular because they are often done using English terms that are vague even in English. And, in defense of the software developers who use these vague terms: they are often vague because the requirements are vague. Either software developers (who may often use existing references) or translators could make errors.

In an ideal scenario, since form consistency in data collection is essential to mitigate having no alternative but to depend on biometry (or, if the intention is not to collect a full name, to ensure no translation would require a full name), one approach that makes sense would be to have curated translations, here as a multilingual controlled vocabulary: a developer (whether an English speaker, French, Spanish, etc.) who selects one term is guaranteed consistent translations across a bigger range of languages. Also note that humanitarian operations very often occur in places with languages with a low number of global speakers, or with very specific dialects, which makes this something we need to get right for everyone. Note that this even simplifies assessments of the desired level of detail based on privacy requirements, and potentially actions based on data processing.
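To illustrate the idea of a curated multilingual controlled vocabulary, here is a minimal sketch. The concept IDs and translations are invented examples, not HTCDS terms:

```python
# A developer selects a CONCEPT, not a language-specific string; the
# vocabulary guarantees the same label everywhere for each language.
# All IDs and translations below are illustrative assumptions.
VOCABULARY = {
    "person.given_name": {
        "eng-Latn": "Given name",
        "fra-Latn": "Prénom",
        "por-Latn": "Nome próprio",
    },
}

def label(concept_id, language, fallback="eng-Latn"):
    """Return the curated label for a concept; fall back to the source language."""
    terms = VOCABULARY[concept_id]
    return terms.get(language, terms[fallback])

print(label("person.given_name", "fra-Latn"))  # Prénom
print(label("person.given_name", "deu-Latn"))  # not translated yet: falls back
```

The design choice is that forms in different deployments can never drift apart: a missing translation falls back visibly instead of being improvised ad hoc by each software team.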

3.1.1 To UN Translators trying to create versions derived from the person's vocabulary of HTCDS

The text that would be written here was published here https://github.com/UNMigration/HTCDS/issues/22#issuecomment-931752667

4. Recommendations from HXL-CPLP to be considered for the first more-than-translations of HTCDS to UN working languages

Note 1: we from HXL-CPLP are concerned only with translations of the core concepts (like the field names, the definition of each field name without including the field format, and potentially user-readable labels of fields that are standardized) that an end user adding/editing data could see. This is the part we're most interested in helping with, since it is what later allows translations and generation of glossaries, etc. We have no suggestion at all about translations of any additional content (like the README.md, Governance and Contributions.md, Guidance.md, etc.).

4.1 Suggestions on the management part (does not need to be exposed)

TL;DR: The HTCDS does not fit the traditional model used in UN document processing, but someone could act as quality control or internal ombudsman for the UN translators on this project.

  1. Have one person/group who can oversee/intermediate communication between the HTCDS team compiling the translations and the UN translators (including if outsourced), from the first contact.
    1. The person/group with this role tends to be perfectly placed to enable other more-than-translations inside the UN beyond HTCDS. Even as an observer, it is worth the effort to have it.
    2. It is desirable that such a person/group does not have any potential conflict of interest with HTCDS (but this doesn't mean someone outside the UN, or full bureaucracy).
      1. In case the HTCDS team tries to rush the schedule, or pressures translators into something, they have the power to intermediate while protecting the more-than-translators.
      2. When 2 or more of these more-than-translators have a conflict, such a person/group could resolve it.
    3. This person/group doesn't necessarily need to be the one who "compiles" the result.
  2. In addition to the more-than-translations, even if the source document is in English, it is relevant that the translation process allows for a reviewed version in English as an alternative.
    1. Reasoning: the United Nations workflow already considers review of source documents, so it is reasonable that the person/group agreed to oversee/intermediate also has the power to decide, with the translators (or a dedicated terminological review), changes to terms already published in HTCDS 1.0.
    2. If the contents of this extra column become equal to the reviewed HTCDS, the publishers of HTCDS don't need to expose it to the public. One valid reason for HTCDS not to merge it is that proposals for the core terms (like the ones used to label the fields) could make new versions of HTCDS backward incompatible.

4.2 Suggestions on the exported file that is able to be reused (this is what is exposed)

Note: if the person/group who oversees/intermediates communication already has experience with terminology:

  1. The entirety of 4.2 about exported files can be ignored, but the part about how to label the languages is still relevant (we plan to scale up translations while trying to make different language regulators agree on a language, so alpha-2 language codes complicate this). Whatever format with more metadata works for your internal process is fine: if necessary, we from HXL-CPLP can create software to export it to our needs. Actually, even if you export the entire thing in DOCX (but with more metadata), we would be willing to do the copy-pasting.

  2. If you want a reference for similar work, the OCHA Taxonomy As A Service publishes the Countries & Territories Taxonomy MVP https://docs.google.com/spreadsheets/d/1NjSI2LaS3SqbgYc0HdD8oIb7lofGtiHgoKKATCpwVdY/edit#gid=1088874596. But the file from HTCDS needs to be closer to the Europe IATE https://iate.europa.eu/fields-explained, because the terms would be created/edited both by you and, later, by volunteers in other languages. And we would need a lot of metadata, given how many files would be exchanged.

  3. One file (as of 2021-10-01), based on the HTCDS 0.2 and used by us from HXL-CPLP, is here https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=1292720422. This file is still not updated to place term+concept together (as we suggest doing here), but we even have software to convert these HXL files to TBX, XLIFF, etc. at https://hdp.etica.ai/hxltm/. Whatever the final format created by HTCDS with translations (we recommend whatever is easiest for you to manage, even if manual), we will use either software or human copy-pasting to put it into spreadsheets like these.

4.2.1. Opinionated suggestions on what to put in the file shared with HTCDS versions in other languages

4.2.1.1. Crash course on what is Concept, Language, and Term

[Image: TBX termEntry structure — source: https://terminator.readthedocs.io/en/latest/_images/TBX_termEntry_structure.png]

  1. A "Concept ID" is necessary to group concepts inside the multilingual glossary.
  2. The bare minimum for each entry is the "term" and the language it is related to.
    1. Some languages (like ones not translated yet) can be empty.
    2. Each language can have several terms; additional metadata explains what each one is, to differentiate them from each other.
    3. If using a tabular format, when there are several terms (like preferred, admitted, etc.), it is better to place the best one to use first. This allows software that doesn't understand the table to consider just the first head term for each concept. This is the UTX approach.
    4. The most ideal "Term" could be extracted from this table to generate the spreadsheet headings for each language.
  3. The concept definition in natural language is the next most relevant information.
    1. Each language can have only ONE definition per concept (but can have several terms).
    2. Definitions of the same concept in different natural languages don't need to be strictly literal translations, but they cannot differ to the point of representing different concepts.
    3. Each language that already has a well-crafted definition is immediately available to be exported as a glossary (like to generate documentation, an e-book, etc.).
    4. If, for immediate use, a language has a provisional term (like a long sentence) because better terminology is lacking, another term marked as "proposed" should be added to the same concept (not by creating another concept). The export that generates the glossary to help spread new terms can be done with software.
    5. The best way to create new translations is for the human to rely on the definitions instead of only on a term (this solves ambiguity).
  4. Each term can have some way to express "how good it is"; one approach:
    1. TBX+IATE is a great reference, with a field for reliabilityCode (a numeric code) and administrativeStatus (Preferred, Admitted, Deprecated, Obsolete, Proposed).
  5. Whatever the format (spreadsheet, raw XML editing), it makes sense to create custom fields for translator feedback.
    1. If it is something specific about one suggested term, it's term-level information.
    2. If it is about several terms in one language, or comments about problems with the viability of the concept (including problems related to converting data from one format to another), then the field is language-level.
    3. If the information is concept-level AND language-neutral, for example code equivalences in other glossaries, links to external references, etc., then the custom field is concept-level.
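The Concept > Language > Term hierarchy described above can be sketched as a data structure. A minimal illustration, with field names invented here but loosely inspired by TBX/IATE:

```python
from dataclasses import dataclass, field

@dataclass
class Term:
    value: str
    administrative_status: str = "admitted"  # preferred, admitted, proposed...
    note: str = ""                           # term-level translator feedback

@dataclass
class LanguageSection:
    language: str                  # e.g. "eng-Latn"
    definition: str = ""           # at most ONE definition per language
    terms: list = field(default_factory=list)  # several terms allowed
    note: str = ""                 # language-level feedback

@dataclass
class Concept:
    concept_id: str
    languages: dict = field(default_factory=dict)  # language -> LanguageSection

    def head_term(self, language):
        """First term listed is the best one to use (the UTX-style approach)."""
        section = self.languages[language]
        return section.terms[0].value if section.terms else None

# Illustrative content only, not real HTCDS fields:
c = Concept("C001")
c.languages["eng-Latn"] = LanguageSection(
    "eng-Latn",
    definition="The first or personal name of the individual.",
    terms=[Term("Given name", "preferred"), Term("First name", "admitted")],
)
print(c.head_term("eng-Latn"))  # Given name
```

Note how each of the three levels (concept, language, term) carries its own feedback field, matching the custom-field placement rules in item 5 above.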

4.2.1.2. Note on the representation of reliability and administrative statuses

  1. One industry standard for sharing terminology, TBX, and the biggest reference in public collaborative terminology, IATE (see the deeper explanation here https://iate.europa.eu/fields-explained), have a numerical way to express how reliable a term is for a definition. We strongly suggest that both the versions endorsed by HTCDS and the community ones try to follow it faithfully and document it well, so future collaborators can be more effective.
    1. reliabilityCode is term-level. It means how faithful the available endorsement theoretically makes the term. It could be less. It could be more. But the way to attribute it follows an objective procedure.
    2. IATE uses a scale from 1 to 10, where 6/10 (minimum reliability, 2 of 4 stars) is the most common and already tends to be acceptable. The next is 9/10 (reliable, 3 of 4 stars).
    3. 6/10 is the default value a native speaker can attribute to their own suggestions. This means both the creators of HTCDS themselves and the UN translators, without giving sufficient context on why each term is the best to represent that idea, would get a 6/10.
    4. "Sufficient context" means that each term, already in the published file, has more metadata proving the term is representative. Often this means links to external sources already referenced on the subject AND with authority in the language.
  2. For concepts related at least to human trafficking, assuming that the International Organization for Migration can be a reference on the subject at world level (AND for any natural language, as a UN agency), IOM has the power to go straight to 9/10.
    1. For good or for bad, this actually allows IOM to be the primary source without need for justification. That simple. Even a web page with a simple FAQ or glossary makes it a citable source; unless a given term has a reference of lower reliability, the default would now be 9/10.
    2. This also applies to translations. Europe's IATE often has English/French at 9/10 (because of monolingual glossaries) while user-contributed translations, like for Spanish, go at most to 6/10.
    3. However, concepts like a person's name, since they are too generic, don't make sense to annotate with higher reliability unless there is a relevant external reference with a term that matches the concept.
  3. Differences in how natural languages are de facto promoted make a huge difference in the viability of endorsing terminology without the need for expert organizations like IOM.
    1. One extreme example, the least likely to ever reach agreement using this strategy, is actually... English. Other languages often have multi-country organizations able even to influence governmental policies to control language evolution, or are restricted to small regions (so agreement requires fewer organizations).
    2. One potential implication is that, even with HTCDS published in the 6 UN working languages, external translations with more detailed explanations of how new terms were formed could eventually allow endorsement even for generic terms.
      1. Note the relevance of "Proposed" (in addition to the term to be used immediately); this could even mean changes to educational material for learners of the language.
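A minimal sketch of the rule of thumb above. The labels follow IATE's published scale; the helper function is our own illustration, not an official API:

```python
# IATE-style reliability scale (https://iate.europa.eu/fields-explained):
# 1 star = not verified, 2 stars (6/10) = minimum, 3 stars (9/10) = reliable,
# 4 stars (10/10) = very reliable.
RELIABILITY_LABELS = {
    1: "Reliability not verified",
    6: "Minimum reliability",
    9: "Reliable",
    10: "Very reliable",
}

def suggested_reliability(has_authoritative_source: bool) -> int:
    """Rule of thumb from this issue: a plain native-speaker suggestion gets
    6/10; a term backed by an authoritative source on the subject (e.g. an
    IOM glossary page) can go straight to 9/10."""
    return 9 if has_authoritative_source else 6

code = suggested_reliability(has_authoritative_source=False)
print(code, RELIABILITY_LABELS[code])  # 6 Minimum reliability
```

Keeping the numeric code plus a human-readable label in the exported file means both software and volunteer translators can interpret it without reading the full IATE documentation.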

4.2.1.3. A note on the codes used to represent the languages

  1. Together with the terms, the way language is expressed is important not only for interoperability, but to reduce problems when collecting terminology on which several language regulators actually agree.
    1. This is why we suggest, instead of the "IETF BCP 47 language" style, using ISO 639-3 + ISO 15924, like ar: "ara-Arab", en: "eng-Latn", fr: "fra-Latn", ru: "rus-Cyrl", zh: "zho-Hans", es: "spa-Latn".
    2. The total replacement of ISO 639-1 alpha-2 by ISO 639-3 both helps with languages that never got an alpha-2 code (like the language of the biggest minority of Europe) and means that, when using dialects, there is no need to use country codes.
    3. The use of ISO 15924 (writing system), while having the advantage of reducing bias about what the default would be, also helps non-native speakers know the difference between scripts just by looking at how a term is labeled.
  2. We strongly suggest that, if translators are already able to provide transliterations at least for the terms (no need to also transliterate the definitions), they shouldn't be blocked by a limitation of the distributed file.
    1. Example: Hanyu Pinyin ("zho-Latn-pinyin"?) is quite a popular example, especially for learners.
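The suggested labeling can be sketched with a small lookup table. The mapping below contains only the six examples from this issue; a real implementation would need the full ISO 639-3 and ISO 15924 tables (or a library):

```python
# Hand-made mapping from ISO 639-1 alpha-2 codes to the suggested
# "ISO 639-3 + ISO 15924" labels, using only the examples in this issue.
ALPHA2_TO_LANG_SCRIPT = {
    "ar": "ara-Arab",
    "en": "eng-Latn",
    "fr": "fra-Latn",
    "ru": "rus-Cyrl",
    "zh": "zho-Hans",
    "es": "spa-Latn",
}

def label_language(alpha2: str) -> str:
    """Return the ISO 639-3 + ISO 15924 label for a known alpha-2 code."""
    try:
        return ALPHA2_TO_LANG_SCRIPT[alpha2.lower()]
    except KeyError:
        raise ValueError(f"No mapping for {alpha2!r}; extend the table") from None

print(label_language("zh"))  # zho-Hans
```

The explicit script suffix is what makes transliterations possible as extra columns (e.g. a hypothetical "zho-Latn-pinyin") instead of overwriting the native-script term.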

That was it!