EticaAI / tico-19-hxltm

[working-draft] Public domain datasets from Translation Initiative for COVID-19 on the format HXLTM (Multilingual Terminology in Humanitarian Language Exchange)
https://tico-19-hxltm.etica.ai
Creative Commons Zero v1.0 Universal

TICO-19 paper usage of "translated terminologies" vs the, de facto, "translation of list of words" on the published material #3

Open fititnt opened 2 years ago

fititnt commented 2 years ago

It took me some time to realize this, but since we're importing to HXL and planning to export to everything else, including TBX, it needs to be addressed: the majority of the "data rows" in the TICO-19 terminology donated in good faith by Google, and 100% of the TICO-19 terminology donated in good faith by Facebook, cannot be called terminology.

(This does not apply to the Translators Without Borders collaboration, nor to the Google rows that have at least the bare minimum of annotation.)

It also cannot be called "translated terminology" (the term used on the TICO-19 website), because the source content would need to be terminology in the first place, and the initial content was in fact closer to a "list of words". Also note that translating isolated words (or terms, which can be composed of several words) is much more complex than translating sentences. Arbitrarily preparing just the words, without any context of what they mean, does not make them terminology.

A fair name for these data rows is "translated list of words" or "translated wordlist" (not to be confused with the WordNet project). They also cannot simply be annotated after the translation (e.g. by us or someone else adding explanations where they are worth having). To be called translated terminology, the translations would need to be reviewed again after such corrections.

"Translated wordlist" is a tolerable name: it implies less strictly enforced quality control, and it is more aligned with what is in the TICO-19 paper, considering what was, de facto, the shared data. I'm by no means saying that such translated word lists are not useful (in fact, they're easier to bootstrap under urgency and can be good enough for the first days, maybe weeks). But calling them terminology does no good, not only for translators (and not only for "low resource languages", but for everyone who was forced to translate terms from English), but also for users and consumers of the end material, who may assume the quality control of a terminology when that is theoretically impossible to achieve with a mere word list.

When importing to HXLTM, we will need to split the content. It also makes sense to report this back to the online material at https://github.com/tico-19/tico-19.github.io. This does not affect the TICO-19 paper itself (except, maybe, regarding the criticism it makes of the translators' poor understanding while rushing too fast), but it does affect consumers of the datasets not provided by Translators Without Borders.
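As a rough idea of what the split at import time could look like, here is a minimal Python sketch. Everything specific in it is an assumption for illustration only: the column names (`source`, `term_en`, `definition`), the sample rows, and the rule itself (rows from Translators Without Borders, or rows carrying at least some annotation, go to terminology; everything else goes to the wordlist). The real TICO-19 files and the final HXLTM import may look different.

```python
import csv
import io


def split_rows(rows):
    """Split rows into (terminology, wordlist).

    Assumed rule, not a decided one: a row counts as terminology if it
    comes from TWB or carries a non-empty definition/annotation.
    """
    terminology, wordlist = [], []
    for row in rows:
        has_context = bool((row.get("definition") or "").strip())
        if row.get("source") == "TWB" or has_context:
            terminology.append(row)
        else:
            wordlist.append(row)
    return terminology, wordlist


# Made-up sample data; the column names are hypothetical.
SAMPLE = """source,term_en,definition
TWB,quarantine,Separation of people who may have been exposed
Google,fever,
Google,incubation period,Time between exposure and first symptoms
Facebook,mask,
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
terminology, wordlist = split_rows(rows)
print(len(terminology), "terminology rows;", len(wordlist), "wordlist rows")
```

Keeping the two groups apart from the start means the wordlist rows can later be exported under a different label (e.g. not as TBX terminology) without re-checking each row.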