fititnt / hxltm-action

[non-production-ready] Multilingual Terminology in Humanitarian Language Exchange. TBX, TMX, XLIFF, UTX, XML, CSV, Excel XLSX, Google Sheets, (...)
https://hxltm.etica.ai/
The Unlicense

Test case for hxltm-action: datasets from Translation Initiative for COVID-19 "TICO-19" #5

Open fititnt opened 2 years ago

fititnt commented 2 years ago

While not crucial for implementing V1 of hxltm-action, this issue will be used to document the strategy for converting this real dataset as a conversion test case. The end result could both be useful in itself and help to identify what additional tooling would be relevant.

TODO: add more context.

fititnt commented 2 years ago

Edited (the initial comment is kept below)

Truth be told: in addition to the lack of tooling for dealing with linguistic content (terminology in particular), it is not surprising that most of the initial work was done in a hurry AND by merging content from different submitters. The folder with Translation Memories (using TMX) was updated over several months, and the work from the different providers took different approaches (both in how concepts are documented and in how the language of each term is encoded), so the results are not fully compatible with each other. Any idea of a single file with all the terminology would therefore only make sense later.

Anyway, dealing with terminology exchange and translation memory exchange for this many languages is already hardcore. The paper mentions that a lot of work went into reviewing the translations themselves (I think mostly by Translators Without Borders), but it is clear that companies such as Facebook and Google, which could also ask their users to help with translations, should at least be aware of how to prepare the additional descriptions instead of simply asking for the bare term. I'm saying this because if any future initiative with companies tries to crowdsource content, they would need some minimum standards so that exchanging data with others becomes less complex.


**Old comment here**

### One visible challenge: TICO-19 does not have any multilingual file with more than 2 languages

Even setting aside the idea of HXLTM (it did not even exist when TICO-19 started; we're both new at @HXL-CPLP, and I don't remember hearing about this initiative back then, or I would 100% have been interested in getting in), the data repository does not have any "global" dataset for terminology. Even the terminology is released as bilingual CSVs (I think they have some ID for concepts, but the research paper mentions that the way the languages were structured makes them non-uniform). But bilingual files are not really the best way to organize terminology. My complaint here is that maybe no one considered releasing it as **TermBase eXchange (TBX)**, which is supposed to be the de facto industry standard, even if I totally understand that its lack of tooling makes it hard.

#### Possible hypothesis

##### The non-tooling part

The research paper mentions both pivot languages and the fact that some languages had more review than others. With this alone, it is likely that if the files were merged back into some multilingual terminology, what the European IATE would call the reliabilityCode would not be the same across languages. In other words: we could "merge" the files back together here, but ideally this would be done by the people who worked on the project.

Also, they did not store comments from translators about the issues they had. I mention this not just because I saw that some translations (not only Portuguese) have minor issues (like translating `WHO` as `WHO`, not `OMS`, in the Facebook/Google datasets), but because whoever uses this dataset in the future may keep carrying those imperfect translations forward. Very likely the Facebook/Google sets have more issues because the source text in English was already poor. The lack of language diversity at such companies means no one is aware of how hard it is to translate terms without any context.

##### The tooling part

This compressed README makes me suspect why they did not use TBX:

> Use this directory to upload the Translation Memories.
>
> TMX files were created with the Tikal script of the Okapi Framework https://okapiframework.org/wiki/index.php/Tikal_-_Conversion_Commands#Convert_to_TMX_Format

Way back on https://github.com/EticaAI/HXL-Data-Science-file-formats/issues I remember trying to use Okapi to convert CSVs (like the ones on HXL) to TMX and TBX. Okapi does not support TBX (https://okapiframework.org/wiki/index.php/Open_Standards#TBX_-_Term_Base_eXchange). The other tool that supports TBX, https://toolkit.translatehouse.org/ (see http://docs.translatehouse.org/projects/translate-toolkit/en/latest/commands/csv2tbx.html), has an internal format that is mono- or bilingual, so the tooling itself simply cannot handle multilingual terminology. So even if TBX was, in theory, a good way to merge the end result and get translations back from Translators Without Borders, the people who compiled the final project would have needed to create their own tooling, and that means reading the entire TBX specification.
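Just to make the "merge the files back together" idea concrete, here is a minimal sketch (Python, not part of hxltm-action and not the real TICO-19 layout) of how bilingual terminology CSVs could be folded into one multilingual table keyed by the English source term. The file name pattern and the column headers `source`/`target` are assumptions for illustration only.

```python
# Sketch: fold bilingual CSVs (one per target language) into one multilingual CSV.
# File names like 'terminology_en-pt.csv' and headers 'source'/'target' are assumed.
import csv
import glob
from collections import defaultdict

merged = defaultdict(dict)  # {english_term: {lang_code: translated_term}}

for path in glob.glob('terminology_en-*.csv'):
    # Assume the target language code is encoded in the file name, e.g. 'pt'.
    target_lang = path.rsplit('en-', 1)[-1].removesuffix('.csv')
    with open(path, newline='', encoding='utf-8') as handle:
        for row in csv.DictReader(handle):
            merged[row['source']][target_lang] = row['target']

langs = sorted({lang for terms in merged.values() for lang in terms})
with open('terminology_multilingual.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow(['source_en'] + langs)
    for source, terms in merged.items():
        writer.writerow([source] + [terms.get(lang, '') for lang in langs])
```

Of course, a mechanical merge like this cannot recover the different review levels per language, which is exactly why it would be better done by the people who worked on the project.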
fititnt commented 2 years ago

Same reasoning as the previous edited comment. I will keep the old comment from yesterday below.

I still think this sounds a bit like unintentional helicopter research, but improving the quality of the data collected and shared, since it was produced with quite different strategies, would require more planning ahead of time, especially for companies where translation/terminology is not even related to their main goal.


**Old comment here**

I will take some sleep and continue tomorrow (so whatever I'm writing here is not fully reviewed).

TBX does have less tooling, but unless the source content for the data/TMs (which may be the CSVs) was already reduced so that no extra relevant information remains, using **Translation Memory eXchange (TMX)** (https://en.wikipedia.org/wiki/Translation_Memory_eXchange) instead of **XLIFF (XML Localization Interchange File Format)** (https://en.wikipedia.org/wiki/XLIFF) to store translations is actually less powerful. Any decent tool focused on translations supports XLIFF. Even open source ones like https://www.matecat.com/ allow exporting translators' annotations, do some automated tagging and checks, and let the user rearrange the source text when it is poorly written (which is not hard for source text in English).

### Why not TMX (instead of XLIFF) for translations

**TMX is a dead standard.** And storing only language pairs still does not fix the big picture. It is supported by most machine learning pipelines and the like only because it is simple to implement, but it is not adequate for serious usage beyond language pairs. I think the TMX files were generated later (so they are not really losing information, because the content was already stored first without any more relevant context), but for cases where translations are already stored in XLIFF, TMX is a downgrade. And I'm saying this because it is bizarre to apply a lot of machine learning and so on to detect the accuracy of translations while not caring about how those translations are collected and stored. If any Facebook or Google employee reads this comment in the coming years: for God's sake, take some time and implement proper parsing of XLIFF. TBX is far better, but most translation tools would export XLIFF as their best output format.

### The not-so-ironic relation with TBX and helicopter research

- Context: https://en.wikipedia.org/wiki/Neo-colonial_science

One reason to implement more support for data formats that collect more input from translators (instead of over-reducing to only what machines can understand) is that the alternative resembles helicopter research.

> Our QA process revealed that in most cases the problems arose when the translators did not have any medical expertise, which lead them to misunderstand the English source sentence and often opt for sub-par literal or word-for-word translations.

This part of the paper seems to ignore the fact that for most of these languages the translators were doing word-for-word translations because the languages themselves may never have had a term for such concepts. This word-by-word strategy is mentioned, for example, for Arabic in https://www.researchgate.net/publication/272431518_Methods_of_Creating_and_Introducing_New_Terms_in_Arabic_Contributions_from_English-Arabic_Translation. My point is that even storing additional metadata from translators (like their comments), as would be possible with any minimal tool compatible with XLIFF, would make it possible to see why they were failing.

> We additionally release the sampled dataset along with detailed error annotations and corrections (...) Although small in size (at most 558 sentences in each translation direction), we hope that releasing these annotations will also invite automatic quality estimation and post-editing research for diverse under-resourced languages.

The paper does mention somewhere that at least some annotations will be released. **So maybe I just did not find them in the zip files, and 60% of my written rant here could be wrong.** But I don't think they can achieve such high accuracy when even the source text has at least punctuation like commas missing.
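To illustrate why XLIFF keeps the context that TMX drops, here is a minimal sketch (assuming XLIFF 1.2 and a placeholder file name) that reads the translator notes sitting next to each segment, using only the Python standard library. It is only an illustration of the format, not a proposal for the project's tooling.

```python
# Sketch: list translator notes per segment from an XLIFF 1.2 file.
# 'example.xliff' is a placeholder path, not a real TICO-19 file.
import xml.etree.ElementTree as ET

NS = {'x': 'urn:oasis:names:tc:xliff:document:1.2'}

tree = ET.parse('example.xliff')
for unit in tree.getroot().iterfind('.//x:trans-unit', NS):
    source = unit.findtext('x:source', default='', namespaces=NS)
    target = unit.findtext('x:target', default='', namespaces=NS)
    notes = [note.text for note in unit.iterfind('x:note', NS) if note.text]
    if notes:
        # These notes are exactly the translator context a TMX export would lose.
        print(source, '->', target, '|', '; '.join(notes))
```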
fititnt commented 2 years ago

Quick proof of concept (merging only 4 languages from 3 different files) here: hxl proxy link. This link may stop working if the test files are changed (which is likely).

As long as the working languages are well defined, hxltmdexml can easily convert from TMX to CSV. With the HXL proxy, humans can merge datasets one by one. But every column is different (each carries the language information in its header), unlike the TICO-19 terminology, where the language is a column value, so merging all that data is just a matter of appending the CSVs while being aware that each one has its own heading. A rough sketch of the TMX-to-CSV step is below.
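This is not hxltmdexml itself, just a sketch of the same idea: read a TMX file and emit one CSV row per translation unit, with one column per language code found in the file. The input and output file names are placeholders, and inline markup inside `<seg>` is ignored for simplicity.

```python
# Sketch: flatten a TMX file into a CSV with one column per language code.
# 'tico19.tmx' / 'tico19.csv' are placeholder names.
import csv
import xml.etree.ElementTree as ET

XML_LANG = '{http://www.w3.org/XML/1998/namespace}lang'

tree = ET.parse('tico19.tmx')
rows, langs = [], []
for tu in tree.getroot().iterfind('./body/tu'):
    row = {}
    for tuv in tu.iterfind('tuv'):
        lang = tuv.get(XML_LANG) or tuv.get('lang')  # older TMX used plain lang=
        row[lang] = tuv.findtext('seg', default='')  # inline markup not handled here
        if lang not in langs:
            langs.append(lang)
    rows.append(row)

with open('tico19.csv', 'w', newline='', encoding='utf-8') as out:
    writer = csv.writer(out)
    writer.writerow(langs)
    for row in rows:
        writer.writerow([row.get(lang, '') for lang in langs])
```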

Current screenshots


Screenshot from 2021-11-09 17-47-15

Screenshot from 2021-11-09 17-47-30

Screenshot from 2021-11-09 17-42-10