fititnt opened 2 years ago
Edited (but initial comment is below)
Truth be told: in addition to the lack of tooling to deal with linguistic content (especially terminology), not surprisingly most of the initial work was done in a hurry AND by merging content from different submitters. The folder with Translation Memories (using TMX) was updated over months. The work from the different providers also took different approaches (both in how concepts were documented and in how the language of each term was encoded), which made it not fully compatible. So the idea of having a single file with all the terminology would only make sense later.
Anyway, dealing with this kind of terminology exchange and translation memory exchange for such a number of languages is already hardcore. The paper mentions that a lot of work went into reviewing the translations themselves (and I think this was mostly by Translators Without Borders), but it is clear that even different companies (like Facebook and Google), which could also ask their users to help with translations, could at least be made aware of how to prepare the additional descriptions instead of simply asking for the bare term. I am saying this because if any future initiative with companies tries to crowdsource content, it would need some minimum standards, so that exchanging data with others would be less complex.
Same reasoning as the previous updated comment. I will keep the old comments from yesterday here.
I still think this sounds a bit like non-intentional helicopter research, but improving the quality of the data collected and shared, since it was produced with quite different strategies, would require more planning ahead of time, especially for companies whose main goal is not even related to translation/terminology.
Quick proof of concept (merging only 4 languages in 3 different files) here: hxl proxy link. This link may stop working if the test files are changed (which is likely).
As long as the working languages are well defined, hxltmdexml can easily convert from TMX to CSV. With HXL proxy, humans can merge the datasets one by one. But every column is different (it carries the information about its language), unlike the TICO-19 terminology, where the language is a column; there, merging all the data is just appending the CSVs, being aware that each one has its own header.
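To illustrate the easy case above (language stored as a column, as in the TICO-19 terminology), here is a minimal sketch of appending such CSV exports while keeping only the first header. This is not part of hxltmdexml or HXL proxy; the `language,term` column names are just illustrative.

```python
import csv
import io

def append_csvs(csv_texts):
    """Append CSV files that already share the same layout
    (language as a column), keeping only the first header row."""
    merged = []
    for text in csv_texts:
        reader = csv.reader(io.StringIO(text))
        header = next(reader)  # every exported file repeats its own header
        if not merged:
            merged.append(header)
        merged.extend(reader)  # data rows can be appended as-is
    return merged

# Hypothetical per-provider exports sharing the same columns:
file_a = "language,term\npt,vacina\n"
file_b = "language,term\nes,vacuna\n"
```

When the language is instead encoded in the column names (one column per language), the files would first need to be reshaped into this long layout before such a plain append works.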
While not crucial for implementing the V1 of hxltm-action, this issue will be used to document the strategy used to convert this real dataset as a test case for conversion. The end result could both be useful in itself and help to understand what additional tooling would be relevant.
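As a rough sketch of what the TMX-to-CSV step involves (this is NOT the hxltmdexml implementation, and the `language`/`segment` field names are my own), a TMX file stores each translation unit as a `<tu>` with one `<tuv xml:lang="…">` per language, so flattening it into the "language as a column" long layout looks roughly like:

```python
import xml.etree.ElementTree as ET

# Under ElementTree, the xml:lang attribute expands to this qualified name
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

def tmx_to_long_rows(tmx_text):
    """Flatten TMX translation units into (language, segment) rows,
    so files from different providers can later be appended directly."""
    root = ET.fromstring(tmx_text)
    rows = []
    for tu in root.iter("tu"):
        for tuv in tu.findall("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain lang attribute
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            rows.append({"language": lang,
                         "segment": tuv.findtext("seg", default="")})
    return rows

# Tiny hand-written TMX fragment, only for illustration:
sample = """<tmx version="1.4"><body>
  <tu>
    <tuv xml:lang="en"><seg>vaccine</seg></tuv>
    <tuv xml:lang="pt"><seg>vacina</seg></tuv>
  </tu>
</body></tmx>"""
```

Each `<tu>` yields one row per language here, which is the shape that makes the later append trivial.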