EticaAI / tico-19-hxltm

[working-draft] Public domain datasets from the Translation Initiative for COVID-19 (TICO-19) in the HXLTM format (Multilingual Terminology in Humanitarian Language Exchange)
https://tico-19-hxltm.etica.ai
Creative Commons Zero v1.0 Universal

TICO-19 ideal data normalization steps to generate HXLTM data #1

[Open] fititnt opened this issue 3 years ago

fititnt commented 3 years ago

NOTE: While it is possible to ingest data directly from the existing files, it will be simpler (and maybe relevant) for anyone who wants to process some of the original files to do so after some normalization (without actually changing the contents, only how they are labeled and packaged). Since this project is under urgency, we will just document it here.

At a bare minimum, this means moving files into directories so that each directory contains the same logical group. But I suspect we will also need to normalize the language codes used. From hxltm-action-example | /data/verum/TICO-19/terminologies-facebook-lint.patch, some of the CSVs need manual escaping (the translations contain literal commas, but the generated output did not escape them as per RFC 4180, so they break tooling).
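A minimal sketch of that manual escaping, assuming the unescaped commas only appear in the last column, re-emitting each row with RFC 4180 quoting through Python's csv module (the function name, file names, and column count are hypothetical; adjust n_cols to each terminology file's real layout):

```python
import csv

def reescape_csv(src_path: str, dst_path: str, n_cols: int) -> None:
    """Re-write a broken CSV with proper RFC 4180 quoting.

    Assumes unescaped commas occur only in the final column, so each
    raw line is split on the first n_cols - 1 commas and the remainder
    is kept intact as the last field.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        # QUOTE_MINIMAL quotes any field containing a comma or quote char
        writer = csv.writer(dst, quoting=csv.QUOTE_MINIMAL)
        for raw in src:
            writer.writerow(raw.rstrip("\r\n").split(",", n_cols - 1))

# Hypothetical usage against one of the Facebook terminology files:
# reescape_csv("terminology.csv", "terminology.fixed.csv", n_cols=4)
```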

Not surprisingly, the data was published as it was created. Also, different providers (for example, the Google terminologies and the Facebook terminologies) used different language codes to express the same things.

While the Google terminologies only used country codes in specific cases, Facebook used them explicitly on all terms, to the point that when there was no need to specify a country, they used '_XX'. That suffix could just be omitted.
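A rough sketch of that language-code normalization, assuming Facebook-style codes such as en_XX or pt_BR and hyphen- or underscore-separated codes from other providers (the exact separators used by each provider are an assumption to verify against the actual files). It lowercases the language subtag, uppercases the region, uses the BCP 47 hyphen, and drops the '_XX' placeholder:

```python
def normalize_language_code(code: str) -> str:
    """Normalize a provider language code to a BCP 47-style tag.

    The placeholder region 'XX' (Facebook's 'no need to specify
    country') is dropped entirely, since it could just be omitted.
    """
    parts = code.replace("_", "-").split("-")
    language = parts[0].lower()
    region = parts[1].upper() if len(parts) > 1 else ""
    if not region or region == "XX":
        return language
    return f"{language}-{region}"

assert normalize_language_code("en_XX") == "en"
assert normalize_language_code("pt_BR") == "pt-BR"
assert normalize_language_code("FR") == "fr"
```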