EticaAI / tico-19-hxltm

[working-draft] Public domain datasets from the Translation Initiative for COVID-19 (TICO-19) in the HXLTM format (Multilingual Terminology in Humanitarian Language Exchange)
https://tico-19-hxltm.etica.ai
Creative Commons Zero v1.0 Universal

TICO-19 ideal data normalization steps to generate HXLTM data #1

[Open] fititnt opened this issue 3 years ago

fititnt commented 3 years ago

NOTE: While it is possible to ingest data directly from the existing files, it will be simpler (and maybe relevant) for anyone who wants to process some of the original files to do so after some normalization (without actually changing the contents, only how they are labeled and packaged). Since this project is under urgency, we will just document it here.

At a bare minimum, this means moving files into directories so that each directory contains the same logical group. But I suspect we will also need to normalize the language codes used. From hxltm-action-example | /data/verum/TICO-19/terminologies-facebook-lint.patch, some of the CSVs need manual escaping (the translations contain literal commas, but the generated output did not escape them as per RFC 4180, so they break tooling).
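A minimal sketch of that manual escaping, assuming the unescaped commas only appear in the last column, re-emitting each row with RFC 4180 quoting through Python's csv module (the function name, file names, and column count are hypothetical; adjust n_cols to each terminology file's real layout):

```python
import csv

def reescape_csv(src_path: str, dst_path: str, n_cols: int) -> None:
    """Re-write a broken CSV with proper RFC 4180 quoting.

    Assumes unescaped commas occur only in the final column, so each
    raw line is split on the first n_cols - 1 commas and the remainder
    is kept intact as the last field.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        # QUOTE_MINIMAL quotes any field containing a comma or quote char
        writer = csv.writer(dst, quoting=csv.QUOTE_MINIMAL)
        for raw in src:
            writer.writerow(raw.rstrip("\r\n").split(",", n_cols - 1))

# Hypothetical usage against one of the Facebook terminology files:
# reescape_csv("terminology.csv", "terminology.fixed.csv", n_cols=4)
```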

Not surprisingly, the data was published as it was created. Also, different providers (for example, the Google terminologies and the Facebook terminologies) used different language codes to express the same things.

While the Google terminologies only used country codes in specific cases, Facebook used them explicitly on all terms, to the point that when there was no need to specify a country, they used '_XX'. That suffix could just be omitted.
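A rough sketch of that language-code normalization, assuming Facebook-style codes such as en_XX or pt_BR and hyphen- or underscore-separated codes from other providers (the exact separators used by each provider are an assumption to verify against the actual files). It lowercases the language subtag, uppercases the region, uses the BCP 47 hyphen, and drops the '_XX' placeholder:

```python
def normalize_language_code(code: str) -> str:
    """Normalize a provider language code to a BCP 47-style tag.

    The placeholder region 'XX' (Facebook's 'no need to specify
    country') is dropped entirely, since it could just be omitted.
    """
    parts = code.replace("_", "-").split("-")
    language = parts[0].lower()
    region = parts[1].upper() if len(parts) > 1 else ""
    if not region or region == "XX":
        return language
    return f"{language}-{region}"

assert normalize_language_code("en_XX") == "en"
assert normalize_language_code("pt_BR") == "pt-BR"
assert normalize_language_code("FR") == "fr"
```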