EticaAI / tico-19-hxltm

[working-draft] Public domain datasets from the Translation Initiative for COVID-19 (TICO-19) in the HXLTM format (Multilingual Terminology in Humanitarian Language Exchange)
https://tico-19-hxltm.etica.ai
Creative Commons Zero v1.0 Universal

Review and annotate likely wrong language codes #4

Open fititnt opened 2 years ago

fititnt commented 2 years ago

At least a small part of the language codes (which are very important, since they identify what entire community submissions or professional translators are working on) are not only malformatted, but outright wrong. Since this is not a mere conversion that can be automated, we need to document the problem before republishing.

Some of these malformed codes are es-LA (Spanish as in Laos) and ar-AR ("Arabic something" as in Argentina).
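To make the issue concrete, here is a minimal sketch (assuming the Python langcodes package, with the optional language_data package for names) of how those tags are interpreted once parsed: they are syntactically well formed, so only the territory meaning reveals the problem.

```python
# Sketch: how the suspicious tags parse (assumes langcodes + language_data installed).
from langcodes import Language, tag_is_valid

for tag in ("es-LA", "ar-AR"):
    lang = Language.get(tag)
    print(
        tag,
        "valid:", tag_is_valid(tag),            # both are valid BCP 47 tags
        "| language:", lang.language_name(),    # e.g. "Spanish"
        "| territory:", lang.territory_name(),  # e.g. "Laos" -- probably not what was meant
    )
```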

How we could do it

This is a complex subject. The volume of translations is so high that we ourselves could introduce new bugs (the same way TICO-19 eventually published some), so we could even draft command-line tools (or ask others to help with them) just to find the types of errors that are common mistakes; see the sketch at the end of this comment. In the short term, at least, we need to document it as a separate issue.

So, to be clear on this point: even though this is a serious issue, we can't blame the people who submitted data to the TICO-19 initiative. We should assume that, under urgency and rapid data exchange, such mistakes are even more likely to occur.
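As a starting point for such tooling, here is a rough sketch of a batch check. The glossary file name and the column name are hypothetical, and it assumes the langcodes package; it only reports things for a human to look at, it does not fix anything.

```python
# Rough sketch of a batch check for suspicious language codes.
# Assumptions: a CSV with a "language_code" column; langcodes installed.
import csv
from langcodes import Language, tag_is_valid

def scan(path: str) -> None:
    with open(path, newline="", encoding="utf-8") as handle:
        for row_number, row in enumerate(csv.DictReader(handle), start=2):
            tag = (row.get("language_code") or "").strip()
            if not tag:
                print(f"row {row_number}: empty language code")
            elif not tag_is_valid(tag):
                print(f"row {row_number}: invalid or unknown tag {tag!r}")
            else:
                lang = Language.get(tag)
                if lang.territory:
                    # Well formed, but the territory may still be wrong
                    # (es-LA, ar-AR, ...), so flag it for human review.
                    print(f"row {row_number}: {tag} -> review territory {lang.territory}")

scan("tico19-glossary.csv")  # hypothetical file name
```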

fititnt commented 2 years ago

Okay. I'm moving a bit of the logic around to also generate CSVs that expose the language-code conversion. With Asciidoctor, it is possible to import these tables into the PDF and the web documentation. This will already be used later on #2. The exception is ebooks, which would need images created from the tables :|

(Screenshot: 2021-11-20 06-35-44)

Anyway, I still need some strategy to get CLDR information to print in those tables without requiring the full Java pipeline. There are not really many CLIs doing that, so we may need to do some scripting. The use case would be to fetch the translated names of the languages and potentially check whether the territories match what was marked. I mean, people using es-LA (Spanish as in Laos) and ar-AR ("Arabic something" as in Argentina) could at least have some way to check it without needing human attention.
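One possible scripting shortcut (an assumption on my part, not necessarily the pipeline this project will use) is to read CLDR-derived names through the langcodes and language_data packages instead of the full Java tooling:

```python
# Sketch: CLDR-derived display names without the Java pipeline.
# Assumes the langcodes and language_data packages, which bundle CLDR data.
from langcodes import Language

tag = "es-LA"
lang = Language.get(tag)

# Name in English, in another documentation language, and the autonym.
print(lang.display_name())      # e.g. "Spanish (Laos)"
print(lang.display_name("pt"))  # e.g. "espanhol (Laos)"
print(lang.autonym())           # e.g. "español (Laos)"
```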

fititnt commented 2 years ago

Ok. Here's something. It will take more time since I'm preparing extra reusable tools.

Actually, I'm not doing this only for TICO-19 (although it helps as a real test case, with more people watching the issues that happened). Even when I myself deal directly with data in HXL tables (that is, directly in Google Sheets or Excel), this type of mistake can happen even more often, because it is raw, direct access to the data rather than someone's service. Tools do exist that can manage sources like CLDR data as a library (most commonly in Java, NodeJS, Python and PHP, although dealing with all those XMLs is quite advanced; langcodes already abstracts part of the work), but my idea is to expose this as a CLI. That way it doesn't matter much which software is used, and it can also be plugged into data pipelines.

In addition to localization-related translations, the CLDR provides some ways to calculate statistics about speakers per language per country. These are obviously not exact, but they can help flag codes that are well formed yet are good candidates for human review. For example, a language–territory combination with 0 people able to speak or write it can catch basic human typing errors, even without trying to brute-force whether the terms themselves are wrong.

(Screenshot: 2021-11-21 11-45-28)

Note: communitas.litteratum is the approximate number of speakers (or users of a sign language) and communitas.scrībendum the approximate number of people able to write the language. Names and output may change, but the data is from CLDR.
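As one hedged illustration of that review heuristic (assuming the langcodes package with language_data, which expose CLDR-derived population estimates; the column names above are the project's own, not these), a near-zero estimate for a tag is a hint that the territory may be a typo:

```python
# Sketch: use CLDR-derived population estimates to flag tags for human review.
# Assumes langcodes + language_data (which ship CLDR speaker/writer data).
from langcodes import Language

for tag in ("pt-BR", "es-LA", "ar-AR"):
    lang = Language.get(tag)
    speakers = lang.speaking_population()
    writers = lang.writing_population()
    flag = "REVIEW" if speakers == 0 else "ok"
    print(f"{tag}: ~{speakers} speakers, ~{writers} writers -> {flag}")
```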

On the aid to convert language codes

Anyway, the additional use for this, beyond giving hints about mislabeled language codes, is that it actually helps me map the additional language codes we would use in HXLTM tables. For example, what they call en would become at minimum +i_en+i_eng+is_Latn (maybe even a Glottolog code, which is even more detailed), so any content on HXL becomes easier to process, since the HXL attributes allow querying. This may seem simple for English, but every one of the over 205 combinations would need to be labeled manually in a project like TICO-19. (The number is this large because many of the country combinations are redundant; this still needs review.)

This is one of the reasons I will try to make a tool that converts from an (assumed already perfectly used) BCP 47 code and then creates the HXL language attributes that can be inferred from it. If something like Glottolog or other more exact language codes were added, even optimistic scenarios would still need human review, but at least that stress is reduced to fewer languages. For codes that are already imperfect, this means some help against human error.
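A rough sketch of that inference step (my own approximation of the idea, assuming langcodes; the exact attribute set and casing the project ends up using may differ):

```python
# Sketch: infer HXL-style language attributes from a BCP 47 tag.
# Assumes langcodes; output format mirrors the +i_en+i_eng+is_Latn example above.
from langcodes import Language

def hxl_attributes(tag: str) -> str:
    lang = Language.get(tag)
    maximized = lang.maximize()          # fill in the likely script (and territory)
    parts = [
        f"+i_{lang.language}",           # BCP 47 primary subtag, e.g. "en"
        f"+i_{lang.to_alpha3()}",        # ISO 639-3 code, e.g. "eng"
        f"+is_{maximized.script}",       # ISO 15924 script, e.g. "Latn"
    ]
    return "".join(parts)

print(hxl_attributes("en"))     # +i_en+i_eng+is_Latn
print(hxl_attributes("es-LA"))  # script is inferred even for dubious territory tags
```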

On automation to "detect" the right language (rather than just flagging mislabeled codes without awareness of content)

Actually, some libraries exist (even in Python) that brute-force natural language detection, so they could catch extra human errors by comparing the declared language code with samples of the text. But, as boring as it may be for me to say this, for something as sensitive as deciding which codes to use when creating a dictionary for others, detection could at best be used for quick tests against human error.
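As an example of the kind of quick test meant here (a sketch assuming the langdetect and langcodes packages; the sample sentences are made up), detection can only cross-check a declared code against sample text, not decide the code:

```python
# Sketch: cross-check a declared language code against sample text.
# Assumes the langdetect and langcodes packages; only a quick sanity test.
from langdetect import detect
from langcodes import Language

def looks_consistent(declared_tag: str, sample_text: str) -> bool:
    detected = detect(sample_text)                  # e.g. "es"
    declared = Language.get(declared_tag).language  # primary language subtag
    return detected == declared

print(looks_consistent("es-LA", "La pandemia de COVID-19 es una emergencia."))
print(looks_consistent("ar-AR", "The quick brown fox jumps over the lazy dog."))
```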

Also, I'm actually concerned that past and future models may have been trained and labeled with wrong language codes (and, obviously, very new natural languages would be unknown to any detection solution). So bundling such detection into the same tool meant to help people doing lexicography could make the situation worse.

To be fair, the scenarios where this is always relevant would be testing for software bugs (think of a human running the right command, but the software using wrong codes or swapping languages), or someone publishing data for others who may already be assumed to be credible while the person who approves it cannot even read the script (there it could be the last resort of error checking, assuming collaborators could be exhausted under emergency responses).