EticaAI / lexicographi-sine-finibus

Lexicographī sine fīnibus
The Unlicense

[`1603:1:51`] /Dictiōnāria Linguārum ad MMXXII ex Numerordĭnātĭo/@lat-Latn #9

fititnt opened this issue 2 years ago

fititnt commented 2 years ago

While we could pack several external existing language codes for data exchange, we will definitely use some languages much more heavily. Also, some data source providers can actually use non-standard codes, so sooner or later we would need to do this anyway.

fititnt commented 2 years ago

Status quo

[Screenshot, 2022-01-22 18:30:57]

Link: https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=272891124

The current working draft has only about 20 languages, and even manually adding/reviewing them already takes some time (and we are not yet at the lesser-known languages). Add to this that there are over 300 languages, and at least the top 100 would quite often already have content for concepts we do not create from scratch.

Why?

It turns out that this table will be quite important. The way we use it allows us to multiply manually tagged concepts with existing Wikidata terms. While these may contain human errors, in general Wikipedia has quite decent self-moderation. The article Property Label Stability in Wikidata (https://dl.acm.org/doi/fullHtml/10.1145/3184558.3191643) gives an idea.
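
To illustrate the multiplication: one concept tagged by hand with a Wikidata Q code can pull in every label Wikidata already holds for that item. Below is a minimal sketch of that lookup, not the project's actual tooling; Q1065 and the language list are examples only.

```python
# Minimal sketch: given a concept manually tagged with a Wikidata Q code,
# fetch the labels Wikidata already has for it in a chosen set of languages.
# Q1065 (United Nations) and the language list are illustrative only.
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def labels_for_item(qid: str, languages: list[str]) -> dict[str, str]:
    query = """
    SELECT ?label (LANG(?label) AS ?lang) WHERE {
      wd:%s rdfs:label ?label .
      FILTER(LANG(?label) IN (%s))
    }
    """ % (qid, ", ".join('"%s"' % lang for lang in languages))
    response = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "lexicographi-sine-finibus-example/0.1"},
    )
    response.raise_for_status()
    bindings = response.json()["results"]["bindings"]
    # One manually tagged concept -> one existing term per language found.
    return {b["lang"]["value"]: b["label"]["value"] for b in bindings}

if __name__ == "__main__":
    print(labels_for_item("Q1065", ["la", "pt", "es", "ar", "zh"]))
```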

The linking back problem

One of the usages (for us here), in addition to compiling concepts, is to tag existing Q/P/L items from Wikidata where relevant. For things related to the humanitarian sector, it is actually quite frequent that a lot of translations already exist.

Use case [1603:45:1]

However, even for something as important as the "UN System" (https://www.un.org/en/about-us/un-system), neither the Wikidata interlinks nor whoever keeps track of smaller organizations inside the UN may have a near-complete picture of anything not big enough. So our question from last year about where to find existing translations (even in Spanish) of humanitarian organization names is actually more complicated, since even monolingual descriptions of each organization may not exist (or exist, but are not shared/updated).

In addition to that page, there is this PDF https://www.un.org/sites/un2.un.org/files/un_system_chart.pdf which mentions

This Chart is a reflection of the functional organization of the United Nations System and for informational purposes only. It does not include all offices or entities of the United Nations System.

The draft at https://docs.google.com/spreadsheets/d/1ih3ouvx_n8W5ntNcYBqoyZ2NRMdaA0LRg5F9mGriZm4/edit#gid=1894917893 (while it has at least one level of division) already intentionally does not reuse the numbers for organizations which could eventually be moved from one place to another. I'm saying this because the way Wikidata properties are organized (for topics related to huge UN organizations) is already not 100% aligned with the published official flowcharts, so there is already a likelihood that organizations will eventually be moved from one place to another. We could "keep" the original ID (as happens with scientific nomenclature when animals are moved from one group to another) while still keeping it usable in the long term.

These types of features may be something we document and leave for later, for collaborators inside these organizations. For example, as soon as we start to find more concepts than they know about, it would be simpler for someone else to update some main table, and they would get everyone else's translations.

fititnt commented 2 years ago

1603_45_1.wikiq.tm.csv

fititnt commented 2 years ago

To allow the Generic tooling for explain files of published dictionaries (file validation; human explanation) #12 to be fully automatable, we're also explicitly adding some conventions on language codes to be used for what is actually a concept code (a strict identifier).

[Screenshot, 2022-01-27 22:04:31]

Open points

The +i_qcc+is_zxxx (BCP 47 qcc-Zxxx) is used for concept codes, where the type of concept is explained by +ix_(...) (BCP 47 qcc-Zxxx-x-(...)), but there are still some open cases.
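
As an illustration of the convention, here is a minimal sketch of splitting such a tag into its BCP 47 parts (language subtag, script subtag, private use); the value `exemplum` is hypothetical, and this is generic tag handling, not the project's validator.

```python
# Minimal sketch: split a BCP 47-style tag such as "qcc-Zxxx-x-exemplum"
# (hypothetical private-use value) into language, script and private-use parts.
# This is generic tag handling, not the project's actual validation tooling.
from typing import Optional

def split_concept_tag(tag: str) -> dict[str, Optional[str]]:
    head, _, private_use = tag.partition("-x-")
    subtags = head.split("-")
    language = subtags[0] if subtags else None
    # A 4-letter second subtag is a script code (e.g. Zxxx, "no written form").
    script = subtags[1] if len(subtags) > 1 and len(subtags[1]) == 4 else None
    return {
        "language": language,                # "qcc": code from the ISO 639 private-use range
        "script": script,                    # "Zxxx"
        "private_use": private_use or None,  # whatever follows "-x-", if anything
    }

print(split_concept_tag("qcc-Zxxx-x-exemplum"))
# {'language': 'qcc', 'script': 'Zxxx', 'private_use': 'exemplum'}
```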

fititnt commented 2 years ago

At this moment, we have around 50 languages prepared. Some (san-Zzzz, msa-Zzzz and zho-Zzzz) need review, but it is already possible to get an idea.

However, just to get the existing terms on [1603:45:1] we would still need at least 150 more languages added (after careful review). Adding new languages does not take days (several can be done at once in one hour), but it is something that needs patience, as what Wikipedia sometimes calls by one code is not strictly what we would document.

Points of improvement discovered

On Cōdex (PDF files) should display characters of all languages it contains #13, the number of languages is such that we already need to document how users can check them when accessing the CSVs and XMLs directly, and we also need to embed the fonts in the PDFs. Otherwise, people will not be able to render the translations at all.

New repository tags

The label reconciliatio-erga-verba (link: https://github.com/EticaAI/multilingual-lexicography/labels/reconciliatio-erga-verba) is the one we will use for issues related to the reconciliation of language terms with the concepts.

[Screenshot, 2022-02-04 02:04:44]

Currently, the only topic under this label is the Wikidata MVP.

The praeparatio-ex-codex label (link: https://github.com/EticaAI/multilingual-lexicography/labels/praeparatio-ex-codex) is focused on Cōdex preparation. Every Cōdex has a hard dependency on [1603:1:51] (this issue here). Other hard dependencies will appear, but the dictionary which explains what each language is, is obviously necessary.

We still do not have a main dictionary to explain the non-linguistic concept attributes.

fititnt commented 2 years ago

Example files (not even finished yet):


Hard work pays off.

The point here is that, the way the dictionaries are done, preparing new ones is much, much faster, and it is possible for the end documentation to have not only the language terms (in this case up to 227 for a concept) but also how others can review them. Latin terms are so hard that we keep them to a minimum, so over time we could automate the Cōdex for other writing systems (including using numerals other than 0123456789), as it will be easier for others.

Obviously there are several strategies to compile the translations for concepts, so 1603:45:1 is easier to bootstrap (also, the nature of the concepts means the major ones are highly reviewed/protected). But compared to the early days of HXLTM, it is easier to have not just a few terminological translations but over 100 (with continuous improvements), while it becomes easier to bootstrap new dictionaries in days (actually just a few hours) based on new needs.

🙂


[Screenshot, 2022-02-10 08:18:35]


Anyway, there are a few dozen dictionaries which are viable to compile/document without starting specific translation initiatives. And they are already encyclopedic language variants (which could be reviewed, but the baseline is already good). As strange as it may seem, this approach scales better than the data cleaning necessary on the https://github.com/EticaAI/tico-19-hxltm "terminologies". I mean, while HXLTM itself is documented so that anyone can scale translations, we're optimizing further to do the bootstrapping ourselves. Something like the terminologies of TICO-19 (and the way Google/Facebook translated with paid professional translators) would be unlikely to be as efficient as the way we could do it using only volunteers.

I'm not saying this to compete with the way Google/Facebook did TICO-19, but to mitigate errors we could make ourselves if doing something similar at scale. Also, both Google and Facebook complained about the lack of source content with open licenses, so the best course of action would already be to create content instead of using content from others.

fititnt commented 2 years ago

We're adding so many languages recently that the SPARQL backend queries may be timing out again (or hitting some other sort of error). Better to break them up again.

Last time we asked for all Q items, but broke the language translations into 3 batches. These languages are then merged using the HXL Standard CLI tools, as in the sketch below.
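
A minimal sketch of that merge step, assuming each batch was exported as a CSV keyed by the Q item (the `qid` column and the file names are hypothetical; the real pipeline uses the HXL Standard CLI tools rather than ad hoc Python):

```python
# Minimal sketch: merge per-batch label CSVs into one table keyed by Q item.
# Column name "qid" and the batch file names are hypothetical; the actual
# pipeline uses the HXL Standard CLI tools for this merge.
import csv

def merge_batches(paths: list[str], key: str = "qid") -> list[dict[str, str]]:
    merged: dict[str, dict[str, str]] = {}
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Rows from later batches add new language columns to the
                # same concept instead of creating duplicate rows.
                merged.setdefault(row[key], {}).update(row)
    return list(merged.values())

if __name__ == "__main__":
    rows = merge_batches(["batch-1.csv", "batch-2.csv", "batch-3.csv"])
    if rows:
        fieldnames = sorted({column for row in rows for column in row})
        with open("merged.csv", "w", newline="", encoding="utf-8") as out:
            writer = csv.DictWriter(out, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows)
```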

fititnt commented 2 years ago

Fantastic! With a few exceptions (some languages which have subtags: be-tarask, en-simple, fr-x-nrm, jv-x-bms), we used Q1065 (https://www.wikidata.org/wiki/Q1065) as the reference for which existing translations were missing synchronization. It was massive work to add over 100 items (reviewing them one by one, including annotations of which terms still need translation to Latin, which are macrolanguages, etc.), since we need to make things easier for implementers.
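
A sketch of that kind of reference check, under the assumption that we have the language codes already configured in the dictionary and the languages for which Q1065 has a label (both sets below are hypothetical placeholders):

```python
# Minimal sketch: use the languages that already label a well-covered item
# (e.g. Q1065) as a reference to spot which languages are missing from the
# dictionary. Both example sets below are hypothetical placeholders.
def missing_languages(configured: set[str], reference_labels: set[str]) -> list[str]:
    # Languages present on the reference item but not yet in our dictionary.
    return sorted(reference_labels - configured)

configured_in_dictionary = {"la", "pt", "es", "ar"}
languages_labelling_q1065 = {"la", "pt", "es", "ar", "zh", "ru", "sw"}
print(missing_languages(configured_in_dictionary, languages_labelling_q1065))
# ['ru', 'sw', 'zh']
```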

On the screenshot, except for the number of pages (we're using an A5 'pocket format', not A4, so the page count doubles) and the additional quantity of concepts, it is possible to compare the difference.

We have more than the total options for Q1065 because of the Cōdex:


[Screenshot, 2022-03-16 06:13:40]


Next steps

We could at least add a language for whatever has at least a first page on Wikipedia (I think it would be around 300, not far from what we have now). This is still not every option Wikidata could provide. However, we can simply add a small number of languages as people get interested. It is not a problem to add languages which are not yet on Wikidata.

How relevant this table is

For now we're using Wikimedia, but we can interlink with other potential sources. The automation here allows us to get more and more efficient. However, particularly for macrolanguages, there are A LOT of missing codes, and they will need a lot of discussion, since we could have more volunteers than well-documented codes for sharing their work.

The work behind the Cōdex [1603:1:51] //Dictiōnāria Linguārum// was both to explain what exists and to make it easier to create new ones.

The dictionaries are getting bigger while allowing structured translation initiatives

What we're doing is very technical. It's so specialized (both in how hardcore it is to glue the technological part together AND in the understanding of the languages) that it is allowing us to be very efficient at bridging to the people willing to help with humanitarian work and human rights in general. There are far more people willing to help with causes than there is capacity to deal with their contributions in a way that is very shareable and reusable.

Even without calls to action we already have decent compilations. But the ideal use case would be to document how people could add translations via Wikidata (without needing to create Wikipedia pages). This would start to fill a lot of gaps. Most people know how to help on Wikipedia, but Wikidata (except for concepts with a lot of visibility, such as Q1065) is quite friendly to new translations.

Our pending submission on The Humanitarian Data Exchange is not due to incompetence on our side

While we are still waiting for The Humanitarian Data Exchange (https://data.humdata.org/) to accept us @HXL-CPLP / @EticaAI (as they're supposed to do), in the meantime most features will tend to be related to making it easier for humans to make corrections on already encyclopedic-level content. I mean: we are already optimizing for what comes later, but while waiting, yes, we're taking notes on how hard it is to be accepted. Under no circumstances will we accept any organization from the global north subjecting us to partnerships that are more harmful than caring about affected people, just because they are the ones allowed to share work which, for reasons not yet explained to us, is not considered humanitarian.

By the way, we're not just "dumping" Wikidata labels: we are doing research on the areas that need focus (and I mean not only our discussions on https://github.com/SEMICeu/Core-Person-Vocabulary; there's much more going on), and preparing concepts and documentation far beyond what would be expected even from entire initiatives, which typically would only share final work in English (likely only in PDF format). We're not just providing terminology translations into hundreds of languages in machine-readable format, but also in English, since the international community fails even at the basics of data interoperability.