EticaAI / lexicographi-sine-finibus

Lexicographī sine fīnibus
The Unlicense

Strategy to encode strict controlled vocabularies (terms in natural language which match stricter translations) #17

Open fititnt opened 2 years ago

fititnt commented 2 years ago

Quick links on the mentioned use case


Context

Currently, after the hard work of reconciling concepts with existing Wikidata Q items, we're already able to get terms in over 100 languages. The way the Wikipedia ecosystem works (heavy self-moderation) means that, in general, the baseline result is already better than the alternatives. In fact, humans using Wikipedia as a reference and then making corrections are more likely to deliver better results than guessing the term translations.

At this moment, we're already able to get these terms, compile them, and re-share them. They do not receive any special labeling.

Example: specialized use cases of controlled vocabularies

Both Basle Nomina Anatomica (BNA1895) and Terminologia Anatomica (TA98) are great references of controlled natural languages (yes, I'm aware that Latin, as a dead language, makes this easier) which are known to have later been translated into several languages by expert associations (often at country level).

TA98, despite being the active international reference on human anatomical terminology, has far fewer translations than BNA1895. Also, the adoption of TA98 is far from perfect. Even in countries which do have translations (such as Brazil), some researchers such as this one (link link link) complain that the adoption of the stricter Portuguese version of TA98 is only moderate.

In general, experts publishing research often still use archaic terms (such as the ones BNA1895 would have) even when translations exist.

However, the situation in languages with no official translation of TA98 at all is likely to be somewhat worse.

Important fact for the reader: most (but not all) existing terms in BNA1895 were kept in TA98; terms which are not mere additions tend to be better specializations of old body parts or terminology "simplifications" (which not rarely, in my honest personal opinion, were made because English speakers preferred to adopt old Greek roots instead of keeping Latin roots; I know this is a personal rant). Anyway, old books with stricter anatomical terminology exist everywhere and could be reused for a global compilation.

Non-Latin examples

I'm not aware of other nomenclature translations, but if they exist, they are likely to be heavily copyrighted terminology, likely the ones from ISO.

They are not relevant for our use cases, as they're not scientific nomenclature.

The focus on this topic

The idea of this topic is to both have at least one real namespace of dictionaries as a practical example AND get the tooling and general documentation on how to encode specialized nomenclature ready.

Using MVP de [1603:25:1] /partes corporis humani/ #11 as an example, we can encode both Latin and Portuguese based on a stricter reference.

This approach does not exclude the use of terms from Wikidata Q (and, in fact, users could then change Wikidata to adhere to the specialized vocabulary). But since we're using a much smaller subset of terms than the full BNA1895/TA98, it is actually feasible to do. We can also get some idea of how far Wikidata/Wikipedia already diverges from the stricter nomenclature.
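The divergence check mentioned above can be sketched as a simple comparison, per language, between the strict term and the Wikidata label of each concept. This is only a minimal sketch: the function name `find_divergences`, the concept keys and the sample terms are all hypothetical and were not checked against TA98, BNA1895 or Wikidata.

```python
def find_divergences(strict_terms, wikidata_labels):
    """Return concepts whose Wikidata label differs from the strict term.

    Both arguments map a concept key to a term in a single language.
    The comparison ignores case and surrounding whitespace.
    """
    divergences = {}
    for concept, strict in strict_terms.items():
        label = wikidata_labels.get(concept)
        if label is None:
            continue  # no Wikidata label to compare against
        if label.strip().casefold() != strict.strip().casefold():
            divergences[concept] = (strict, label)
    return divergences


# Hypothetical sample data (assumption, for illustration only).
strict_la = {"c1": "manus", "c2": "cor", "c3": "hepar"}
wikidata_la = {"c1": "Manus", "c2": "cardia"}

print(find_divergences(strict_la, wikidata_la))
```

Concepts with no Wikidata label are skipped rather than reported, since a missing label is a coverage gap and not a divergence.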

One problem of creating tags for each regional organization (instead of a generic one)

The way TA98 was released (Latin and English) did not take into account any attempt to centralize international terminology. I'm aware copyright plays a role in this, but like it or not, except for the Latin + English terms of TA98 released in 2011, pretty much every local language relies on books.

In other words: Wikipedia (Wikidata), without any extra effort, is already the closest thing to an international link for such terminology. We may go a step further and differentiate strict terms (but even this may later be used to correct non-strict terms on Wikidata).

Anyway, even in cases where it is possible to create a special attribute for each organization that could validate the terminology variants in each natural language known to have been generated in the past from either TA98 or (much more commonly) BNA1895, it makes sense to have a common attribute, used in addition to the natural language codes, to express that such terms are the ones actually endorsed somewhere.
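One way to picture the common attribute is a single generic flag carried next to the language code, with the endorsing organization kept as optional extra information. The sketch below is only an illustration of that idea: the `Term` record, the `+strict` marker and the helper names are hypothetical and are not an existing convention of this project.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Term:
    concept: str                       # concept key (hypothetical)
    language: str                      # natural language code, e.g. "la", "pt"
    text: str
    endorsed: bool = False             # the common, generic attribute
    endorsed_by: Optional[str] = None  # optional organization-specific detail


def column_key(term, strict_marker="strict"):
    """Build a column name: the language code plus a generic strict marker."""
    if term.endorsed:
        return f"{term.language}+{strict_marker}"
    return term.language


def preferred(terms, language):
    """Pick an endorsed term for a language if one exists, else any term."""
    candidates = [t for t in terms if t.language == language]
    candidates.sort(key=lambda t: not t.endorsed)  # endorsed terms first
    return candidates[0] if candidates else None


# Hypothetical sample data (assumption, for illustration only).
terms = [
    Term("c1", "pt", "term from Wikidata"),
    Term("c1", "pt", "term from book", endorsed=True, endorsed_by="example society"),
]
print(column_key(preferred(terms, "pt")))
```

Because the flag is generic, tooling only needs to know one marker, while `endorsed_by` can still record which regional organization validated the term.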

Nomenclature consistency is more important than copyright (and words cannot be copyrighted alone)

In particular for the nomenclature of anatomy (and our use cases are even fair use of a smaller subset), it is unlikely that anyone anywhere will oppose open initiatives to ensure consistency. Discussions such as this one, https://www.wikidata.org/wiki/Wikidata:Property_proposal/TA98_Latin_term, may raise fears that do not make sense.

The alternative would mean reusing archaic terms (which is exactly what terminologists don't want). And given that we're doing a massive compilation of terminology, it is better (when viable) not to deviate from the latest endorsed terms. It's not just "not wrong", but the right thing to do.

fititnt commented 2 years ago

For the sake of both the Latin and Portuguese versions, here are photos which likely have all the terms we're interested in.

I'm also pasting them here because later some reviewers may not have the books at hand and, unlike the Latin + English version, there is no online version.

Note: several languages of the world do not have translations of TA98, but we will investigate which of the TA98 terms we use have no strict match in BNA1895 (which is less likely to happen). This approach could ensure eventual 100% perfect reuse of still-valid BNA1895 translations.
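The investigation in the note amounts to a set difference: from the TA98 terms actually used, list those that have no strict BNA1895 equivalent. A minimal sketch, assuming a hand-built mapping from TA98 term to BNA1895 term; the function name and the placeholder terms are hypothetical:

```python
def ta98_terms_without_bna1895_match(used_ta98_terms, bna1895_match):
    """List the used TA98 terms that lack a strict BNA1895 equivalent.

    bna1895_match maps a TA98 term to its BNA1895 term; a missing key,
    None or an empty string all mean "no strict match".
    """
    unmatched = {t for t in used_ta98_terms if not bna1895_match.get(t)}
    return sorted(unmatched)


# Placeholder data (assumption, for illustration only).
used = ["term A", "term B", "term C"]
match = {"term A": "old term A", "term B": None}

print(ta98_terms_without_bna1895_match(used, match))
```

Every term not in the returned list can reuse existing BNA1895 translations as they are; only the returned ones need separate review.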


[5 attached photos]