EticaAI / lexicographi-sine-finibus

Lexicographī sine fīnibus
The Unlicense

Planned procedural generic strategy to generate reversible numeric codes for concepts which global standards have no writing system neutral coding alternative #1

Open fititnt opened 2 years ago

fititnt commented 2 years ago

Numerical codes are not only computationally efficient and easier to use when defining large numbers of codes (for example, for the internal divisions or organizations of a country, and also in modern works such as Terminologia Anatomica), but are also much better suited to multilingual lexicography.

By "neutral codes" people sometimes assume the point is that different regions dislike each other's alphabets, but there are actually serious usability issues. For example, with US-ASCII letters (which, by the way, are not the full Latin alphabet), no matter how hard they try, an average native speaker of any Arabic dialect (and not only them) simply cannot pronounce all the letters, because several of the sounds are uncommon for them. The same thing happens even in languages that do use the Latin alphabet, to the point that coping mechanisms are needed, such as using the ICAO spelling alphabet to pronounce each letter. By comparison, numbers are usually spoken with each language's own words, which often sound quite different from one language to another.

Use cases: why it makes sense for lexicography not to have a single point of coordination

TICO-19

We discovered one interesting fact empirically while doing the lexicography of the [working-draft] Public domain datasets from Translation Initiative for COVID-19 in the HXLTM format (Multilingual Terminology in Humanitarian Language Exchange); current link: https://github.com/EticaAI/tico-19-hxltm. I will focus on the wordlists (the TICO-19 "terminology" without concept descriptions).

The final result has more errors in non-Latin scripts. This doesn't mean errors did not occur in por-Latn, spa-Latn, ita-Latn, etc. (fun fact: several translations are better than the eng-Latn used as the initial reference), where the most common issue was "literal translation". However, even though the "terminology/wordlists" had professional translators, the issues in non-Latin writing systems were not caught by quality control, and making it easier to distribute the last review step could likely improve this.

TICO-19 use case:

  • For a language where the transliteration of "coronavirus" would be "koronavirus" (but in another alphabet), the translator did not understand that "CV" was an abbreviation, so instead of the equivalent of "KV" it was rendered letter by letter as "CV" in that writing system (so users would not understand the connection with "koronavirus").
    • Under ideal circumstances, the "ideal" way to prepare translations would be to link terms by concept, so that everyone (not only those working in languages with non-Latin scripts) could know the better variant. However, as the TICO-19 word lists were produced under urgent need at the time, such attempts to organize could only come later.

I could say more on this topic, but equivalent quality control would not require that lexicographers (people who compile the results of others) actually know each language, only that they know at least one language and know the writing system. This is likely what the quality control on TICO-19 already was for languages in Latin script (some of them likely knew more than English, yet the work fell mostly on translators and reviewers).

License issues

Slow response for humanitarian usage

This topic alone would take a full discussion, but even for humanitarian usage, licences are problematic. Emergency translation initiatives need authorization far faster than the lawyers of an average copyright-holding organization are able to respond.

Not practical to credit everyone's collaboration in an aggregated result

See also

Also, there are several issues when compiling together the work of different organizations. EVEN if it were possible to know everyone who helped, and they all donated their work for free, how would attribution be handled? I will leave one example from https://upload.wikimedia.org/wikipedia/commons/1/18/Arguments_on_CC0-licensing_for_data.pdf

[Screenshot, 2022-01-06 22:03:08]

How numeric codes can both help with international review from different regions and cope with licensing

While there are other use cases, some way to procedurally generate such numbers can help at least with review (or even with splitting work across different regions) and with licensing.

In the worst-case scenario, the terms in the initial reference language can be removed immediately, as soon as DMCA requests are made. This also copes with the fact that, by default, minimally creative work done by volunteers (and, by the way, lexicographers using Numerordĭnātĭo would already have more context to explain concepts) could not be claimed by any initial implementation.

In practice, this could allow translation initiatives focused on the humanitarian area to start quickly and still welcome giving the end work back to be validated/reused by the organizations, here aiming at general public benefit. However, if the lawyers of such organizations try to troll, it is up to the external lexicography coordinators to remove references to the initial standards for whatever is not already fair use. A typical consequence would be removing the "copyrighted" source terms (often English and sometimes French) and releasing well-curated versions of everything else in usable, friendly file formats.
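As a rough sketch of that removal step, assume terms are stored per concept under a numeric key, with one entry per language code. The concept numbers, the sample terms and the `drop_language` helper below are hypothetical and only illustrate the idea; they are not taken from any existing dataset or tool.

```python
def drop_language(concepts, language):
    """Return a copy of `concepts` without the terms for one language code."""
    return {
        code: {lang: term for lang, term in terms.items() if lang != language}
        for code, terms in concepts.items()
    }


# Hypothetical sample: numeric concept code -> {language code: term}
concepts = {
    1: {"eng-Latn": "coronavirus", "por-Latn": "coronavírus", "spa-Latn": "coronavirus"},
    2: {"eng-Latn": "CV", "por-Latn": "CV"},
}

# Release everything except the initial reference language.
released = drop_language(concepts, "eng-Latn")
print(released[1])  # {'por-Latn': 'coronavírus', 'spa-Latn': 'coronavirus'}
```

Because the remaining terms stay attached to the numeric concept code, the release is still usable after the source-language column is gone.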

Note that in practice it is unlikely that such lawyers would go this far against translations for humanitarian use; it is more likely that this would be done by inexperienced lawyers or by "near automated" responses.

fititnt commented 2 years ago

Current internal notes on 1613

Index  Initial reference  Unicode notes                                          Comments
1      (whitespace)       Zs Separator, space                                    Default to generic space, tabs, line breaks... All variants are Zs too
2      +                  1. Zs Separator, space                                 TODO: explain more
3      -                  1. Pd Punctuation, dash; 2. Sm Symbol, math; 3. (...)  TODO: needs proof of concept
4      * | x              1. P, Punctuation; 2. Sm Symbol, math; 3. (...)        TODO: needs proof of concept
5      / | ÷              1. P, Punctuation; 2. Sm Symbol, math; 3. (...)        TODO: needs proof of concept
6      = | =              1. P, Punctuation; 2. Sm Symbol, math; 3. (...)        TODO: needs proof of concept
10     ( | [ | {          1. Ps Punctuation, open                                All alternatives are guaranteed to be Ps Punctuation, open
11     ) | ] | }          1. Pe Punctuation, close                               All alternatives are guaranteed to be Pe Punctuation, close
12     _                  1. Pc Punctuation, connector                           TODO: needs proof of concept
13     \                  (special)                                              TODO: need at least one escaping character to be reused without upgrading the mode
19     (none)             Private use                                            Not assigned.
56     (none)             Not assigned.                                          Not assigned.
57     (none)             Not assigned.                                          Not assigned.
58     (none)             Not assigned.                                          Not assigned.
59     (none)             Not assigned.                                          Not assigned.
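The claims in rows 10 and 11 (all alternatives share one Unicode General Category) can be checked mechanically. The following is a minimal sketch using Python's standard `unicodedata` module; the helper names `general_categories` and `share_category` are hypothetical and not part of any existing tooling.

```python
import unicodedata

# Alternatives copied from rows 10 and 11 of the notes above.
OPENERS = ["(", "[", "{"]
CLOSERS = [")", "]", "}"]


def general_categories(chars):
    """Return the set of Unicode General Category codes found in `chars`."""
    return {unicodedata.category(ch) for ch in chars}


def share_category(chars, expected):
    """True if every character in `chars` has exactly the `expected` category."""
    return general_categories(chars) == {expected}


print(share_category(OPENERS, "Ps"))  # True
print(share_category(CLOSERS, "Pe"))  # True
print(unicodedata.category("_"))      # Pc
print(unicodedata.category("-"))      # Pd
```

The same check is a quick way to test the TODO rows. For instance, `unicodedata.category("\t")` returns `Cc` rather than `Zs`, so the whitespace row would probably need an explicit allow-list instead of a category test alone.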

Known missing symbols:

Current draft of 1613/1603.2.60.no1.tm.hxl.tsv

#item+conceptum+numerordinatio  #item+rem+i_mul+is_zsym+ix_ndt60+ix_ndt60
0   �
1    
2   +
3   -
4   /
5   =
6   �
7   �
8   �
9   �
10  (
11  )
12  _
13  \
14  �
15  �
16  �
17  �
18  �
19  �
20  0
21  1
22  2
23  3
24  4
25  5
26  6
27  7
28  8
29  9
30  a
31  b
32  c
33  d
34  e
35  f
36  g
37  h
38  i
39  j
40  k
41  l
42  m
43  n
44  o
45  p
46  q
47  r
48  s
49  t
50  u
51  v
52  w
53  x
54  y
55  z
56  �
57  �
58  �
59  �
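To show how the draft table above could give reversible codes, here is a minimal sketch that maps each symbol to its index and back. Only the assigned rows are included. The function names, and the idea of packing the indices into a single base-60 integer, are assumptions made for illustration and are not the Numerordĭnātĭo algorithm itself.

```python
import string

# Assigned rows of the draft table (index -> symbol); unassigned slots are omitted.
INDEX_TO_SYMBOL = {1: " ", 2: "+", 3: "-", 4: "/", 5: "=",
                   10: "(", 11: ")", 12: "_", 13: "\\"}
INDEX_TO_SYMBOL.update({20 + i: d for i, d in enumerate(string.digits)})
INDEX_TO_SYMBOL.update({30 + i: c for i, c in enumerate(string.ascii_lowercase)})
SYMBOL_TO_INDEX = {symbol: index for index, symbol in INDEX_TO_SYMBOL.items()}


def encode(text):
    """Map each symbol to its table index; raises KeyError for unknown symbols."""
    return [SYMBOL_TO_INDEX[ch] for ch in text]


def decode(codes):
    """Inverse of encode()."""
    return "".join(INDEX_TO_SYMBOL[code] for code in codes)


def to_int(codes):
    """Pack indices as base-60 digits (hypothetical; index 0 is never produced)."""
    number = 0
    for code in codes:
        number = number * 60 + code
    return number


def from_int(number):
    """Inverse of to_int()."""
    codes = []
    while number > 0:
        number, code = divmod(number, 60)
        codes.append(code)
    return codes[::-1]


print(encode("cv"))                    # [32, 51]
print(decode(from_int(to_int(encode("kv")))))  # kv
```

Since no symbol maps to index 0 in this draft, the packed integer has no leading-zero ambiguity and the round trip stays reversible. Uppercase letters and any symbol outside the table raise `KeyError` here.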