SEMICeu / Core-Person-Vocabulary

This is the issue tracker for the maintenance of Core Person Vocabulary

`/ability to understand others speaking/` and `/ability to be understood by others while speaking/`; `/ability to read a language in a specific writing system/` and `/ability to write a language in a specific writing system/` #40


fititnt commented 2 years ago

As per the Wiki documentation, this proposal is divided into: I - submitter name and affiliation; II - portal, service or software product represented or affected; III - clear and concise description of the problem or requirement; IV - proposed solution, if any.

I - submitter name and affiliation

Emerson Rocha, from @EticaAI.

Some context: as part of @HXL-CPLP, the submitter is working both on community translation initiatives for humanitarian use and (because of the lack of usable data standards) on specialized public domain tooling to manage and exchange multilingual terminology, so different lexicographers can compile results without centralization.

II - portal, service or software product represented or affected

Non-exhaustive list of affected points

III - clear and concise description of the problem or requirement

Comments on nomenclature and symbols used in this proposal:

  • The `/`, `//`, `[`, `[[` notation:
    • This syntax is inspired by a non-standard usage of International Phonetic Alphabet delimiters. A quick explanation:
      • `/vague term/`
      • `//extra vague term//`
      • `[precise term]`
      • `[[extra precise term]]`


IV - proposed solution, if any

The submitter proposes no fewer than 2 fields, but ideally four. Whatever the result, it is important as a baseline to differentiate skill in the spoken language from skill in the written language; if a person is a native speaker and has had formal education, the data fields could simply carry the same values.

The suggestions here use 4 fields (a data sketch follows the field list in section A).

A commitment to help with improved tables providing labels for such content is also added at the end.

A. The requested addition to CPV

1. /ability to understand others speaking/@eng-Latn

§ IV 1

General idea of definition: the ability of the person to understand another person in a spoken or sign language

2. /ability to be understood by others while speaking/@eng-Latn

3. /ability to read a language in a specific writing system/@eng-Latn

§ IV 3

4. /ability to write a language in a specific writing system/@eng-Latn
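A minimal sketch of what a record using these four fields could look like. The property names below are hypothetical placeholders (the working group would decide the actual naming); language values are BCP-47 tags, with script subtags mattering for the two written-ability fields, and "level" is an illustrative 0-5 proficiency scale:

```python
# Hypothetical sketch only: property names are placeholders, not
# approved CPV terms. For a native speaker with formal education,
# the spoken and written fields simply carry the same values.
person = {
    "fullName": "Maria Example",
    "abilityToUnderstandSpeech": [{"language": "pt", "level": 5}],
    "abilityToBeUnderstoodSpeaking": [{"language": "pt", "level": 5}],
    "abilityToRead": [{"language": "pt-Latn", "level": 5},
                      {"language": "en-Latn", "level": 3}],
    "abilityToWrite": [{"language": "pt-Latn", "level": 4}],
}
```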

B. Tasks the submitter is willing to help with

Even if not with exactly the same terms, having 4 fields to collect such data AND a numerical scale would be a huge improvement. A fuller explanation of the whys can be given, but in the "worst case" the codes are BCP-47, which is a standard.
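To illustrate why BCP-47 is a workable baseline, here is a minimal sketch (not a full RFC 5646 validator) that splits a tag into primary language, script and region subtags using only the standard library:

```python
import re

# Minimal sketch, NOT a full RFC 5646 validator: extracts the primary
# language (2-3 letters), optional ISO 15924 script (4 letters), and
# optional region (2 letters or 3 digits) from a BCP-47 tag.
BCP47 = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?$"
)

def parse_bcp47(tag: str) -> dict:
    match = BCP47.match(tag)
    if match is None:
        raise ValueError(f"Not a simple BCP-47 tag: {tag!r}")
    return {k: v for k, v in match.groupdict().items() if v}

print(parse_bcp47("pt-Latn-BR"))  # {'language': 'pt', 'script': 'Latn', 'region': 'BR'}
print(parse_bcp47("eng-Latn"))    # {'language': 'eng', 'script': 'Latn'}
```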

However, if this issue goes ahead, the submitter can promise to do his best to actually pursue a stable way to provide pre-compiled tables with all the multilingual information needed. Without this, implementers would get stuck. And even if they do use BCP-47 properly, it is so hard to get the existing translations for the names of the languages that most submissions to Unicode CLDR to this day are only usable by Apple, Google, Microsoft and a few other tech giants.
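One existing route to those CLDR-sourced language names, assuming the third-party `langcodes` package (with its `language_data` companion for the name tables) is an acceptable dependency:

```python
# Assumes the third-party packages `langcodes` and `language_data`,
# which bundle CLDR name data: pip install langcodes language_data
from langcodes import Language

eng = Language.get("en")
print(eng.display_name("pt"))        # e.g. 'inglês' (English, named in Portuguese)
print(eng.display_name("ru"))        # e.g. 'английский'
print(Language.get("pt").autonym())  # e.g. 'português' (the language's own name)
```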

B.1 Where to publish pre-compiled data

B.1.1 Does the European Union have a CKAN?

I'm not aware of one.

B.1.2 The Humanitarian Data Exchange

https://data.humdata.org is a great alternative for publishing such a dataset, periodically updated from the sources.

Several of its datasets are already automated; see https://github.com/OCHA-DAP as a reference.

With some discussion, as long as such compiled tables have a license that allows exchange and there is a minimum of interest for humanitarian usage, the end result could not only be sent there "one time", but become part of a continuous updating process.

This means codes and translations wouldn't get outdated, and everyone could have a more centralized reference.
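Since HDX is built on CKAN, implementers could fetch such a pre-compiled dataset programmatically via the standard CKAN Action API. A sketch, with a hypothetical dataset id:

```python
# Sketch of fetching dataset metadata from HDX via the standard CKAN
# Action API (package_show). The dataset id below is hypothetical.
import json
import urllib.request

DATASET_ID = "language-ability-reference-tables"  # hypothetical id
url = f"https://data.humdata.org/api/3/action/package_show?id={DATASET_ID}"

with urllib.request.urlopen(url) as response:
    package = json.load(response)["result"]

# Each CKAN resource carries a downloadable file (CSV, etc.).
for resource in package["resources"]:
    print(resource["name"], resource["url"])
```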

This type of dataset is also a great candidate for https://vocabulary.unocha.org/.

B.2 Licensing issues

Even for humanitarian usage, it is more likely that merging Unicode CLDR, ISO 639-3 and Glottocodes would be delayed because of licensing issues than because of technical viability. I'm saying this upfront because the idea of expecting each implementer to merge several datasets AND keep them updated is unrealistic. The experience in the humanitarian sector, then, is that pre-compiled, more ready-to-use datasets are needed.

Whether the European Commission has a "CKAN-like" portal or The Humanitarian Data Exchange is used, I know upfront they will ask about licensing.

B.3 Proof of concept that translations for such terms do exist and most algorithms to calculate the most related languages are ready for use

The Unicode CLDR (https://cldr.unicode.org/, https://github.com/unicode-org) has translations for both languages and scripts, and also has Territory-Language Information: https://unicode-org.github.io/cldr-staging/charts/40/supplemental/territory_language_information.html.

One difference between this proposal and the CLDR Territory-Language Information is that CLDR has only 2 fields (Literacy% vs Written%) instead of 4.
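Those two fields live in `common/supplemental/supplementalData.xml` of a CLDR release. A sketch reading them with the standard library, with element and attribute names as found in recent CLDR releases (worth re-checking against the release in use):

```python
# Sketch: read the Literacy% / Written% figures behind the CLDR
# Territory-Language Information chart. Assumes a local copy of
# common/supplemental/supplementalData.xml from a CLDR release.
import xml.etree.ElementTree as ET

tree = ET.parse("supplementalData.xml")
for territory in tree.getroot().iterfind("territoryInfo/territory"):
    for pop in territory.iterfind("languagePopulation"):
        print(
            territory.get("type"),             # territory code, e.g. "US"
            pop.get("type"),                   # language code, e.g. "en"
            territory.get("literacyPercent"),  # Literacy%, territory-level
            pop.get("writingPercent", "n/a"),  # Written%, per language, optional
        )
```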

Example output for eng-Latn:

```bash
fititnt@bravo:~/Documentos/temp/Core-Person-Vocabulary-reply$ /workspace/git/EticaAI/tico-19-hxltm/scripts/fn/linguacodex.py --de_codex eng-Latn | jq
{
  "language": "English",
  "script": "Latin",
  "macro_linguae": false,
  "codex": {
    "BCP47": "en",
    "ISO639P3": "eng",
    "ISO639P2B": "eng",
    "ISO639P2T": "eng",
    "ISO639P1": "en",
    "Glotto": "stan1293",
    "ISO15924A": "Latn"
  },
  "communitas": {
    "litteratum": 1636849041,
    "scribendum": 1327465383
  },
  "nomen": {
    "intranomen": "English",
    "externomen": {
      "ar-Arab": "الإنجليزية",
      "hy-Armn": "անգլերեն",
      "ru-Cyrl": "английский",
      "hi-Deva": "अंग्रेज़ी",
      "gu-Gujr": "અંગ્રેજી",
      "el-Grek": "Αγγλικά",
      "ka-Geor": "ინგლისური",
      "pa-Guru": "ਅੰਗਰੇਜ਼ੀ",
      "zh-Hans": "英语",
      "zh-Hant": "英文",
      "he-Hebr": "אנגלית",
      "ko-Jamo": "영어",
      "jv-Java": "English",
      "ja-Kana": "英語",
      "km-Khmr": "អង់គ្លេស",
      "kn-Knda": "ಇಂಗ್ಲಿಷ್",
      "lo-Laoo": "ອັງກິດ",
      "la-Latn": "English",
      "my-Mymr": "အင်္ဂလိပ်",
      "su-Sund": "English",
      "ta-Taml": "ஆங்கிலம்",
      "te-Telu": "ఇంగ్లీష్",
      "th-Thai": "อังกฤษ",
      "bo-Tibt": "དབྱིན་ཇིའི་སྐད།",
      "ii-Yiii": "ꑱꇩꉙ"
    }
  },
  "praejudicium": [],
  "__meta": {
    "de_codex": "eng-Latn"
  }
}
```

[Screenshot: 2021-12-31 20-47-27]

Also, the aforementioned algorithms for "closest language to reference ones" do already exist. They do not even need access to remote services.
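As one offline example (an assumption about tooling, not part of this proposal), the third-party `langcodes` package implements the CLDR language matching rules locally:

```python
# Sketch: offline "closest language" matching via the third-party
# `langcodes` package, which implements CLDR language matching rules.
import langcodes

supported = ["pt-PT", "es", "en"]
best, distance = langcodes.closest_match("pt-BR", supported)
print(best, distance)  # e.g. 'pt-PT' with a small distance score

# A pairwise distance is also available (lower = more mutually usable):
print(langcodes.tag_distance("pt-BR", "pt-PT"))
```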


Trivia: this proposal still counts as submitted during the year 2021 according to Portugal's time zone.

EmidioStani commented 1 year ago

I think these properties could be useful for the European Learning Model v3, see: https://github.com/european-commission-empl/European-Learning-Model/