EticaAI / HXL-Data-Science-file-formats

Common file formats used for Data Science and language localization exported from (and to) HXL (The Humanitarian Exchange Language)
https://hdp.etica.ai/
The Unlicense

[meta] Internationalization and localization (`i18n`, `l10n`) and internal working vocabulary #15

Open fititnt opened 3 years ago

fititnt commented 3 years ago

Quick links:


This issue may be used to make references to the internal working vocabulary and to how to deal with internationalization and localization, in particular for the [meta issue] hxlm #11.

A lot of work has already been done, but in addition to being used internally, for tools like https://json-schema.org/ (which can be used to generate helpers for people who edit YAML by hand in code editors like VSCode), to allow multiple languages (even for the keys, not just the content) we may eventually need to generate the JSON Schemas ourselves (there is no native way to make JSON Schemas multilingual).
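Something like this could work for the generation step (just a sketch; the base schema and the Latin-to-Russian key mapping below are hypothetical):

import copy
import json

def localize_schema(base_schema, key_map):
    """Return a copy of a JSON Schema where the top-level property names
    are renamed according to key_map (e.g. Latin key -> Russian key)."""
    schema = copy.deepcopy(base_schema)
    localized = {}
    for key, subschema in schema.get('properties', {}).items():
        localized[key_map.get(key, key)] = subschema
    schema['properties'] = localized
    return schema

# Hypothetical base schema and Latin -> Russian key mapping
base = {'type': 'object', 'properties': {'hsilo': {'type': 'object'}}}
print(json.dumps(localize_schema(base, {'hsilo': 'силосная'}),
                 ensure_ascii=False, indent=2))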

TODO: add more context

fititnt commented 3 years ago

Maybe it will be possible to conditionally load JSON Schemas (and, with JSON Schemas, autocomplete) based not just on the file extension (think ola-mundo.por.hdp.yml vs hello-world.eng.hdp.yml) but also on something like salve-mundi.hdp.yml.


Edit: from salve-mundi.mul.hdp.yml to salve-mundi.hdp.yml
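Just a sketch of how the language could be detected from the file name (hypothetical helper, not the actual implementation); files without an explicit code are treated as multilingual:

import re

def filename_language(filename):
    """Extract an ISO 639-3 code from names like 'hello-world.eng.hdp.yml'.
    Files without an explicit code (e.g. 'salve-mundi.hdp.yml') are treated
    as 'mul' (multiple languages), so a multilingual schema would be used."""
    match = re.search(r'\.([a-z]{3})\.hdp\.ya?ml$', filename)
    return match.group(1) if match else 'mul'

assert filename_language('hello-world.eng.hdp.yml') == 'eng'
assert filename_language('salve-mundi.hdp.yml') == 'mul'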

fititnt commented 3 years ago

Almost there...

hdpcli tests/hrecipe/hello-world.hrecipe.hdp.yml --objectivum-linguam RUS

(Screenshot: 2021-03-17 18-22-05)

fititnt commented 3 years ago

We will need to use recursion.

And this needs to not try to translate even the inline example data (or, at least for now, the country/territory ISO 2 codes). But I think this is still not as hard as the need to do it well, in particular when parsing an unknown language, to avoid some sort of recursive DDoS.
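Something like this is what I have in mind (the names are illustrative, not the real functions): recurse over the keys, leave inline example data untouched, and cap the depth so a malformed or hostile document cannot become unbounded recursion:

def transpose_keys(node, key_map, skip_keys=('exemplum',), depth=0, max_depth=32):
    """Recursively rename dict keys via key_map. Subtrees under skip_keys
    (inline example data) are copied untouched, scalars (including ISO
    country codes) are never translated, and the depth cap keeps a bad or
    hostile document from turning into unbounded recursion."""
    if depth > max_depth:
        raise RecursionError('HDP document nested too deeply')
    if isinstance(node, list):
        return [transpose_keys(item, key_map, skip_keys, depth + 1, max_depth)
                for item in node]
    if isinstance(node, dict):
        result = {}
        for key, value in node.items():
            new_key = key_map.get(key, key)
            if key in skip_keys:
                result[new_key] = value  # keep example data exactly as-is
            else:
                result[new_key] = transpose_keys(
                    value, key_map, skip_keys, depth + 1, max_depth)
        return result
    return node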

fititnt commented 3 years ago

We're already able to export the internal representation (heavily based on Latin) in the 6 UN languages plus Portuguese!!!

(It still does not check the input syntax beyond what JSON Schema warns the user about, but OK, it's something!)

1. Make any known vocabulary equally valid

Things really shine if any of the 7 languages is equally valid as a full working project. That's the idea. This feature alone makes it hugely appealing to use.

Note: the core_vocab, while it will always try to export a unique ID per language, tolerates aliases.

1.1 Aliases are good... but the idea is not to overuse them for macrolanguages

In other words: core_vocab (plus user ad-hoc customization for unknown languages) tolerates some variation on input. But it is still a good idea, at some point, not to force entire macrolanguages (like Arabic and Chinese) onto the same ISO 639-3 codes (OK, this could be a hot fix, but it is not ideal).

If necessary, I think we can implement some way to override just part of a vocabulary. So, for example, if 20% of an individual language shares acceptable conventions with the macrolanguage, we make HDP itself allow this.
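Just to illustrate the merge strategy (the dictionaries below are hypothetical placeholders, not real core_vocab entries): start from the macrolanguage vocabulary and overlay only what the individual language defines differently.

def apply_partial_override(macro_vocab, overrides):
    """Build the vocabulary of an individual language by copying the
    macrolanguage vocabulary and replacing only the overridden terms."""
    vocab = dict(macro_vocab)   # start from the macrolanguage entry
    vocab.update(overrides)     # overlay the ~20% that genuinely differs
    return vocab

# Hypothetical placeholders, not real core_vocab entries:
macro = {'hsilo': 'term-a', 'datum': 'term-b'}
individual = apply_partial_override(macro, {'datum': 'term-b-variant'})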

2. What would be the "official" version of a file?

Even if, in practice, most teams that already use English as a working language would use thing.eng.hdp.yml, I like the idea that resources created by someone else can be kept as the very exact file they came in, and that the HDP tools could still tolerate, on the fly, more than one file on disk.

This may not be as relevant when everyone speaks the same language, but it can at least work as a benchmark for when working with HDP files from others.

2.1 What about if two files on disk are out of sync (like someone edited one version)?

I think that either by default (it should be possible to enable/disable this with configuration) or via an extra command line option, there should be some way to detect whether two resources in different languages would deliver different results.
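One possible shape for that check (a sketch only; PyYAML and the normalization step are assumptions, and the library's own transposition could play the normalize role):

import yaml  # PyYAML, assumed to be available

def out_of_sync(path_a, path_b, normalize):
    """Report whether two on-disk language variants of the same HDP
    resource would deliver different results. `normalize` is whatever
    function transposes a parsed document back to the reference (Latin)
    vocabulary -- for example a thin wrapper around the library's own
    transposition."""
    with open(path_a, encoding='utf-8') as file_a:
        doc_a = yaml.safe_load(file_a)
    with open(path_b, encoding='utf-8') as file_b:
        doc_b = yaml.safe_load(file_b)
    return normalize(doc_a) != normalize(doc_b)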

2.2 What if two same resources NEED to be different? (Like a file from someone else had an error, or needs an update before passing it to the next person)

Again, I think that this case may need some way of implicitly knowing that two resources are almost the same... but changes are allowed (maybe just allowing a few parameters to be overridden).

At a more basic level, I think that just a small name change (like thing-v2.eng.hdp.yml) could do the trick. This may sound lazy, but it would be sufficient to not raise errors.

But the idea would be something that (not this week, maybe not this month, because I need to do other stuff outside this project) eventually allows digitally signing an HDP file.

And the process of digitally signing, when necessary, needs to allow humans who may do this many times, but without automating it too much: there is a reason why smart cards like YubiKeys have a physical button, and it is scary.


Edit: added example

fititnt@bravo:/workspace/git/EticaAI/HXL-Data-Science-file-formats$ hdpcli tests/hrecipe/salve-mundi.hrecipe.mul.hdp.yml --objectivum-linguam RUS

urn:hdp:OO:HS:local:salve-mundi.hrecipe.mul.hdp.yml:
  силосная:
    группа:
      - salve-mundi
    описание:
      ENG: Hello World!
      POR: Olá Mundo!
    страна:
      - AO
      - BR
      - CV
      - GQ
      - GW
      - MO
      - MZ
      - PT
      - ST
      - TL
    тег:
      - CPLP
    язык: MUL
  трансформация-данных:
    - _recipe:
        - aggregators:
            - sum(population) as Population#population
          filter: count
          patterns: adm1+name,adm1+code
        - filter: clean_data
          number: population
          number_format: .0f
      идентификатор: example-processing-with-a-JSON-spec
      пример:
        - источник:
            _sheet_index: 1
            iri: https://data.humdata.org/dataset/yemen-humanitarian-needs-overview
        - источник:
            данные:
              - - header 1
                - header 2
                - header 3
              - - '#item +id'
                - '#item +name'
                - '#item +value'
              - - ACME1
                - ACME Inc.
                - '123'
              - - XPTO1
                - XPTO org
                - '456'
          цель:
            данные:
              - - header 1
                - header 2
                - header 3
              - - '#item +id'
                - '#item +name'
                - '#item +value'
              - - ACME1
                - ACME Inc.
                - '123'
              - - XPTO1
                - XPTO org
                - '456'
fititnt commented 3 years ago

About gettext

I just learned that it is possible to translate even command line options with Python. From the GNU gettext book, the first 200 pages are the most relevant; a good part of them talks about the challenges of translation. The author seems to be someone who speaks French.

[Trivia] Even proper names in Latin script need localization (i.e. conversion of the script)

4.9 Marking Proper Names for Translation

Should names of persons, cities, locations etc. be marked for translation or not? People who only know languages that can be written with Latin letters (English, Spanish, French, German, etc.) are tempted to say “no”, because names usually do not change when transported between these languages. However, in general when translating from one script to another, names are translated too, usually phonetically or by transliteration. For example, Russian or Greek names are converted to the Latin alphabet when being translated to English, and English or French names are converted to the Katakana script when being translated to Japanese. This is necessary because the speakers of the target language in general cannot read the script the name is originally written in.

Good to know. Maybe we never implement it, but if we have to, at least the command line options (not just the help messages) should allow translation upfront.
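A minimal sketch of the idea, assuming a hypothetical gettext domain 'hdp' with compiled .mo files under ./locale (fallback=True keeps it working without them); note that the option string itself goes through gettext, not only the help text:

import argparse
import gettext

# Assumption: translations compiled to ./locale/<lang>/LC_MESSAGES/hdp.mo;
# fallback=True keeps this working even when no .mo file is installed.
translation = gettext.translation('hdp', localedir='locale', fallback=True)
_ = translation.gettext

parser = argparse.ArgumentParser(description=_('Process HDP files'))
# The option string itself is passed through gettext, not only the help text.
parser.add_argument(
    '--' + _('objectivum-linguam'),
    dest='objectivum_linguam',
    help=_('Target language (ISO 639-3 code)'),
)
args = parser.parse_args()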

fititnt commented 3 years ago

So I just got an idea, and hxlm.core.localization has some reusable ways to implement it on hxlm.core.model.hdp.

1. Fact: it is actually common for people who work in the humanitarian area to know more than one language!

Even if the HDP core vocabulary allows changing keywords and, by design, keeps all variants perfectly 100% valid, in some cases (like human descriptions) it is expected that some fields will not be just keywords.

So, even if the library itself can deal (obviously with a subset of verbs) with more languages than the user, there will be cases where, instead of forcing all default output into a single language, hdpcli could selectively choose what to convert for preview.

1.1 Idea as default behavior: for documents already in a language the user knows, show them as they are instead of forcing the user's primary language or using one language for everyone

Even if the user does not add extra parameters or configure the system further, the idea is to detect the environment variables and, if a document exists that is an original source (even if not in the user's primary language), choose to show the original.

(This is heavily inspired by how websites are supposed to work: detect the browser language.)
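A rough sketch of that default behavior (the helper names are illustrative; mapping the two-letter locale codes to the ISO 639-3 codes used by HDP is left out):

import os

def user_languages():
    """Guess the languages the user knows from common POSIX environment
    variables (LANGUAGE, LC_ALL, LC_MESSAGES, LANG), most preferred first.
    Mapping these two-letter locale codes to the ISO 639-3 codes used by
    HDP is left out of this sketch."""
    raw = []
    if os.environ.get('LANGUAGE'):
        raw.extend(os.environ['LANGUAGE'].split(':'))
    for var in ('LC_ALL', 'LC_MESSAGES', 'LANG'):
        if os.environ.get(var):
            raw.append(os.environ[var])
    # 'pt_BR.UTF-8' -> 'pt'
    return [item.split('.')[0].split('_')[0].lower() for item in raw if item]

def choose_display_language(available, original_language):
    """Show the original as-is when the user already knows its language;
    otherwise fall back to the first known language that has a version,
    and only then to the original."""
    known = user_languages()
    if original_language in known:
        return original_language
    for lang in known:
        if lang in available:
            return lang
    return original_language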

2. Where this may actually make more impact

While the core verbs are supposed to be equally equivalent across EVERY language (we're almost there on translating back from non-Latin documents!), if HDP starts to carry large parts that are not the main focus (think embedding a text document like the Universal Declaration of Human Rights), this could selectively reduce the usage of other APIs (for example, only do machine translation for a quick preview; another feature is that, if we allow a document to have multiple sources in specific languages, the HDP tools could choose an original among the options that already exist).

In general, HDP would be especially optimized for people who know more than one natural language!

3. Would allowing such a level of localization not make debugging hard? Or: "most people already use English/French/(in Asia they use others)"?

I believe the benefits outweigh the problems, in particular considering that HDP files could themselves become a wrapper for how to access datasets (with HXL data processing recipes, even if they would not need to be in the exact format). Since it could be easy to create the files, this means we're somewhat exposing the functionality of HXL-Proxy or the official hxl CLI tools to new groups.

In other words: we're abstracting both how to transform data and how people can create recipes for this data. (As for how to access it, that is on urnresolver: Uniform Resource Names - URN Resolver #13, but the TL;DR is that the HDP files themselves make it as hard as possible for people to put passwords or direct access to resources in them.)

fititnt commented 3 years ago

Worth mentioning: this article https://www.kevinhsieh.net/2019/02/27/chinese-macrolanguage/ by Kevin Hsieh @kahsieh is relevant to this topic.

Also, Kevin's reference to the book CJKV Information Processing seems (at least to me) like a good eventual read.

fititnt commented 3 years ago

OK. I think I discovered one way to convert from Russian (in practice, any non-Latin language) to other languages without (this is important) it becoming a looping hell: instead of doing even more loops, we invert the reference dictionary (core_vocab.yml). For this reference dictionary we're using the name "Vocabulary Knowledge Graph" (just to differentiate it from the Localization Knowledge Graph).

Ok, "invert" actually is not the right term. We "copy" the entire VKG (Latin to all other languages) to a new one, and then add new entries, where the new keys are ids IDs of the new dictionary. If there's some attribute or root term that already is equal to latin, it replaces the one from Latin. This type of dictionary implicitly helps if for some reason one word used like in Russia was actually one of the core terms (Latin).

I did not try all the possibilities, but it is possible to mix in at least one additional language and it is still able to transpose.
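A minimal sketch of that "copy and extend" step, with a hypothetical fragment of the VKG as a plain dict (the real core_vocab.yml is richer than this):

def build_reverse_vkg(vkg_latin, lang):
    """Copy the Latin-keyed VKG and add entries keyed by the terms of
    `lang`, each pointing back to the same entry. If a term in `lang` is
    spelled exactly like the Latin one it simply replaces that entry,
    which is harmless: it resolves to the same place."""
    reverse = dict(vkg_latin)  # the original Latin keys keep working
    for latin_term, translations in vkg_latin.items():
        term = translations.get(lang)
        if term:
            reverse[term] = translations
    return reverse

# Hypothetical fragment of the VKG loaded as a plain dict:
vkg = {
    'silo': {'LAT': 'silo', 'POR': 'silo', 'RUS': 'силосная'},
    'datum': {'LAT': 'datum', 'POR': 'dados', 'RUS': 'данные'},
}
rus_vkg = build_reverse_vkg(vkg, 'RUS')
assert rus_vkg['данные'] is vkg['datum']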

1. Explaining in other words

2. The strategy of "translation" is (at the moment) called "transposition" in the source code

Trivia: a long time ago, when machine translation was not as powerful as today, much of the translation (think Portuguese to Russian) was done using English as an intermediate language (this often also led to a lot of issues, because English has fewer grammatical distinctions than several other languages). Today some language pairs still use such a pivot language.

The underlying functions (which are now broken into more reusable pieces) are a bit easier to test automatically. And testing actually is important, because even if the end result (i.e. adding new entries to YAML/JSON files) becomes easier for non-programmers, the steps to get there need some work to guarantee stability early on, so that adding new features is not something to fear.

The idea of starting to call it "transposition" (instead of translation) is, at a bare minimum, because I came to the conclusion that some conversions may actually be useful within the same natural language. It also sounds strange to call a function "translate" if the task is to convert "from English to English with this new digital signature".

Transposition not just of verbs (the key terms of HDP files) but also of values is not implemented... yet.

At the moment, I believe that getting it right, so that every one of the 8 languages is equally valid as the source of truth, is a great baseline. I'm not saying that this alone is useful, but

fititnt commented 3 years ago

It is already possible to go back from any known language as if it were the reference one :,)

>>> import hxlm.core as HXLm
>>> UDUR_LAT = HXLm.util.load_file(HXLm.HDATUM_UDHR + '/udhr.lat.hdp.yml')
>>> HXLm.L10N.get_language_from_hdp_raw(UDUR_LAT[0])['iso3693']
'LAT'
>>> UDUR_LAT2RUS = HXLm.L10N.transpose_hsilo(UDUR_LAT, 'RUS-Cyrl')
>>> UDUR_LAT2RUS[0]['силосная']['тег']
['udhr']
>>> UDUR_LAT2RUS2POR = HXLm.L10N.transpose_hsilo(
...    UDUR_LAT2RUS, 'POR', 'RUS'
... )
>>> UDUR_LAT2RUS2POR[0]['silo']['etiqueta']
['udhr']

Note that with HXLm.L10N.transpose_hsilo, when going from Russian to Portuguese, I had to force the language it has (the header is still as if it were Latin).

I think we can bump the version, but without [meta] HDP files strategies of integrity and authenticity (hash, digital signatures, ...) #17 we can't automate even more.

But it is possible. It's just that getting through the first steps is not trivial.

fititnt commented 3 years ago

I believe we will need some sort of way to express rank/ordering as part of the internationalization/localization feature.

The checksums already return an S-expression-like hash. This means it is possible to have a compact form to express how it was done, but also S-expressions are easier to build parsers for, and they could even be translatable. But beyond the name of the algorithm there is a need to express "what" was hashed. While users could customize their own strings, we could provide some way so that even the special values would be translatable. This nice-to-have alone could be sufficient for people to accept the defaults.
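Just to illustrate the shape (this is not the actual format produced by the library): an S-expression-like descriptor can carry the algorithm name, a translatable token for "what" was hashed, and the digest itself:

import hashlib

def checksum_sexpr(payload, hashed_what='1'):
    """Build an S-expression-like descriptor: (sha384 <what> <hexdigest>).
    `hashed_what` is meant to be a translatable token; here the numeral
    '1', following the idea of starting ranks at one rather than zero."""
    digest = hashlib.sha384(payload).hexdigest()
    return '({0} {1} {2})'.format('sha384', hashed_what, digest)

print(checksum_sexpr(b'salve mundi'))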

Why numerals (or order of letters in an alphabet)

While some uses would not always be a ranked system, it is easier to create Knowledge Graphs that map numbers from different writing systems (we're already using the 6 UN working languages plus Portuguese and Latin), so this new feature is actually not that hard. So if someone is executing a query on a document that is not on a local disk in the exact writing system, whoever did not use specific strings would still be able to use any search query term and it would work.

Start with one (avoid using zero)

For the sake of simplifying potential translations, since a decision needs to be made between starting ranks at zero or one (computers often start with zero), I think we should avoid relying on the meaning of zero. The meaning of zero is not easy to translate/localize. Also, even in natural languages that have the concept of zero, like English, the words used to describe zero tend to have many more synonyms, and if for some reason people try to brute-force understanding, the term for 'zero' is more likely to be understood as a string instead of being converted to some more international meaning.

We may still use zero internally (we can't change programming language interpreters), but at least the terms designed for humans to understand could start with 1. It simplifies documentation.


Edit: link to S-expression.

fititnt commented 3 years ago

See also:

(Screenshots: 2021-03-29 12-34-15, 2021-03-29 12-35-12)

Python 3 and Unicode identifiers (letters, even non-Latin, are OK; math symbols are not)

Just to mention that Python 3 accepts almost any character someone is willing to put into an identifier (even ç or Greek letters like λ), but Unicode mathematical symbols that are not also letters are not accepted as valid identifiers.
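A quick way to check this from the interpreter, using str.isidentifier():

print('ç'.isidentifier())        # True  - a letter, fine in an identifier
print('λ'.isidentifier())        # True  - a Greek letter, also fine
print('café_λ1'.isidentifier())  # True  - letters plus a digit after the start
print('∑'.isidentifier())        # False - a math symbol that is not a letter
print('2x'.isidentifier())       # False - cannot start with a digit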

While for the first versions I don't plan to suggest going full math symbols for everything (for that we implement localized translations for each language), at least for features that are not meant for the average user we could do it with special characters that do not mean anything in most languages. But, for the record, I'm mentioning this point because it may affect some decisions.

Or maybe this would still be possible, but not literally as identifiers, even for the very internal Python implementation.

fititnt commented 3 years ago

Even if it would NOT be recommended for end users (think of someone creating rules for design by contract and making mistakes), there are some English keywords that (if not in English) would need to be defined in Latin, and from Latin extended to every natural language. The bare minimum keywords tend to be ATOM, CAR, CDR, COND, CONS, EQ, QUOTE (sometimes LAMBDA, sometimes abbreviated as λ).

Since we're not in the 1960s anymore, whoever develops the compilers could already use an alphabet that does not use Latin at all. This decision could simplify some work: it is neutral, and it could be loaded by default along with some other mathematical operations (like + and -, like Ada).

This is the current draft:

# This is a draft of what neutral names could be used
b:
  ATOM:
    _*_
  CAR:
    _^_ (How will this behave in right-to-left languages when composed, like CADR _^~_ & CDAR _~^_ ?)
  CDR:
    _~_ (How will this behave in right-to-left languages when composed, like CADR _^~_ & CDAR _~^_ ?)
  COND:
    _?_
  CONS:
    _*_
  EQ:
    _=_
  LAMBDA:
    _λ_   (Not ideal; it is a letter from a specific writing system)
    ___   (3 underscores seem anonymous enough and are neutral)
  PRINT:
    _:_
  QUOTE:
    _"_
  READ:
    ???  (TODO: think about)
  DEFINE, DEF, DEFN, etc:
    _#_
  L:
    _ISO369-3_ltr-text
    rtl-text_ISO369-3_
          (Note: non-Latin alphabets may need some work to discover how to use term for them)
  "+":
  "-":
  "*":
  "/":

For safety reasons, recommend not using the fallback terms when localization is available (maybe use a nudge)

The reasoning for these keywords: while they may actually be available in any language loaded (and maybe, when the user uses some specialized text editor, these terms could be autocompleted to the localized words), they should not be used except for debugging, or (maybe when a new language really was added but some term is still missing) only that missing term falls back to these ones.

The problem is that we often may have experts working with people who are not experts. So what is OK for one may not be OK for the other.

So, in this case, in addition to maybe working with the most common user interfaces that help developers create these scripts, almost every tool that sees the more internal keywords could nudge (see https://en.wikipedia.org/wiki/Nudge_theory) the user, for example by implicitly converting to the more verbose format, to the point that the user has to disable this if they really want to use the internal terms.
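A tiny sketch of that nudge (the helper and the mapping are hypothetical; the real mapping would come from the loaded Localization Knowledge Graph):

import warnings

def nudge_internal_keywords(tokens, localized, allow_internal=False):
    """Implicitly convert internal fallback keywords (e.g. '_?_' for COND)
    to the verbose localized ones, warning the user each time, unless the
    user explicitly opted in to keep the internal terms."""
    converted = []
    for token in tokens:
        if token in localized and not allow_internal:
            warnings.warn("internal keyword '{0}' rewritten as '{1}'".format(
                token, localized[token]))
            converted.append(localized[token])
        else:
            converted.append(token)
    return converted

# Hypothetical localized mapping; the real one would come from the
# loaded Localization Knowledge Graph.
print(nudge_internal_keywords(['_?_', '_=_'], {'_?_': 'cond', '_=_': 'eq'}))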

So, why not fall back to the ASCII ATOM, CAR, CDR, COND, CONS, EQ, QUOTE instead of creating new terms (if this could be dangerous)?

fititnt commented 3 years ago

The https://github.com/EticaAI/HXL-Data-Science-file-formats/ontologia directory (and the public endpoint https://hdp.etica.ai/ontologia/), which until a few hours ago was deeper inside hxlm/ontologia (the Python implementation), is now at the root of the project, and I'm dedicating some time to merging some datasets that are pertinent! This needs some care, so let's put it in a single place. Anyway, https://hdp.etica.ai/ontologia/ also exposes it for whoever can't download all the tables (likely hxlm-js later, when running in the browser and needing to build a local cache).

The ontologia/

While part of the ontologia is mostly for the Knowledge Graphs (Localization Knowledge Graph, Vocabulary Knowledge Graph) and already has a draft of the internals of HDPLisp, the idea here is to have a single place for every package in this repository to get the data from. (This is why there are a few symlinks.)

This also means that people trying to understand how the internals work, or maybe just doing some quick integration without actually loading the libraries from here, can consume just the data. Also, by having a single place to put "all shared knowledge" of all the underlying implementations of the tools here, we can test everything together.
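For example, a quick integration could consume the data straight from the public endpoint (the exact file name below is an assumption, and I'm assuming the file is a YAML mapping; adjust to whatever is actually published under https://hdp.etica.ai/ontologia/):

import urllib.request
import yaml  # PyYAML, assumed to be available

# Assumed path; the actual file name/location under /ontologia/ may differ.
URL = 'https://hdp.etica.ai/ontologia/core_vocab.yml'

with urllib.request.urlopen(URL) as response:
    core_vocab = yaml.safe_load(response.read())

print(sorted(core_vocab)[:10])  # peek at the first few root terms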

About Monolith / Monorepo

Just to mention that putting several implementations in a single GitHub repository is not considered (on average) a good practice. But in some cases (or at least at this moment, when we're writing the same concept for more than one programming language) a monorepo can work to allow consistent testing.

But when necessary (like when it starts to make testing slower, not faster) we can split it into more projects.