EticaAI / hxltm

HXLTM - Multilingual Terminology in Humanitarian Language Exchange. TBX, TMX, XLIFF, UTX, XML, CSV, Excel XLSX, Google Sheets, (...)
https://hxltm.etica.ai
The Unlicense

`hxltmcli .asa.hxltm.json / .asa.hxltm.yml`: HXLTM Abstractum Syntaxim Arborem #3

Open fititnt opened 3 years ago

fititnt commented 3 years ago
# @ARCHIVUM       ontologia/cor.hxltm.yml
# @DESCRIPTIONEM  HXL Trānslātiōnem Memoriam (HXLTM)
# @LICENTIAM      Dominium publicum
formatum:
  # (...)

  HXLTM-ASA:
    __meta:
      archivum_extensionem: 
        - .asa.hxltm.json
        - .asa.hxltm.yml
      normam:
        - https://hdp.etica.ai/hxltm/archivum/#HXLTM-ASA
      descriptionem: |
        _[eng-Latn]
        The HXLTM-ASA is a not strictly documented Abstract Syntax Tree
        of a data conversion operation.

        This format, unlike the HXLTM permanent storage, is not
        meant to be used by end users. In fact, JSON (or other
        formats, like YAML) is more a tool for users debugging the initial
        reference implementation hxltmcli OR for developers using JSON
        as a more advanced input than the end-user permanent storage.

        Warning: The HXLTM-ASA is not meant to be a strictly documented format
        even if HXLTM eventually gets used by the general public. If necessary,
        some special format could be created, but this would require feedback
        from the community or some work already done by implementers.
        [eng-Latn]_

        Trivia:
          - abstractum, https://en.wiktionary.org/wiki/abstractus#Latin
          - syntaxim, https://en.wiktionary.org/wiki/syntaxis#Latin
          - arborem, https://en.wiktionary.org/wiki/arbor#Latin
          - conceptum de Abstractum Syntaxim Arborem
            - https://www.wikidata.org/wiki/Q127380
      nomen:
        eng-Latn: 'HXLTM Abstractum Syntaxim Arborem'
      situs_interretialis:
        referens_officinale:
          - https://hdp.etica.ai/hxltm
          - https://github.com/EticaAI/HXL-Data-Science-file-formats/issues/223
          - https://github.com/EticaAI/HXL-Data-Science-file-formats/labels/HXLTM

The idea of creating a format that uses HXL to store both translation memories (not just the XLIFF format) and glossaries, especially terminology, is hardcore. Not so much because of the code implementation, but because the problem it tries to abstract is complex.

Even if mostly for internal usage (e.g. not strictly documented for external use), instead of just 'converting' HXLated data (aka CSVs) to other formats (especially the XML ones), we're already drafting what could be called an Abstract Syntax Tree (https://en.wikipedia.org/wiki/Abstract_syntax_tree). It can be a simple one, but at least we're not passing raw CSV pointers to the converters.
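
To make the difference from "raw CSV pointers" concrete, here is a minimal sketch in Python of turning HXLated CSV rows into a small tree that a converter could walk. Everything in it is an assumption for illustration: the hashtags, the `build_tree` helper and the `conceptum` / `terminum` keys are hypothetical and are not the structure that hxltmcli actually emits.

```python
import csv
import io
import json

# Hypothetical HXLated CSV: header row, HXL hashtag row, then data rows.
# The hashtags below are illustrative, not copied from the HXLTM ontology.
HXLATED_CSV = """\
Concept,English,Portuguese
#item+conceptum+codicem,#item+rem+i_eng+is_latn,#item+rem+i_por+is_latn
C1,water,água
C2,shelter,abrigo
"""


def language_key(hashtag):
    """Turn attributes like +i_eng+is_latn into a key like 'eng-Latn'."""
    parts = hashtag.split("+")
    lang = next(p[2:] for p in parts if p.startswith("i_"))
    script = next(p[3:] for p in parts if p.startswith("is_"))
    return f"{lang}-{script.capitalize()}"


def build_tree(text):
    """Group rows by concept code so converters never see column indexes."""
    rows = list(csv.reader(io.StringIO(text)))
    hashtags, data = rows[1], rows[2:]
    code_col = hashtags.index("#item+conceptum+codicem")
    tree = {"conceptum": {}}  # hypothetical root key
    for row in data:
        terms = {}
        for col, tag in enumerate(hashtags):
            if col != code_col and row[col]:
                terms[language_key(tag)] = row[col]
        tree["conceptum"][row[code_col]] = {"terminum": terms}
    return tree


tree = build_tree(HXLATED_CSV)
print(json.dumps(tree, ensure_ascii=False, indent=2))
```

A converter for TBX, XLIFF or any other target would then read `tree` by concept and language, without knowing which CSV column each value came from.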

Comparison to other linguistic Abstract Syntaxes

See also:

It turns out that there are long-standing ideas about abstract linguistic content, but what could be called 'HXLTM ASA' works more at the container level (useful for converting between file types) than at the term level (understanding what a term is in order to translate concepts).

So even if HXLTM ASA becomes usable by external tools, we will not try to micromanage it. BUT one thing we can do here is intentionally make it easy for others to convert to whatever format they want, and not be strict about what HXLTM ASA is, so if someone else wants to inject even more detail at the term level, they can.
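
One way to read "not strict" from the consumer side is a converter that picks up only the keys it understands and silently skips the rest. The sketch below assumes a hypothetical tree layout (`conceptum`, `terminum`, `valorem` and the extra keys are invented for this example, not taken from hxltmcli) in which a term can be either a plain string or an object carrying extra term-level detail.

```python
# Hypothetical tree: the key names are assumptions, not the hxltmcli layout.
TREE = {
    "conceptum": {
        "C1": {
            "terminum": {
                # A producer injected extra term-level detail here.
                "eng-Latn": {"valorem": "water", "pars_orationis": "noun"},
                "por-Latn": "água",
            },
            # Unknown container-level key; this consumer ignores it.
            "x_extra": {"any": "producer-specific detail"},
        },
        "C2": {"terminum": {"eng-Latn": "shelter"}},
    }
}


def term_value(term):
    """Accept a plain string or an object with a 'valorem' field."""
    if isinstance(term, dict):
        return term.get("valorem")
    return term


def glossary_pairs(tree, source, target):
    """Collect (source, target) term pairs, skipping incomplete concepts."""
    pairs = []
    for concept in tree.get("conceptum", {}).values():
        terms = concept.get("terminum", {})
        src = term_value(terms.get(source))
        tgt = term_value(terms.get(target))
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


pairs = glossary_pairs(TREE, "eng-Latn", "por-Latn")
print(pairs)
```

The design choice is that extra detail costs nothing for tools that do not need it, while tools that do need it can read the richer term objects.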

On Grammatical Framework

The Grammatical Framework (which is cited a lot in Abstract Syntax as Interlingua) seems to be the state of the art for understanding sentences in different natural languages. I, Rocha, do not plan to go deep on this, since the short- to medium-term interest is more about how to store terminology and translation memories. If even the minimal implementation to support TBX export can already take time, the best I can do is make it easier (if there is interest years later) for people to use HXLTM dialects to store linguistic data while still having decent portability to other data formats.