fititnt / hxltm-action

[non-production-ready] Multilingual Terminology in Humanitarian Language Exchange. TBX, TMX, XLIFF, UTX, XML, CSV, Excel XLSX, Google Sheets, (...)
https://hxltm.etica.ai/
The Unlicense

Implement and document HXLTM ad hoc templated file generation based on source multilingual dataset (use case: generate monolingual templates, like translated documentation) #3

Open fititnt opened 2 years ago

fititnt commented 2 years ago

Links

HXLTM has an experimental feature implemented (but not fully documented), currently called HXLTM Ad Hoc Fōrmulam (HXLTM templated export). From the output of hxltmcli --help:

# (...)
  --objectivum-formulam OBJECTIVUM_FORMULAM
                        Template file to use as reference to generate an output. Less powerful than custom file but can be used for simple cases.
# (...)

The idea of this topic (maybe also with an example for #2) is to use hxltm-action to showcase this feature. One ideal example might be using it to store translations of README.md files in a separate place, then automatically generate the readmes.

This, combined with fetching translations from remote sources (like Google Sheets), could allow creating translations for projects (even ones as simple as README files).

fititnt commented 2 years ago

It's working!

This action step

\`\`\`yaml
      #### HXLTM ad hoc templated export _______________________________________
      - name: ".github/hxltm/hxltmcli.py --objectivum-formulam data/README.🗣️.md --objectivum-linguam hin-Deva@hi data/exemplum/hxltm-exemplum-linguam.tm.hxl.csv data/README.hin-Deva.md"
        uses: fititnt/hxltm-action@main
        continue-on-error: true
        with:
          bin: ".github/hxltm/hxltmcli.py"
          args: |
            --objectivum-formulam data/README.🗣️.md
            --objectivum-linguam hin-Deva@hi
          infile: data/exemplum/hxltm-exemplum-linguam.tm.hxl.csv
          outfile: data/README.hin-Deva.md
\`\`\`

... using the dataset with versions in each language

(it's the example dataset; omitted here)

... converts this template at data/README.🗣️.md

\`\`\`json
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "Testum",
    "type": "object",
    "properties": {
        "L10N_ego_summarius": {
            "description": "{% _🗣️ L10N_ego_summarius 🗣️_ %}",
            "type": "string",
            "example": ""
        },
        "L10N_ego_codicem": {
            "description": "{% _🗣️ L10N_ego_codicem 🗣️_ %}",
            "type": "string",
            "example": ""
        },
        "L10N_ego_linguam_nomen": {
            "description": "{% _🗣️ L10N_ego_linguam_nomen 🗣️_ %}",
            "type": "string",
            "example": ""
        },
        "L10N_ego_scriptum_nomen": {
            "description": "{% _🗣️ L10N_ego_scriptum_nomen 🗣️_ %}",
            "type": "string",
            "example": ""
        },
        "L10N_ego_patriam_UN_M49_numerum": {
            "description": "{% _🗣️ L10N_ego_patriam_UN_M49_numerum 🗣️_ %}",
            "type": "string",
            "example": ""
        },
        "L10N_ego_patriam_UN_P_codicem": {
            "description": "{% _🗣️ L10N_ego_patriam_UN_P_codicem 🗣️_ %}",
            "type": "string",
            "example": ""
        },
        "I18N_testum_salve_mundi_testum_I18N": {
            "description": "{% _🗣️ I18N_testum_salve_mundi_testum_I18N 🗣️_ %}",
            "type": "string",
            "example": ""
        },
        "I18N_إختبار_טעסט_测试_테스트_испытание_I18N": {
            "description": "{% _🗣️ I18N_إختبار_טעסט_测试_테스트_испытание_I18N 🗣️_ %}",
            "type": "string",
            "example": ""
        },
        "I18N_०१२३४५६७८९_〇一二三四五六七八九十百千万亿_-1+2/3*4_٩٨٧٦٥٤٣٢١٠_零壹贰叁肆伍陆柒捌玖拾佰仟萬億_I18N": {
            "//description": " _🗣️ I18N_०१२३४५६७८९_〇一二三四五六七八九十百千万亿_-1+2/3*4_٩٨٧٦٥٤٣٢١٠_零壹贰叁肆伍陆柒捌玖拾佰仟萬億_I18N 🗣️_  ",
            "//comment": "jg-rp/liquid complaints about: + - * /",
            "description": "{% _🗣️ I18N_०१२३४५६७८९_〇一二三四五六七八九十百千万亿_1234_٩٨٧٦٥٤٣٢١٠_零壹贰叁肆伍陆柒捌玖拾佰仟萬億_I18N 🗣️_ %}",
            "type": "string",
            "example": "",
            "//test2": "{% _🗣️ 👁️lat-Latn👁️ 👂Dominium publicum👂 👁️lat-Latn👁️ 🗣️_ %}"
        }
    }
}
\`\`\`

is transformed into data/README.hin-Deva.md:


\`\`\`json
{
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "$id": "Testum",
    "type": "object",
    "properties": {
        "L10N_ego_summarius": {
            "description": "हिन्दी भाषा (देवनागरी लिपि)",
            "type": "string",
            "example": ""
        },
        "L10N_ego_codicem": {
            "description": "hin-Deva",
            "type": "string",
            "example": ""
        },
        "L10N_ego_linguam_nomen": {
            "description": "हिन्दी भाषा",
            "type": "string",
            "example": ""
        },
        "L10N_ego_scriptum_nomen": {
            "description": "देवनागरी लिपि",
            "type": "string",
            "example": ""
        },
        "L10N_ego_patriam_UN_M49_numerum": {
            "description": "001",
            "type": "string",
            "example": ""
        },
        "L10N_ego_patriam_UN_P_codicem": {
            "description": "∅",
            "type": "string",
            "example": ""
        },
        "I18N_testum_salve_mundi_testum_I18N": {
            "description": "नमस्ते दुनिया",
            "type": "string",
            "example": ""
        },
        "I18N_إختبار_טעסט_测试_테스트_испытание_I18N": {
            "description": "परीक्षा, १, २, ३",
            "type": "string",
            "example": ""
        },
        "I18N_०१२३४५६७८९_〇一二三四五六七八九十百千万亿_-1+2/3*4_٩٨٧٦٥٤٣٢١٠_零壹贰叁肆伍陆柒捌玖拾佰仟萬億_I18N": {
            "//description": " _ I18N_०१२३४५६७८९_〇一二三四五六七八九十百千万亿_-1+2/3*4_٩٨٧٦٥٤٣٢١٠_零壹贰叁肆伍陆柒捌玖拾佰仟萬億_I18N   ",
            "//comment": "jg-rp/liquid complaints about: + - * /",
            "description": "!!!I18N_०१२३४५६७८९_〇一二三四五六七八九十百千万亿_1234_٩٨٧٦٥٤٣٢١٠_零壹贰叁肆伍陆柒捌玖拾佰仟萬億_I18N!!!",
            "type": "string",
            "example": "",
            "//test2": "!!!👁️lat-Latn👁️ 👂Dominium publicum👂 👁️lat-Latn👁️!!!"
        }
    }
}
\`\`\`

One point of improvement

One not-so-nice issue (less visible with a small number of languages, but the idea is to scale up) is that, without some syntactic sugar, each language to generate requires a different hxltmcli command.

This means that 6 languages (sorry, I forgot Russian and French in the test dataset, so it would be at least 8) require 6 code repetitions where the only thing that changes is the language code.

(Screenshot from 2021-11-07 08:10 showing the repeated action steps)

Well, it's already good, just not as great as it could be, but it works.
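To make the redundancy concrete, here is a hypothetical illustration: the second step below is invented by copying the working hin-Deva step and changing only the language code and output path — everything else repeats verbatim for each language.

```yaml
  # Only --objectivum-linguam and outfile differ between these two steps
  - uses: fititnt/hxltm-action@main
    with:
      bin: ".github/hxltm/hxltmcli.py"
      args: |
        --objectivum-formulam data/README.🗣️.md
        --objectivum-linguam hin-Deva@hi
      infile: data/exemplum/hxltm-exemplum-linguam.tm.hxl.csv
      outfile: data/README.hin-Deva.md
  - uses: fititnt/hxltm-action@main
    with:
      bin: ".github/hxltm/hxltmcli.py"
      args: |
        --objectivum-formulam data/README.🗣️.md
        --objectivum-linguam por-Latn@pt
      infile: data/exemplum/hxltm-exemplum-linguam.tm.hxl.csv
      outfile: data/README.por-Latn.md
```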

fititnt commented 2 years ago

Okay. I think one approach here would be to start preparing hxltmcli to support this type of syntax:

.github/hxltm/hxltmcli.py \
  --objectivum-formulam data/README.🗣️.md \
  --objectivum-linguam por-Latn@pt \
  data/exemplum/hxltm-exemplum-linguam.tm.hxl.csv \
  data/README.{[iso6393]}-{[iso115924]}.md

... in such a way that, when it detects a specific objective language (in HXLTM slang, an HXLTMLinguam, like por-Latn@pt), the data/README.{[iso6393]}-{[iso115924]}.md would be replaced by data/README.por-Latn.md.
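A minimal sketch of how that placeholder expansion could work; the function name and the parsing strategy here are illustrative, not the actual hxltmcli implementation:

```python
# Hypothetical sketch: expand {[iso6393]} / {[iso115924]} markers in an
# output path from an HXLTMLinguam such as "por-Latn@pt".
def expand_output_path(path_template: str, linguam: str) -> str:
    # "por-Latn@pt" -> drop the BCP47 part after "@", then split into
    # the ISO 639-3 code ("por") and the ISO 15924 script code ("Latn")
    iso6393, iso115924 = linguam.split("@")[0].split("-", 1)
    return (path_template
            .replace("{[iso6393]}", iso6393)
            .replace("{[iso115924]}", iso115924))

print(expand_output_path("data/README.{[iso6393]}-{[iso115924]}.md",
                         "por-Latn@pt"))
# → data/README.por-Latn.md
```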

This step reduces some redundant code (and also simplifies the hxltm-action, which uses the underlying Python CLI tooling).

Some points

It is still better to keep each hxltmcli / hxltmdexml call as an individual step.

For the short term, the hxltm-action (at least the entrypoint.sh) could do some sort of looping. This would also help detect errors for individual languages without breaking everything else (and it simplifies the Python CLI tooling a bit).
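A minimal sketch of what that looping could look like, written in Python for illustration (the real entrypoint.sh is shell). The language list and paths are taken from the examples in this thread; build_command is an invented helper, not part of hxltmcli:

```python
# Hypothetical sketch: loop over target languages in one action run, so a
# failure in one language does not abort generation for the others.
import subprocess

LANGUAGES = ["por-Latn@pt", "hin-Deva@hi", "rus-Cyrl@ru"]  # example list

def build_command(linguam: str) -> list:
    """Build the hxltmcli argument list for one objective language."""
    iso6393, iso115924 = linguam.split("@")[0].split("-", 1)
    return [
        ".github/hxltm/hxltmcli.py",
        "--objectivum-formulam", "data/README.🗣️.md",
        "--objectivum-linguam", linguam,
        "data/exemplum/hxltm-exemplum-linguam.tm.hxl.csv",
        "data/README.{0}-{1}.md".format(iso6393, iso115924),
    ]

failures = []
for linguam in LANGUAGES:
    try:
        subprocess.run(build_command(linguam), check=True)
    except (subprocess.CalledProcessError, OSError):
        failures.append(linguam)  # record the failed language, keep going
```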

To make hxltmcli / hxltmdexml powerful enough to perform multiple operations, it would be better to follow the idea of what the HXLStandard CLI tooling calls "JSON specs" (see https://github.com/HXLStandard/libhxl-python/wiki/JSON-specs).
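Purely as a hypothetical illustration of the direction (none of these keys exist today, and the actual HXLStandard JSON-spec schema is different), a single spec file could describe several per-language exports at once:

```json
{
    "fontem": "data/exemplum/hxltm-exemplum-linguam.tm.hxl.csv",
    "formulam": "data/README.🗣️.md",
    "objectivum": [
        {"linguam": "por-Latn@pt", "exitum": "data/README.por-Latn.md"},
        {"linguam": "hin-Deva@hi", "exitum": "data/README.hin-Deva.md"}
    ]
}
```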

The syntactic sugar for hxltm-action may require implementing the concepts of "working language" and "auxiliary language"

The underlying CLI tooling uses Latin for both concepts: --agendum-linguam for the working language and --auxilium-linguam for the auxiliary language. The first is somewhat related to all possible languages that, if they exist in the HXLTM source reference, will be used. The second, while not fully implemented, specifies which fallback language to use when no translation is available.

hxltmcli --help
# hxltmcli v0.8.7
# (...)
  --agendum-linguam agendum_linguam, -AL agendum_linguam
                        (Planned, but not fully implemented yet) Restrict working
                        languages to a list. Useful for HXLTM to HXLTM or multilingual
                        formats like TBX and TMX. Requires: multilingual operation.
                        Accepts multiple values.

# (...)
  --auxilium-linguam auxilium_linguam, -AUXL auxilium_linguam
                        (Planned, but not implemented yet) Define auxiliary language.
                        Requires: bilingual operation (and file format allow
                        metadata). Default: Esperanto and Interlingua Accepts multiple
                        values.

# (...)
  --objectivum-formulam OBJECTIVUM_FORMULAM
                        Template file to use as reference to generate an output. Less
                        powerful than custom file but can be used for simple cases.

The problem with a hardcoded auxiliary language (cases where multiple 100% officially valid source languages already exist)

If the next step is implemented with --auxilium-linguam (which I think already works for defining several fallback options), the HXLTM ontologia will eventually know how close languages are to each other.

This means a user may use --auxilium-linguam to hardcode English as the fallback language even when a close language does exist (this case is very relevant for macrolanguages). There should be some way for the ontologia AND the dataset (which would likely be controlled by volunteers, maybe even language regulators) to make it harder (or even require an extra parameter) to break the operation. I mean, --auxilium-linguam can work in the short term, but ideally the project should protect even the developer assembling the results from making mistakes in languages they don't know.
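One way the ontologia-driven safeguard could work, as a minimal sketch: refuse a distant fallback unless explicitly overridden. The distance table, threshold, and function name below are invented for illustration; the idea is only that language closeness would come from the ontologia rather than from whatever the developer hardcodes.

```python
# Hypothetical sketch: pick a fallback language using an (invented)
# closeness table, and fail loudly instead of silently falling back to an
# unrelated language such as hardcoded English.
def choose_fallback(target, available, distances, max_distance=1):
    """Return the closest available language to `target`.

    `distances` maps (target, candidate) pairs to a closeness score;
    unknown pairs are treated as very distant (99).
    """
    candidates = [(distances.get((target, lang), 99), lang)
                  for lang in available]
    distance, best = min(candidates)
    if distance > max_distance:
        # No sufficiently close language: require an explicit override
        # rather than guessing on behalf of languages nobody checked.
        raise ValueError("no sufficiently close fallback for " + target)
    return best
```

Usage: with an illustrative table saying Galician is close to Portuguese while English is not, choose_fallback("por-Latn", ["glg-Latn", "eng-Latn"], table) would pick "glg-Latn", and would raise if only "eng-Latn" were available.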