EticaAI / hxltm

HXLTM - Multilingual Terminology in Humanitarian Language Exchange. TBX, TMX, XLIFF, UTX, XML, CSV, Excel XLSX, Google Sheets, (...)
https://hxltm.etica.ai
The Unlicense

`hxltmcli --objectivum-XLIFF`: HXL Trānslātiōnem Memoriam -> XLIFF Version 2.1 #1

Closed fititnt closed 3 years ago

fititnt commented 3 years ago

The test file was started here: https://github.com/HXL-CPLP/Auxilium-Humanitarium-API/blob/main/_systema/programma/hxltm2xliff.py.

fititnt commented 3 years ago

I think that, in addition to the final XLIFF format, we're also drafting an intermediate format that sits halfway between the HXL TM file convention and the XLIFF format.

This intermediate step mostly parses the input tm.hxl.csv (or anything else that HXL tools are able to parse, such as Google Sheets or Excel), renames the columns it knows matter for the XLIFF format, and prefixes them with #x_xliff.

For some corner cases, such as XLIFF's lack of support for source terms that are not yet ready for translation (something that may actually be very common in our use cases), we may use the prefix #meta+xliff instead.
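
The renaming step described above could look roughly like the following Python sketch. It is only an illustration and not the code from hxltm2xliff.py: the helper name `mark_xliff_columns`, its parameters, the input hashtags, and the assumption that a term column is recognised by its `+i_xxx+is_xxxx` attributes are all hypothetical.

```python
import re

# Hypothetical pattern: a 3-letter language code followed by a 4-letter script code.
LANG_ATTRS = re.compile(r"\+i_([a-z]{3})\+is_([a-z]{4})")


def mark_xliff_columns(header, source, target):
    """Rename the source/target term columns to #x_xliff+... hashtags.

    header: list of HXL hashtags
    source, target: (language, script) tuples such as ("lat", "latn")
    """
    renamed = []
    for hashtag in header:
        match = LANG_ATTRS.search(hashtag)
        # Only plain term columns are renamed; +alt+list and #meta columns stay as-is.
        is_term = hashtag.startswith("#item+") and not hashtag.endswith("+list")
        if match and is_term:
            lang = match.groups()
            if lang == source:
                hashtag = "#x_xliff+source+i_{}+is_{}".format(*lang)
            elif lang == target:
                hashtag = "#x_xliff+target+i_{}+is_{}".format(*lang)
        renamed.append(hashtag)
    return renamed


# Illustrative input only; the real pre-conversion header is not shown in this thread.
header = [
    "#item+i_la+i_lat+is_latn",
    "#item+i_la+i_lat+is_latn+alt+list",
    "#meta+item+i_la+i_lat+is_latn",
    "#item+i_ar+i_arb+is_arab",
    "#item+i_pt+i_por+is_latn",
]
renamed_header = mark_xliff_columns(header, ("lat", "latn"), ("arb", "arab"))
print(renamed_header)
```

Columns for languages that are neither the chosen source nor the chosen target keep their original hashtags, which matches the examples below where only one source and one target column carry the #x_xliff prefix.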

Current examples

cat _hxltm/schemam-un-htcds-5items.tm.hxl.tsv

#x_xliff+unit+id    #meta+url   #item+wikidata+code #meta+item+url+list #meta+lat_sortem    #status #item+type+lat_dominium+list    #item+type+lat_regnum   #item+type+lat_divisionem   #item+type+lat_classem  #item+type+lat_ordinem  #item+type+lat_familiam #item+type+lat_genus    #item+type+lat_speciem  #item+type+lat_segmentum                        #x_xliff+source+i_lat+is_latn   #item+i_la+i_lat+is_latn+alt+list   #meta+item+i_la+i_lat+is_latn   #item+i_pt+i_por+is_latn    #item+i_pt+i_por+is_latn+alt+list   #meta+item+i_pt+i_por+is_latn   #item+i_en+i_eng+is_latn    #item+i_en+i_eng+is_latn+alt+list   #meta+item+i_en+i_eng+is_latn   #item+i_es+i_spa+is_latn    #item+i_es+i_spa+is_latn+alt+list   #meta+item+i_es+i_spa+is_latn   #x_xliff+target+i_arb+is_arab   #item+i_es+i_arb+is_arab+alt+list   #meta+item+i_es+i_arb+is_arab   #item+i_hi+i_hin+is_deva    #item+i_hi+i_hin+is_deva+alt+list   #meta+item+i_hi+i_hin+is_deva   #item+i_sl+i_slv+is_latn    #item+i_sl+i_slv+is_latn+alt+list   #meta+item+i_sl+i_slv+is_latn
L10N_ego_summarius  [(ℹ️)]  Q1  https://github.com/HXL-CPLP/forum/issues/58|https://example.org 1   2   L10N    L10N    ego                 summarius       Lingua Latina (Abecedarium Latinum) ∅   ∅   Língua portuguesa (alfabeto latino) ∅   ∅   English language (Latin script) ∅   ∅   Idioma español (Alfabeto latino)    ∅∅  اللغة العربية   ∅   يتطلب مراجعة بشرية. हिन्दी भाषा (देवनागरी लिपि) ∅   ∅   Slovenščina (Latinska abeceda)  ∅   ∅
L10N_ego_codicem                2   2   L10N    L10N    ego                 codicem                                     lat-Latn    ∅   ∅   por-Latn    ∅   ∅   eng-Latn    ∅   ∅   spa-Latn    ∅   ∅   arb-Arab    ∅   ∅   hin-Deva    ∅   ∅   slv-Latn    ∅∅
L10N_ego_linguam_nomen              3   2   L10N    L10N    ego linguam             nomen                                       Lingua Latina   ∅   ∅   Língua portuguesa   ∅   ∅   English language    ∅   ∅   Idioma español  ∅   ∅   اللغة العربية   ∅   يتطلب مراجعة بشرية. हिन्दी भाषा ∅   https://www.wikidata.org/wiki/Q1568 Slovenščina ∅   ∅
L10N_ego_scriptum_nomen [(ℹ️)]  Q19845720   https://www.unicode.org/iso15924/   4   2   L10N    L10N    ego scriptum                nomen               Abecedarium Latinum ∅   ∅   Alfabeto latino ∅   ∅   Latin script    ∅   ∅   Alfabeto latino ∅   ∅       ∅   ∅   देवनागरी लिपि   ∅   https://www.wikidata.org/wiki/Q38592    Latinska abeceda    ∅   ∅
L10N_ego_patriam_UN_M49_numerum [(ℹ️)]  Q7865431    https://en.wikipedia.org/wiki/UN_M49    5   2   L10N    L10N    ego patriam UN  M49     numerum             001 ∅   ∅   001 ∅   ∅   001 ∅   ∅   001 ∅   ∅   001 ∅   ∅   001 ∅   ∅   001 ∅   ∅

./_systema/programma/hxltm2xliff.py _hxltm/schemam-un-htcds-5items.tm.hxl.csv --archivum-extensionem=.csv

#x_xliff+unit+id,#meta+url,#item+wikidata+code,#meta+item+url+list,#meta+lat_sortem,#status,#item+type+lat_dominium+list,#item+type+lat_regnum,#item+type+lat_divisionem,#item+type+lat_classem,#item+type+lat_ordinem,#item+type+lat_familiam,#item+type+lat_genus,#item+type+lat_speciem,#item+type+lat_segmentum,,,,,,,,,,,,#x_xliff+source+i_lat+is_latn,#item+i_la+i_lat+is_latn+alt+list,#meta+item+i_la+i_lat+is_latn,#item+i_pt+i_por+is_latn,#item+i_pt+i_por+is_latn+alt+list,#meta+item+i_pt+i_por+is_latn,#item+i_en+i_eng+is_latn,#item+i_en+i_eng+is_latn+alt+list,#meta+item+i_en+i_eng+is_latn,#item+i_es+i_spa+is_latn,#item+i_es+i_spa+is_latn+alt+list,#meta+item+i_es+i_spa+is_latn,#x_xliff+target+i_arb+is_arab,#item+i_es+i_arb+is_arab+alt+list,#meta+item+i_es+i_arb+is_arab,#item+i_hi+i_hin+is_deva,#item+i_hi+i_hin+is_deva+alt+list,#meta+item+i_hi+i_hin+is_deva,#item+i_sl+i_slv+is_latn,#item+i_sl+i_slv+is_latn+alt+list,#meta+item+i_sl+i_slv+is_latn
L10N_ego_summarius,[(ℹ️)],Q1,https://github.com/HXL-CPLP/forum/issues/58|https://example.org,1,2,L10N,L10N,ego,,,,,summarius,,,,,,,,,,,,,Lingua Latina (Abecedarium Latinum),∅,∅,Língua portuguesa (alfabeto latino),∅,∅,English language (Latin script),∅,∅,Idioma español (Alfabeto latino),∅,∅,اللغة العربية,∅,يتطلب مراجعة بشرية.,हिन्दी भाषा (देवनागरी लिपि),∅,∅,Slovenščina (Latinska abeceda),∅,∅
L10N_ego_codicem,,,,2,2,L10N,L10N,ego,,,,,codicem,,,,,,,,,,,,,lat-Latn,∅,∅,por-Latn,∅,∅,eng-Latn,∅,∅,spa-Latn,∅,∅,arb-Arab,∅,∅,hin-Deva,∅,∅,slv-Latn,∅,∅
L10N_ego_linguam_nomen,,,,3,2,L10N,L10N,ego,linguam,,,,nomen,,,,,,,,,,,,,Lingua Latina,∅,∅,Língua portuguesa,∅,∅,English language,∅,∅,Idioma español,∅,∅,اللغة العربية,∅,يتطلب مراجعة بشرية.,हिन्दी भाषा,∅,https://www.wikidata.org/wiki/Q1568,Slovenščina,∅,∅
L10N_ego_scriptum_nomen,[(ℹ️)],Q19845720,https://www.unicode.org/iso15924/,4,2,L10N,L10N,ego,scriptum,,,,nomen,,,,,,,,,,,,,Abecedarium Latinum,∅,∅,Alfabeto latino,∅,∅,Latin script,∅,∅,Alfabeto latino,∅,∅,,∅,∅,देवनागरी लिपि,∅,https://www.wikidata.org/wiki/Q38592,Latinska abeceda,∅,∅
L10N_ego_patriam_UN_M49_numerum,[(ℹ️)],Q7865431,https://en.wikipedia.org/wiki/UN_M49,5,2,L10N,L10N,ego,patriam,UN,M49,,numerum,,,,,,,,,,,,,001,∅,∅,001,∅,∅,001,∅,∅,001,∅,∅,001,∅,∅,001,∅,∅,001,∅,∅

fititnt commented 3 years ago

What started some months ago as hxltm2xliff.py is now a user-configurable generator inside the hxltmcli program (https://hdp.etica.ai/hxltm), with options like hxltmcli --objectivum-XLIFF. See https://hdp.etica.ai/hxltm/archivum/.

How was it done?

The HXLTM ASA (EticaAI/HXL-Data-Science-file-formats#22) abstracts how to iterate over HXL with some conventional extra tags, in such a way that it is possible both to import from XLIFF and to export from HXL to XLIFF just by configuring a custom plugin. So XLIFF, like TBX, TMX, XML, etc., uses a user-friendly syntax for templating, Liquid (https://shopify.github.io/liquid/), plus some extra attributes.

hxltmcli v0.8.7 (which can be used standalone or with the Python package hdp-toolchain, https://pypi.org/project/hdp-toolchain/) uses cor.hxltm.yml, and so does hxltmdexml (which converts back from any XML file used for export). Both are based on this:


  #### XLIFF-obsoletum: XML Localization Interchange File Format (XLIFF) v2.1 __
  # tag::normam_XLIFF[]

  # @TODO: JLIFF (XLIFF on JSON) <https://github.com/oasis-tcs/xliff-omos-jliff>
  XLIFF:
    __meta:
      archivum_extensionem: .xlf
      situs_interretialis:
        referens_officinale:
          - <https://www.oasis-open.org/committees/xliff/>
        vicipaedia:
          - <https://en.wikipedia.org/wiki/XLIFF>
      exemplum:
        - <https://github.com/oasis-tcs/xliff-xliff-22>
        - <https://github.com/oasis-tcs/xliff-xliff-22/blob/master/xliff-21/test-suite/core/valid/allExtensions.xlf>
        - <https://github.com/oasis-tcs/xliff-xliff-22/blob/master/xliff-21/test-suite/core/valid/everything-core.xlf>
      normam:
        - <https://docs.oasis-open.org/xliff/xliff-core/v2.1/xliff-core-v2.1.html>
        # - <https://docs.oasis-open.org/xliff/xliff-core/v2.1/os/schemas/>
        # @see <https://github.com/redhat-developer/vscode-xml/wiki/XMLValidation#XML-catalog-with-XSD>
        # @see <https://github.com/redhat-developer/vscode-xml/issues/315>
        - <https://docs.oasis-open.org/xliff/xliff-core/v2.1/os/schemas/catalog.xml>
      nomen:
        eng-Latn: 'XML Localization Interchange File Format (XLIFF) v2.1'

    asa:
      modus_operandi:
        # - multiplum_linguam
        - bilingue

    de_xml:
      # This is a working draft
      # @see https://terminator.readthedocs.io/en/latest/tbx_conformance.html
      # ontologia libellam: I glossarium > II conceptum > III linguam > IV terminum
      glossarium_radicem:
        signum: xliff
        # Exemplum I: <xliff version="1.2">
        # Exemplum II: <xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
      glossarium_titulum: False

      # II conceptum
      conceptum_codicem:
        signum: unit
        de_attributum: id
        trivium:
          # de <xliff> ad <trans-unit>
          - file

      # III linguam
      linguam_codicem: False # XLIFF-obsoletum est bilingue

      linguam_fontem_codicem:
        # Exemplum: 'pt' ad '<source xml:lang="pt">por-Latn</source>''
        signum: source
        de_attributum: lang
        trivium: []

      linguam_objectivum_codicem:
        # Exemplum: 'es' ad '<target xml:lang="es">spa-Latn</target>''
        signum: target
        de_attributum: lang
        trivium: []

      # IV terminum

      terminum_accuratum: False # XLIFF terminum habendum accuratum? Falsum
      terminum_multum: False # XLIFF-obsoletum est bilingue
      terminum_habendum_fontem: True
      terminum_habendum_objectivum: True

      terminum_fontem_valorem:
        # Exemplum: 'por-Latn ad <source xml:lang="pt">por-Latn</source>
        signum: source
        # de_attributum: False
        trivium: []

      terminum_objectivum_valorem:
        # Exemplum: 'spa-Latn' ad <target xml:lang="es">spa-Latn</target>
        signum: target
        # de_attributum: False
        trivium: []

    formatum:

      # @see https://docs.oasis-open.org/xliff/xliff-core/v2.1/os/schemas/catalog.xml
      # @see https://docs.oasis-open.org/xliff/xliff-core/v2.1/os/schemas/xliff_core_2.0.xsd
      initiale: |2
        <?xml version="1.0"?>
        <xliff version="2.0"
          xmlns="urn:oasis:names:tc:xliff:document:2.0"
          xmlns:fs="urn:oasis:names:tc:xliff:fs:2.0"
          xmlns:val="urn:oasis:names:tc:xliff:validation:2.0"
          srcLang="{{ globum.fontem_linguam.bcp47 | default: 'la' }}"
          trgLang="{{ globum.objectivum_linguam.bcp47 | default: 'ar' }}">
          <file id="f1">

      corporeum: |2
            {% if rem.de_fontem_linguam -%}
            <unit id="{{ conceptum.codicem | default: rem.de_nomen_breve.conceptum_codicem | default: 'errorem' | replace: '*', '' | replace: '+', '' | replace: '/', '' }}">
              {% if rem.de_auxilium_linguam or rem.de_nomen_breve.referens_situs_interretialis.size > 0 -%}
              <notes>
                {%- for item in rem.de_auxilium_linguam -%}
                <note appliesTo="source" priority="3"
                  category="de_auxilium_linguam">
                  _[{{- item.linguam -}}]
                  {{- item.rem -}}
                  [{{- item.linguam -}}]_
                </note>
                {%- endfor %}
                {% for item in rem.de_nomen_breve.referens_situs_interretialis -%}
                <note appliesTo="source" priority="1"
                  category="referens_situs_interretialis">
                  {{ item }}
                </note>
                {% endfor -%}
              </notes>
              {% else -%}
              <!--
                non rem.de_auxilium_linguam aut rem.de_nomen_breve.referens_situs_interretialis
              -->
              {% endif -%}
              <segment state="{{ rem.de_objectivum_linguam.codicem_XLIFF | default: 'initial' }}">
                <source>{{ rem.de_fontem_linguam.rem }}</source>
                {%- if rem.de_objectivum_linguam and rem.de_objectivum_linguam.rem != '' %}
                <target>{{ rem.de_objectivum_linguam.rem }}</target>
                {%- else %}
                <!-- non rem.de_objectivum_linguam -->
                {%- endif  %}
              </segment>
            </unit>
            {%- else -%}
              <!-- non rem.de_fontem_linguam -->
            {%- endif  %}

      # <!-- {{ rem }} -->
      finale: |2
          </file>
        </xliff>
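
To make the shape of the output easier to see, here is a rough Python sketch that builds the same skeleton as the initiale / corporeum / finale templates above (without the notes). It does not use hxltmcli or Liquid; the function names `unit_id` and `build_xliff` and the input dictionaries are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:2.0"


def q(name):
    """Qualify a tag name with the XLIFF 2 namespace."""
    return "{%s}%s" % (XLIFF_NS, name)


def unit_id(raw):
    """Mirror the template filters: default 'errorem', then drop '*', '+', '/'."""
    raw = raw or "errorem"
    for char in "*+/":
        raw = raw.replace(char, "")
    return raw


def build_xliff(units, src_lang="la", trg_lang="ar"):
    """units: list of dicts with 'id', 'source', optional 'target' and 'state'."""
    ET.register_namespace("", XLIFF_NS)
    root = ET.Element(
        q("xliff"), {"version": "2.0", "srcLang": src_lang, "trgLang": trg_lang}
    )
    file_el = ET.SubElement(root, q("file"), {"id": "f1"})
    for unit in units:
        if not unit.get("source"):
            continue  # same idea as '{% if rem.de_fontem_linguam %}'
        unit_el = ET.SubElement(file_el, q("unit"), {"id": unit_id(unit.get("id"))})
        segment = ET.SubElement(
            unit_el, q("segment"), {"state": unit.get("state", "initial")}
        )
        ET.SubElement(segment, q("source")).text = unit["source"]
        if unit.get("target"):
            ET.SubElement(segment, q("target")).text = unit["target"]
    return ET.tostring(root, encoding="unicode")


xliff_text = build_xliff(
    [
        {"id": "L10N_ego_codicem", "source": "lat-Latn", "target": "arb-Arab",
         "state": "translated"},
        {"id": "a+b/c*", "source": "001"},
        {"id": "skipped", "source": ""},
    ]
)
print(xliff_text)
```

The template approach has the advantage that a new format only needs a new YAML entry, whereas a sketch like this one hard-codes the structure of a single format.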

The instructions above are for XLIFF 2; XLIFF 1 is another option. How to create other exporters/importers is not documented yet, but starting from the existing example that is closest to what you want works best. One of the biggest differences between formats is whether they are bilingual (like XLIFF and some common localization files) or multilingual (like TBX and TMX).
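
For the import direction, the de_xml mapping above boils down to "unit id from the id attribute, source term from source, target term from target". A minimal sketch of that idea in plain Python follows. It is not how hxltmdexml is implemented; the function name `read_xliff_units` and the sample document are assumptions.

```python
import xml.etree.ElementTree as ET

NS = {"x": "urn:oasis:names:tc:xliff:document:2.0"}

# Hypothetical sample, shaped like the output of the XLIFF 2 template.
SAMPLE = """
<xliff version="2.0" xmlns="urn:oasis:names:tc:xliff:document:2.0"
  srcLang="la" trgLang="ar">
  <file id="f1">
    <unit id="L10N_ego_codicem">
      <segment state="initial">
        <source>lat-Latn</source>
        <target>arb-Arab</target>
      </segment>
    </unit>
    <unit id="L10N_ego_scriptum_nomen">
      <segment state="initial">
        <source>Abecedarium Latinum</source>
      </segment>
    </unit>
  </file>
</xliff>
"""


def read_xliff_units(xml_text):
    """Return (source language, target language, rows) from an XLIFF 2 string."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for unit in root.iterfind("x:file/x:unit", NS):
        rows.append(
            {
                "id": unit.get("id"),
                "source": unit.findtext("x:segment/x:source", default="", namespaces=NS),
                # A missing <target> becomes an empty string.
                "target": unit.findtext("x:segment/x:target", default="", namespaces=NS),
            }
        )
    return root.get("srcLang"), root.get("trgLang"), rows


src_lang, trg_lang, rows = read_xliff_units(SAMPLE)
print(src_lang, trg_lang, rows)
```

Because XLIFF is bilingual, a single file can only fill two language columns of a multilingual table, so an importer for a multilingual format like TBX or TMX would need one more loop over languages.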

In future versions the syntax may change a bit, but HXL is already the best strategy for storing multilingual content for those who work with XLIFF. Most tools do not even allow managing more than one source language, so HXLTM (as a specialized tagging of HXL) now at least allows working with translations from and to an arbitrary number of source and target languages.
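
As a small illustration of that last point, one multilingual row can feed any number of bilingual exports. The sketch below is plain Python, not part of hxltmcli; the function name `bilingual_pairs` and the row layout (language code mapped to term) are assumptions.

```python
from itertools import permutations


def bilingual_pairs(row):
    """Yield (source language, target language, source term, target term)
    for every ordered pair of languages that has a term in this row."""
    filled = {lang: term for lang, term in row.items() if term}
    for src, trg in permutations(filled, 2):
        yield src, trg, filled[src], filled[trg]


# Terms taken from the L10N_ego_linguam_nomen example; the empty entry is made up.
row = {
    "lat-Latn": "Lingua Latina",
    "por-Latn": "Língua portuguesa",
    "eng-Latn": "English language",
    "slv-Latn": "",
}
pairs = list(bilingual_pairs(row))
for pair in pairs:
    print(pair)
```

Each yielded pair corresponds to one possible bilingual file (one srcLang/trgLang combination), while the HXLTM table itself stays the single multilingual source.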