acoli-repo / acoli-dicts

3000+ machine-readable open source dictionaries distributed by the Applied Computational Linguistics lab at the University of Augsburg, Germany, and by the research group Linked Open Dictionaries (LiODi, funded 2015-2020 by BMBF at Goethe University Frankfurt, Germany). All data provided in OntoLex-Lemon and TIAD-TSV.
Apache License 2.0

Apertium RDF - Tags embedded in complex <par> tags in source data lost during extraction #13

Open jubosgil opened 3 years ago

This goes back to an issue reported for the mapping (https://github.com/sid-unizar/apertium-lexinfo-mapping/issues/2), but it turns out to affect other tags as well. Discussed in message exchange with Max Ionov.

In the source data we have chunks such as

%Example for CA-IT
<pardef n="mil">
  <e>       <i><s n="num"/></i></e>
     <!--<e><p><l><b/>mil<s n="num"/></l><r>mila<s n="num"/></r></p></e> -->
</pardef>

And later in that same lexicon:

%Example for CA-IT
<e>       <p><l>tres</l>                               <r>tre</r></p><par n="mil"/></e>
<e>       <p><l>quatre</l>                             <r>quattro</r></p><par n="mil"/></e>
<e>       <p><l>cinc</l>                               <r>cinque</r></p><par n="mil"/></e>
<e>       <p><l>sis</l>                                <r>sei</r></p><par n="mil"/></e>
<e>       <p><l>set</l>                                <r>sette</r></p><par n="mil"/></e>
<e>       <p><l>vuit</l>                               <r>otto</r></p><par n="mil"/></e>
<e r="LR"><p><l>huit</l>                               <r>otto</r></p><par n="mil"/></e>

In the intermediate RDF, this leads to the entry tres being described with lexinfo:morphosyntacticProperty apertium:mil. However, mil seems to be a shorthand defined in this lexicon for a bundle of information (including, for example, the Apertium tag num), not a tag belonging to the set of Apertium tags that we are mapping to LexInfo (in contrast to num, which is a lexinfo:numeral).
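To make the distinction concrete, here is a minimal Python sketch of how a `<par n="..."/>` reference could be resolved to the tags embedded in its `<pardef>`. It is not the extraction code used in this repository. The `<dictionary>`, `<pardefs>` and `<section>` wrappers and the function names `embedded_tags` and `entry_tags` are assumptions added for illustration; only the `<pardef>` and `<e>` lines come from the CA-IT example above.

```python
import xml.etree.ElementTree as ET

# Trimmed CA-IT sample; the wrapper elements are assumed for well-formedness.
DIX = """<dictionary>
  <pardefs>
    <pardef n="mil">
      <e><i><s n="num"/></i></e>
    </pardef>
  </pardefs>
  <section id="main" type="standard">
    <e><p><l>tres</l><r>tre</r></p><par n="mil"/></e>
    <e r="LR"><p><l>huit</l><r>otto</r></p><par n="mil"/></e>
  </section>
</dictionary>"""


def embedded_tags(root, par_name):
    """Return the distinct <s n="..."/> tags inside the named <pardef>, in document order."""
    tags = []
    for pardef in root.iter("pardef"):
        if pardef.get("n") != par_name:
            continue
        for s in pardef.iter("s"):
            if s.get("n") not in tags:
                tags.append(s.get("n"))
    return tags


def entry_tags(root):
    """Map each left-hand lemma to the tags reachable through its <par> references."""
    result = {}
    for e in root.findall("./section/e"):
        lemma = "".join(e.find("p/l").itertext())
        tags = []
        for par in e.findall("par"):
            tags.extend(embedded_tags(root, par.get("n")))
        result[lemma] = tags
    return result


root = ET.fromstring(DIX)
print(entry_tags(root))
```

With this kind of lookup, tres would be associated with num (which has a LexInfo mapping) rather than with the paradigm name mil.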

I have checked more dictionaries, and this happens often. Here are some examples to give a better idea:

%%EN-GL (8 cases):
<pardef n="twenty_hundred__num"> ..
<pardef n="two__num"> ... 
<pardef n="two(1)__num"> ... 
<pardef n="three__num"> ... 
<pardef n="three(1)__num"> ... 
<pardef n="twenty__num"> ... 
<pardef n="one__num"> ... 
<pardef n="thirty_hundred__num"> .. 

%%OCI-FR (34 cases):
      <pardef n="sp_ND">
      <e r="LR"><p><l><s n="sp"/></l><r><s n="ND"/></r></p></e>
      <e r="RL"><p><l><s n="sp"/></l><r><s n="sg"/></r></p></e>
      <e r="RL"><p><l><s n="sp"/></l><r><s n="pl"/></r></p></e>
    </pardef>
      <pardef n="mf_GD"> ...
      <pardef n="ND_sp"> ...
      <pardef n="GD_mf"> .. 
       ... 
%%RO-CA (44 cases): 
      <pardef n="4-4__adj"> ..
      <pardef n="4-3sg__adj"> .. 
      <pardef n="4-3pl__adj"> .. 
      ... 
%%ES-CA (44 cases): 
      <pardef n="hd-hour">
        <e><re>[0-9]</re></e>
        <e><re>[01][0-9]</re></e>
        <e><re>2[0-3]</re></e>
      </pardef>
      <pardef n="ordinals"> [...]
      <pardef n="sp_ND"> [...]
      <pardef n="pl_ND"> [...]
      <pardef n="sgpl_sgpl">
       <pardef n="m_GD">
      <e r="LR"><p><l><s n="m"/></l><r><s n="GD"/></r></p></e>
      <e r="RL"><p><l><s n="m"/></l><r><s n="f"/></r></p></e>
      <e r="RL"><p><l><s n="m"/></l><r><s n="m"/></r></p></e>
    </pardef>
    ....

 %%PT-CA (32 cases): 
    <pardef n="sp_ND"> ...
    [...]
    <pardef n="sg_ND"> ... 

Since the "embedded" tags are not accessed during extraction, the final RDF retains the complex/shorthand ones (e.g. apertium:4-3pl__adj, apertium:mil, etc.) without a mapping.
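Paradigms such as sp_ND are harder than mil, because the left and right sides carry different tags and each row is restricted to one translation direction. The following Python sketch shows one possible way to read the embedded tags per side. It is only an illustration under assumptions: the function name `side_tags` and the label "both" for rows without an `r` attribute are hypothetical, and the sample is the OCI-FR `sp_ND` paradigm quoted above.

```python
import xml.etree.ElementTree as ET

# The OCI-FR sp_ND paradigm from the examples above.
PARDEF = """<pardef n="sp_ND">
  <e r="LR"><p><l><s n="sp"/></l><r><s n="ND"/></r></p></e>
  <e r="RL"><p><l><s n="sp"/></l><r><s n="sg"/></r></p></e>
  <e r="RL"><p><l><s n="sp"/></l><r><s n="pl"/></r></p></e>
</pardef>"""


def side_tags(pardef):
    """Return one (direction, left_tags, right_tags) tuple per <e> row of a <pardef>."""
    rows = []
    for e in pardef.findall("e"):
        # Tags inside <i> apply to both sides (as in the CA-IT "mil" paradigm).
        shared = [s.get("n") for s in e.findall("i/s")]
        left = shared + [s.get("n") for s in e.findall("p/l/s")]
        right = shared + [s.get("n") for s in e.findall("p/r/s")]
        # "both" is a placeholder label for rows without an r attribute.
        rows.append((e.get("r", "both"), left, right))
    return rows


rows = side_tags(ET.fromstring(PARDEF))
for direction, left, right in rows:
    print(direction, left, right)
```

Keeping the direction and the side separate avoids attaching, for instance, ND to the left-hand entry when it only appears on the right-hand side of an LR row.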