apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

Weights are ignored in monolingual dictionary entries #44

Open marcriera opened 5 years ago

marcriera commented 5 years ago

Given the following paradigms and entries:

<pardef n="liv/e__vblex">
  <e>       <p><l>e</l>         <r>e<s n="vblex"/><s n="inf"/></r></p></e>
  <e>       <p><l>e</l>         <r>e<s n="vblex"/><s n="imp"/></r></p></e>
  <e>       <p><l>ed</l>        <r>e<s n="vblex"/><s n="pp"/></r></p></e>
  <e w="1"> <p><l>ing</l>       <r>e<s n="vblex"/><s n="pprs"/></r></p></e>
  <e w="3"> <p><l>ing</l>       <r>e<s n="vblex"/><s n="ger"/></r></p></e>
  <e w="2"> <p><l>ing</l>       <r>e<s n="vblex"/><s n="subs"/></r></p></e>
  <e>       <p><l>e</l>         <r>e<s n="vblex"/><s n="pres"/></r></p></e>
  <e>       <p><l>es</l>        <r>e<s n="vblex"/><s n="pres"/><s n="p3"/><s n="sg"/></r></p></e>
  <e>       <p><l>ed</l>        <r>e<s n="vblex"/><s n="past"/></r></p></e>
</pardef>
<pardef n="house__n">
  <e>       <p><l></l>          <r><s n="n"/><s n="sg"/></r></p></e>
  <e r="RL"><p><l>'s</l>        <r><s n="n"/><s n="sg"/><j/>'s<s n="gen"/></r></p></e>
  <e>       <p><l>s</l>         <r><s n="n"/><s n="pl"/></r></p></e>
  <e r="RL"><p><l>s'</l>        <r><s n="n"/><s n="pl"/><j/>'s<s n="gen"/></r></p></e>
</pardef>
<e lm="house" w="1">     <i>house</i><par n="house__n"/></e>
<e lm="house" w="2">     <i>hous</i><par n="liv/e__vblex"/></e>

lt-proc seems to ignore the weights for the entries:

$ echo "house" | lt-proc -wW eng-cat.automorf.bin
^house/house<n><sg><W:0.000000>/house<vblex><inf><W:0.000000>/house<vblex><pres><W:0.000000>/house<vblex><imp><W:0.000000>$

The expected result would be:

$ echo "house" | lt-proc -wW eng-cat.automorf.bin
^house/house<n><sg><W:1.000000>/house<vblex><inf><W:2.000000>/house<vblex><pres><W:2.000000>/house<vblex><imp><W:2.000000>$

However, the weights work fine when they are used inside a paradigm:

$ echo "housing" | lt-proc -wW eng-cat.automorf.bin
^housing/housing<n><sg><W:0.000000>/house<vblex><pprs><W:1.000000>/house<vblex><subs><W:2.000000>/house<vblex><ger><W:3.000000>$

unhammer commented 5 years ago

@Techievena

Techievena commented 5 years ago

@unhammer I will definitely look into it.

AMR-KELEG commented 5 years ago

I might be facing the same problem. I am using an input written in .att format to generate a weighted transducer.

0       1       c       c       0.000000
1       2       a       a       0.000000
2       3       t       t       0.000000
3       4       @0@     <n>     0.000000
3       5       s       <n>     0.000000
4       2.000000
5       6       @0@     <pl>    0.000000
6       1.000000

I generate the transducer using lt-comp lr in.att apert_model. The output of lt-print apert_model is:

0       1       c       c       0.000000
1       2       a       a       0.000000
2       3       t       t       0.000000
3       4       ε       <n>     0.000000
3       5       s       <n>     0.000000
4       7       ε       ε       2.000000
5       6       ε       <pl>    0.000000
6       7       ε       ε       1.000000
7       0.000000

which seems to be correct.

However, echo 'cat' | lt-proc apert_model -W seems to ignore the weights; its output is: ^cat/cat<n><W:0.000000>$
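
For comparison, the weight one would expect for 'cat' can be checked by hand by summing the arc weights and the final weight along each accepting path of the .att input above. The following is only a minimal Python sketch of that check, assuming additive (tropical) weights and no epsilon cycles; parse_att and analyses are hypothetical helper names and are not part of lttoolbox.

```python
from collections import defaultdict

EPS = "@0@"  # epsilon symbol as written in the .att input above

ATT = """
0 1 c c 0.000000
1 2 a a 0.000000
2 3 t t 0.000000
3 4 @0@ <n> 0.000000
3 5 s <n> 0.000000
4 2.000000
5 6 @0@ <pl> 0.000000
6 1.000000
"""

def parse_att(text):
    """Split ATT text into arcs (5 fields) and final states (1 or 2 fields)."""
    arcs = defaultdict(list)  # state -> [(target, input, output, weight)]
    finals = {}               # state -> final weight
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) == 5:
            src, dst, isym, osym, weight = fields
            arcs[src].append((dst, isym, osym, float(weight)))
        elif len(fields) in (1, 2):
            finals[fields[0]] = float(fields[1]) if len(fields) == 2 else 0.0
    return arcs, finals

def analyses(arcs, finals, word, start="0"):
    """Return {analysis: lowest total weight} for every path that accepts word."""
    results = {}
    stack = [(start, 0, "", 0.0)]  # (state, input position, output so far, weight so far)
    while stack:
        state, pos, out, weight = stack.pop()
        if pos == len(word) and state in finals:
            total = weight + finals[state]
            results[out] = min(total, results.get(out, float("inf")))
        for dst, isym, osym, arc_weight in arcs.get(state, []):
            emitted = "" if osym == EPS else osym
            if isym == EPS:
                stack.append((dst, pos, out + emitted, weight + arc_weight))
            elif pos < len(word) and word[pos] == isym:
                stack.append((dst, pos + 1, out + emitted, weight + arc_weight))
    return results

arcs, finals = parse_att(ATT)
print(analyses(arcs, finals, "cat"))   # {'cat<n>': 2.0}
print(analyses(arcs, finals, "cats"))  # {'cat<n><pl>': 1.0}
```

Under these assumptions the analysis cat<n> should carry weight 2, not 0.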

AMR-KELEG commented 5 years ago

I think the bug might be related to this line and its following lines: https://github.com/apertium/lttoolbox/blob/f73c54162cc8ca1d9f70486b051165af1a7bf7cb/lttoolbox/state.cc#L607

TinoDidriksen commented 5 years ago

I guess editing the comment on #49 to remove "Fix #44" was not enough to make GitHub understand it was not a closing merge.

AMR-KELEG commented 5 years ago

@MarcRiera I think the bug is with the lt-comp command. Is lt-comp used in the apertium-eng to compile the dictionary?

I have prepared a sample dictionary:

<dictionary>
  <alphabet>ÀÁÂÄÆÇÈÉÊËÌÍÎÏÑÒÓÔÖÙÚÛÜàáâäçèéêëìíîïñòóôöùúûüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>
  <sdefs>
    <sdef n="n"   c="Noun"/>  
    <sdef n="vblex"   c="Verb"/> 
    <sdef n="p1"  c="First person"/> 
    <sdef n="p3"  c="Third person"/> 
    <sdef n="sg"  c="Singular"/> 
    <sdef n="pl"  c="Plural"/> 
    <sdef n="pres"  c="Present (tense)"/> 
    <sdef n="past"  c="Past"/> 
    <sdef n="imp"   c="Imperative"/> 
    <sdef n="inf"   c="Infinitive"/> 
    <sdef n="pp"  c="Past participle"/> 
    <sdef n="subs"  c="Verbal noun"/> 
    <sdef n="pprs"  c="Present participle"/> 
    <sdef n="ger"   c="Gerund"/> 
  </sdefs>
  <pardefs>
    <pardef n="liv/e__vblex">
      <e>       <p><l>e</l>         <r>e<s n="vblex"/><s n="inf"/></r></p></e>
      <e>       <p><l>e</l>         <r>e<s n="vblex"/><s n="imp"/></r></p></e>
      <e>       <p><l>ed</l>        <r>e<s n="vblex"/><s n="pp"/></r></p></e>
      <e w="1"> <p><l>ing</l>       <r>e<s n="vblex"/><s n="pprs"/></r></p></e>
      <e w="3"> <p><l>ing</l>       <r>e<s n="vblex"/><s n="ger"/></r></p></e>
      <e w="2"> <p><l>ing</l>       <r>e<s n="vblex"/><s n="subs"/></r></p></e>
      <e>       <p><l>e</l>         <r>e<s n="vblex"/><s n="pres"/></r></p></e>
      <e>       <p><l>es</l>        <r>e<s n="vblex"/><s n="pres"/><s n="p3"/><s n="sg"/></r></p></e>
      <e>       <p><l>ed</l>        <r>e<s n="vblex"/><s n="past"/></r></p></e>
    </pardef>
    <pardef n="house__n">
      <e>       <p><l></l>          <r><s n="n"/><s n="sg"/></r></p></e>
      <e r="RL"><p><l>'s</l>        <r><s n="n"/><s n="sg"/><j/>'s<s n="gen"/></r></p></e>
      <e>       <p><l>s</l>         <r><s n="n"/><s n="pl"/></r></p></e>
      <e r="RL"><p><l>s'</l>        <r><s n="n"/><s n="pl"/><j/>'s<s n="gen"/></r></p></e>
    </pardef>
  </pardefs>
<section id="main" type="standard">
  <e lm="house" w="1">     <i>house</i><par n="house__n"/></e>
  <e lm="house" w="2">     <i>hous</i><par n="liv/e__vblex"/></e>
</section>
</dictionary>

And the output transducer isn't correct:

0   1   h   h   0.000000    
1   2   o   o   0.000000    
2   3   u   u   0.000000    
3   4   s   s   0.000000    
4   5   e   e   0.000000     # THIS EDGE SHOULD HAVE WEIGHT=2
4   6   e   e   1.000000 # THIS EDGE HAS THE CORRECT WEIGHT
4   7   i   e   0.000000    
5   8   ε   <vblex> 0.000000    
5   9   d   <vblex> 0.000000    
5   10  s   <vblex> 0.000000    
6   11  ε   <n> 0.000000    
6   12  s   <n> 0.000000    
7   13  n   <vblex> 0.000000    
8   14  ε   <inf>   0.000000    
8   14  ε   <imp>   0.000000    
8   14  ε   <pres>  0.000000    
9   14  ε   <pp>    0.000000    
9   14  ε   <past>  0.000000    
10  15  ε   <pres>  0.000000    
11  14  ε   <sg>    0.000000    
12  14  ε   <pl>    0.000000    
13  14  g   <pprs>  1.000000    
13  14  g   <ger>   3.000000    
13  14  g   <subs>  2.000000    
15  11  ε   <p3>    0.000000    
14  0.000000

When I use the command echo "house" | lt-proc house.bin -W, only the noun analysis gets the correct weight:

^house/house<vblex><inf><W:0.000000>/house<vblex><imp><W:0.000000>/house<vblex><pres><W:0.000000>/house<n><sg><W:1.000000>$

flammie commented 5 years ago

The correct weighting here is not trivial (so something seems to be wrong in the compilation step too). Keep in mind that the prefix "hous" is shared by both the verb and the noun, and that the verb needs its weight of 2 for "housing" as well, which does not go through the "4 5 e e" arc.
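
To make the target concrete, here is a rough Python sketch of the weights one would expect if the weight on a dictionary entry is simply added to the weight of each paradigm entry it expands to. That additive behaviour is an assumption, not something taken from the lttoolbox code; the weights are copied from the .dix posted above, the RL-only entries are left out, and PARADIGMS, ENTRIES and expected are hypothetical names.

```python
# (surface suffix, analysis suffix, weight) for the paradigm entries relevant here
PARADIGMS = {
    "house__n": [
        ("", "<n><sg>", 0.0),
        ("s", "<n><pl>", 0.0),
    ],
    "liv/e__vblex": [
        ("e", "e<vblex><inf>", 0.0),
        ("e", "e<vblex><imp>", 0.0),
        ("e", "e<vblex><pres>", 0.0),
        ("ing", "e<vblex><pprs>", 1.0),
        ("ing", "e<vblex><ger>", 3.0),
        ("ing", "e<vblex><subs>", 2.0),
    ],
}

# (stem, paradigm name, entry weight) from the main section of the dictionary
ENTRIES = [
    ("house", "house__n", 1.0),
    ("hous", "liv/e__vblex", 2.0),
]

def expected(surface):
    """Return {analysis: entry weight + paradigm weight} for one surface form."""
    out = {}
    for stem, paradigm, entry_weight in ENTRIES:
        for suffix, tags, paradigm_weight in PARADIGMS[paradigm]:
            if stem + suffix == surface:
                out[stem + tags] = entry_weight + paradigm_weight
    return out

print(expected("house"))
print(expected("housing"))
```

With these assumptions every verb reading of "house" gets 2 and the three readings of "housing" get 3, 5 and 4 (pprs, ger, subs), so the entry weight has to end up somewhere that both the "e" path and the "i" path pass through.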

Here's the hfst + lexc equivalent for reference:

$ cat house.lexc
Multichar_Symbols
%<n%>
%<vblex%>
%<p1%>
%<p3%>
%<sg%>
%<pl%>
%<pres%>
%<past%>
%<imp%>
%<inf%>
%<pp%>
%<subs%>
%<pprs%>
%<ger%>
%<gen%>

LEXICON Root

house:house house__n "weight: 1" ;
hous:hous liv/e__vblex "weight: 2" ;

LEXICON liv/e__vblex

e%<vblex%>%<inf%>:e # ;
e%<vblex%>%<imp%>:e # ;
e%<vblex%>%<pp%>:ed # ;
e%<vblex%>%<pprs%>:ing # "weight: 1" ;
e%<vblex%>%<ger%>:ing  # "weight: 2" ;
e%<vblex%>%<subs%>:ing  # "weight: 3" ;
e%<vblex%>%<pres%>:e # ;
e%<vblex%>%<pres%>%<p3%>%<sg%>:es # ;
e%<vblex%>%<past%>:ed # ;

LEXICON house__n

%<n%>%<sg%>:0  # ;
%<n%>%<sg%>+'s%<gen%>:'s  # ;
%<n%>%<pl%>:s  # ;
%<n%>%<pl%>+'s%<gen%>:s'  # ;

$ hfst-lexc house.lexc | hfst-fst2txt
hfst-lexc: warning: Defaulting to OpenFst tropical type
Root...2 liv/e__vblex...9 house__n...
0   1   h   h   1.000000
1   2   o   o   0.000000
2   3   u   u   0.000000
3   4   s   s   0.000000
4   5   e   i   2.000000
4   6   e   e   0.000000
5   7   <vblex> n   0.000000
6   8   <n> @0@ 0.000000
6   9   <n> s   0.000000
6   10  <n> '   0.000000
6   11  <vblex> @0@ 1.000000
6   12  <vblex> s   1.000000
6   13  <vblex> d   1.000000
7   14  <subs>  g   2.000000
7   14  <ger>   g   1.000000
7   14  <pprs>  g   0.000000
8   14  <sg>    @0@ 0.000000
9   14  <pl>    @0@ 0.000000
9   15  <pl>    '   0.000000
10  15  <sg>    s   0.000000
11  14  <pres>  @0@ 0.000000
11  14  <imp>   @0@ 0.000000
11  14  <inf>   @0@ 0.000000
12  16  <pres>  @0@ 0.000000
13  14  <past>  @0@ 0.000000
13  14  <pp>    @0@ 0.000000
14  0.000000
15  17  +   @0@ 0.000000
16  8   <p3>    @0@ 0.000000
17  18  '   @0@ 0.000000
18  19  s   @0@ 0.000000
19  14  <gen>   @0@ 0.000000

$ hfst-lexc house.lexc | hfst-fst2strings -w
hfst-lexc: warning: Defaulting to OpenFst tropical type
Root...2 liv/e__vblex...9 house__n...
house<vblex><subs>:housing  5
house<vblex><ger>:housing   4
house<vblex><pprs>:housing  3
house<n><sg>:house  1
house<n><pl>:houses 1
house<n><pl>+'s<gen>:houses'    1
house<n><sg>+'s<gen>:house's    1
house<vblex><pres>:house    2
house<vblex><imp>:house 2
house<vblex><inf>:house 2
house<vblex><pres><p3><sg>:houses   2
house<vblex><past>:housed   2

Nonetheless, on the lt-proc side at least a bit more of the weight should be getting accumulated :-/

unhammer commented 5 years ago

Is lt-comp used in the apertium-eng to compile the dictionary?

it is

mr-martian commented 2 years ago

I believe the issue here is that Transducer::closure() disregards weight and as a result determinize() and minimize() lose any weights which are on epsilon transitions.
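
As an illustration of what a weight-aware closure would need to keep track of, here is a small Python sketch. It is not the lttoolbox implementation: it assumes non-negative additive weights, uses a Dijkstra-style search, and the function name and the EPS_ARCS table are hypothetical. The table holds only the two ε:ε arcs from the lt-print output of the 'cat' example earlier in this thread.

```python
import heapq

def weighted_epsilon_closure(eps_arcs, start):
    """Return {state: lowest total weight} for states reachable from start via epsilon arcs."""
    best = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        weight, state = heapq.heappop(heap)
        if weight > best[state]:
            continue  # stale heap entry; a cheaper path was already found
        for target, arc_weight in eps_arcs.get(state, []):
            new_weight = weight + arc_weight
            if new_weight < best.get(target, float("inf")):
                best[target] = new_weight
                heapq.heappush(heap, (new_weight, target))
    return best

# epsilon:epsilon arcs of the 'cat' transducer: 4 -> 7 (2.0) and 6 -> 7 (1.0)
EPS_ARCS = {4: [(7, 2.0)], 6: [(7, 1.0)]}

print(weighted_epsilon_closure(EPS_ARCS, 4))  # {4: 0.0, 7: 2.0}
print(weighted_epsilon_closure(EPS_ARCS, 6))  # {6: 0.0, 7: 1.0}
```

A closure that returns only the set {4, 7} has nowhere to record the 2.0, which would be consistent with the W:0.000000 seen for cat<n>.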

xavivars commented 1 year ago

@mr-martian, it seems at some point you attempted to fix it, but then had to revert. Any idea on what needs to be done?

mr-martian commented 1 year ago

The issue is that FST minimization was written for unweighted automata, and when weight support was added, closure() and/or minimize() were updated incorrectly. My first attempt at fixing it failed, so someone who knows FST algorithms better than I do needs to go through that code.