Open marcriera opened 5 years ago
@Techievena
@unhammer I will definitely look into it.
I might be facing the same problem. I am using an input written in .att format to generate a weighted transducer.
0 1 c c 0.000000
1 2 a a 0.000000
2 3 t t 0.000000
3 4 @0@ <n> 0.000000
3 5 s <n> 0.000000
4 2.000000
5 6 @0@ <pl> 0.000000
6 1.000000
I generate the transducer using lt-comp lr in.att apert_model.
The output of lt-print apert_model is:
0 1 c c 0.000000
1 2 a a 0.000000
2 3 t t 0.000000
3 4 ε <n> 0.000000
3 5 s <n> 0.000000
4 7 ε ε 2.000000
5 6 ε <pl> 0.000000
6 7 ε ε 1.000000
7 0.000000
which seems to be correct.
However, the output of echo 'cat' | lt-proc apert_model -W seems to ignore the weights:
^cat/cat<n><W:0.000000>$
I think the bug might be related to this line and its following lines: https://github.com/apertium/lttoolbox/blob/f73c54162cc8ca1d9f70486b051165af1a7bf7cb/lttoolbox/state.cc#L607
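For comparison, here is a minimal sketch of the weight one would expect for this input. It assumes tropical-style weights (add along a path, take the minimum across paths with the same analysis), single-character input symbols, and no epsilon loops. It is a reference calculation over the .att text only, not how lt-proc itself is implemented; parse_att and analyses are hypothetical helper names.

```python
from collections import defaultdict

EPS = "@0@"  # epsilon symbol as written in the .att input

# The .att input from this comment, copied verbatim.
ATT = """\
0 1 c c 0.000000
1 2 a a 0.000000
2 3 t t 0.000000
3 4 @0@ <n> 0.000000
3 5 s <n> 0.000000
4 2.000000
5 6 @0@ <pl> 0.000000
6 1.000000
"""


def parse_att(text):
    """Return (arcs, finals) from AT&T text: 5 fields = arc, 2 fields = final state."""
    arcs = defaultdict(list)
    finals = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            finals[int(fields[0])] = float(fields[1])
        elif len(fields) == 5:
            src, dst, sym_in, sym_out, weight = fields
            arcs[int(src)].append((int(dst), sym_in, sym_out, float(weight)))
    return arcs, finals


def analyses(text, surface):
    """Map each analysis of `surface` to its cheapest total weight."""
    arcs, finals = parse_att(text)
    results = {}
    stack = [(0, 0, "", 0.0)]  # (state, input position, output so far, weight so far)
    while stack:
        state, pos, out, weight = stack.pop()
        if pos == len(surface) and state in finals:
            total = weight + finals[state]  # the final-state weight counts too
            results[out] = min(total, results.get(out, float("inf")))
        for dst, sym_in, sym_out, arc_weight in arcs[state]:
            emitted = "" if sym_out == EPS else sym_out
            if sym_in == EPS:
                stack.append((dst, pos, out + emitted, weight + arc_weight))
            elif pos < len(surface) and surface[pos] == sym_in:
                stack.append((dst, pos + 1, out + emitted, weight + arc_weight))
    return results


print(analyses(ATT, "cat"))
print(analyses(ATT, "cats"))
```

Under these assumptions the only analysis of "cat" is cat<n> with weight 2.0, which comes entirely from the final weight of state 4, so an output of W:0.000000 means that final weight is being dropped somewhere.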
I guess editing the comment on #49 to remove "Fix #44" was not enough to make GitHub understand it was not a closing merge.
@MarcRiera I think the bug is with the lt-comp command. Is lt-comp used in the apertium-eng to compile the dictionary?
I have prepared a sample dictionary:
<dictionary>
<alphabet>ÀÁÂÄÆÇÈÉÊËÌÍÎÏÑÒÓÔÖÙÚÛÜàáâäçèéêëìíîïñòóôöùúûüABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz</alphabet>
<sdefs>
<sdef n="n" c="Noun"/>
<sdef n="vblex" c="Verb"/>
<sdef n="p1" c="First person"/>
<sdef n="p3" c="Third person"/>
<sdef n="sg" c="Singular"/>
<sdef n="pl" c="Plural"/>
<sdef n="pres" c="Present (tense)"/>
<sdef n="past" c="Past"/>
<sdef n="imp" c="Imperative"/>
<sdef n="inf" c="Infinitive"/>
<sdef n="pp" c="Past participle"/>
<sdef n="subs" c="Verbal noun"/>
<sdef n="pprs" c="Present participle"/>
<sdef n="ger" c="Gerund"/>
</sdefs>
<pardefs>
<pardef n="liv/e__vblex">
<e> <p><l>e</l> <r>e<s n="vblex"/><s n="inf"/></r></p></e>
<e> <p><l>e</l> <r>e<s n="vblex"/><s n="imp"/></r></p></e>
<e> <p><l>ed</l> <r>e<s n="vblex"/><s n="pp"/></r></p></e>
<e w="1"> <p><l>ing</l> <r>e<s n="vblex"/><s n="pprs"/></r></p></e>
<e w="3"> <p><l>ing</l> <r>e<s n="vblex"/><s n="ger"/></r></p></e>
<e w="2"> <p><l>ing</l> <r>e<s n="vblex"/><s n="subs"/></r></p></e>
<e> <p><l>e</l> <r>e<s n="vblex"/><s n="pres"/></r></p></e>
<e> <p><l>es</l> <r>e<s n="vblex"/><s n="pres"/><s n="p3"/><s n="sg"/></r></p></e>
<e> <p><l>ed</l> <r>e<s n="vblex"/><s n="past"/></r></p></e>
</pardef>
<pardef n="house__n">
<e> <p><l></l> <r><s n="n"/><s n="sg"/></r></p></e>
<e r="RL"><p><l>'s</l> <r><s n="n"/><s n="sg"/><j/>'s<s n="gen"/></r></p></e>
<e> <p><l>s</l> <r><s n="n"/><s n="pl"/></r></p></e>
<e r="RL"><p><l>s'</l> <r><s n="n"/><s n="pl"/><j/>'s<s n="gen"/></r></p></e>
</pardef>
</pardefs>
<section id="main" type="standard">
<e lm="house" w="1"> <i>house</i><par n="house__n"/></e>
<e lm="house" w="2"> <i>hous</i><par n="liv/e__vblex"/></e>
</section>
</dictionary>
And the output transducer isn't correct:
0 1 h h 0.000000
1 2 o o 0.000000
2 3 u u 0.000000
3 4 s s 0.000000
4 5 e e 0.000000 # THIS EDGE SHOULD HAVE WEIGHT=2
4 6 e e 1.000000 # THIS EDGE HAS THE CORRECT WEIGHT
4 7 i e 0.000000
5 8 ε <vblex> 0.000000
5 9 d <vblex> 0.000000
5 10 s <vblex> 0.000000
6 11 ε <n> 0.000000
6 12 s <n> 0.000000
7 13 n <vblex> 0.000000
8 14 ε <inf> 0.000000
8 14 ε <imp> 0.000000
8 14 ε <pres> 0.000000
9 14 ε <pp> 0.000000
9 14 ε <past> 0.000000
10 15 ε <pres> 0.000000
11 14 ε <sg> 0.000000
12 14 ε <pl> 0.000000
13 14 g <pprs> 1.000000
13 14 g <ger> 3.000000
13 14 g <subs> 2.000000
15 11 ε <p3> 0.000000
14 0.000000
When I run echo "house" | lt-proc house.bin -W, only the noun analysis gets the correct weight:
^house/house<vblex><inf><W:0.000000>/house<vblex><imp><W:0.000000>/house<vblex><pres><W:0.000000>/house<n><sg><W:1.000000>$
The correct weighting here is not trivial (so there seems to be something wrong in the compilation part too). Keep in mind that the prefix "hous" is shared by both the verb and the noun, and the verb, which needs that weight of 2, also needs it for "housing", which does not go through the "4 5 e e" arc.
Here's the hfst + lexc equivalent for reference:
$ cat house.lexc
Multichar_Symbols
%<n%>
%<vblex%>
%<p1%>
%<p3%>
%<sg%>
%<pl%>
%<pres%>
%<past%>
%<imp%>
%<inf%>
%<pp%>
%<subs%>
%<pprs%>
%<ger%>
%<gen%>
LEXICON Root
house:house house__n "weight: 1" ;
hous:hous liv/e__vblex "weight: 2" ;
LEXICON liv/e__vblex
e%<vblex%>%<inf%>:e # ;
e%<vblex%>%<imp%>:e # ;
e%<vblex%>%<pp%>:ed # ;
e%<vblex%>%<pprs%>:ing # "weight: 1" ;
e%<vblex%>%<ger%>:ing # "weight: 2" ;
e%<vblex%>%<subs%>:ing # "weight: 3" ;
e%<vblex%>%<pres%>:e # ;
e%<vblex%>%<pres%>%<p3%>%<sg%>:es # ;
e%<vblex%>%<past%>:ed # ;
LEXICON house__n
%<n%>%<sg%>:0 # ;
%<n%>%<sg%>+'s%<gen%>:'s # ;
%<n%>%<pl%>:s # ;
%<n%>%<pl%>+'s%<gen%>:s' # ;
$ hfst-lexc house.lexc | hfst-fst2txt
hfst-lexc: warning: Defaulting to OpenFst tropical type
Root...2 liv/e__vblex...9 house__n...
0 1 h h 1.000000
1 2 o o 0.000000
2 3 u u 0.000000
3 4 s s 0.000000
4 5 e i 2.000000
4 6 e e 0.000000
5 7 <vblex> n 0.000000
6 8 <n> @0@ 0.000000
6 9 <n> s 0.000000
6 10 <n> ' 0.000000
6 11 <vblex> @0@ 1.000000
6 12 <vblex> s 1.000000
6 13 <vblex> d 1.000000
7 14 <subs> g 2.000000
7 14 <ger> g 1.000000
7 14 <pprs> g 0.000000
8 14 <sg> @0@ 0.000000
9 14 <pl> @0@ 0.000000
9 15 <pl> ' 0.000000
10 15 <sg> s 0.000000
11 14 <pres> @0@ 0.000000
11 14 <imp> @0@ 0.000000
11 14 <inf> @0@ 0.000000
12 16 <pres> @0@ 0.000000
13 14 <past> @0@ 0.000000
13 14 <pp> @0@ 0.000000
14 0.000000
15 17 + @0@ 0.000000
16 8 <p3> @0@ 0.000000
17 18 ' @0@ 0.000000
18 19 s @0@ 0.000000
19 14 <gen> @0@ 0.000000
$ hfst-lexc house.lexc | hfst-fst2strings -w
hfst-lexc: warning: Defaulting to OpenFst tropical type
Root...2 liv/e__vblex...9 house__n...
house<vblex><subs>:housing 5
house<vblex><ger>:housing 4
house<vblex><pprs>:housing 3
house<n><sg>:house 1
house<n><pl>:houses 1
house<n><pl>+'s<gen>:houses' 1
house<n><sg>+'s<gen>:house's 1
house<vblex><pres>:house 2
house<vblex><imp>:house 2
house<vblex><inf>:house 2
house<vblex><pres><p3><sg>:houses 2
house<vblex><past>:housed 2
Nonetheless, for the lt-proc part there should be at least a bit more of the weight accumulated :-/
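To make the HFST numbers easier to follow, here is a small sketch that sums arc weights along three paths of the hfst-fst2txt listing above. The arc table is copied from that listing (only the arcs on these three paths, keyed by the analysis-side symbol); path_weight is a hypothetical helper, and the sketch assumes weights simply add along a path, as in the tropical type that hfst-lexc reports.

```python
# Arc weights copied from the hfst-fst2txt listing, keyed by
# (source state, target state, analysis-side symbol).
ARCS = {
    (0, 1, "h"): 1.0,
    (1, 2, "o"): 0.0,
    (2, 3, "u"): 0.0,
    (3, 4, "s"): 0.0,
    (4, 5, "e"): 2.0,        # e:i, the "housing" branch
    (4, 6, "e"): 0.0,        # e:e, shared by noun and verb "house"
    (5, 7, "<vblex>"): 0.0,
    (7, 14, "<subs>"): 2.0,
    (6, 8, "<n>"): 0.0,
    (8, 14, "<sg>"): 0.0,
    (6, 11, "<vblex>"): 1.0,
    (11, 14, "<inf>"): 0.0,
}
FINAL = {14: 0.0}  # final state and its weight


def path_weight(states, symbols):
    """Sum the arc weights along a state sequence, plus the final weight."""
    total = 0.0
    for src, dst, sym in zip(states, states[1:], symbols):
        total += ARCS[(src, dst, sym)]
    return total + FINAL[states[-1]]


subs = path_weight([0, 1, 2, 3, 4, 5, 7, 14], "h o u s e <vblex> <subs>".split())
noun = path_weight([0, 1, 2, 3, 4, 6, 8, 14], "h o u s e <n> <sg>".split())
verb = path_weight([0, 1, 2, 3, 4, 6, 11, 14], "h o u s e <vblex> <inf>".split())
print(subs, noun, verb)  # 5.0 1.0 2.0
```

These totals match the hfst-fst2strings -w output (5, 1 and 2). HFST puts a weight of 1 on the shared "h" arc and the remaining verb weight on arcs that only the verb paths use, which is how it handles the shared "hous" prefix.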
Is lt-comp used in the apertium-eng to compile the dictionary?
it is
I believe the issue here is that Transducer::closure() disregards weight and as a result determinize() and minimize() lose any weights which are on epsilon transitions.
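As an illustration of what "disregards weight" would mean in practice, here is a minimal sketch that contrasts a plain epsilon closure with a weighted one. It is not lttoolbox's code and does not mirror the real Transducer::closure() signature; closure_unweighted, closure_weighted and the arc table are hypothetical, with the two ε:ε arcs taken from the lt-print listing of the cat example earlier in this thread. It assumes non-negative weights that add along a path.

```python
import heapq

# ε:ε arcs from the lt-print output of the cat example:
# source state -> list of (target state, weight)
EPSILON_ARCS = {4: [(7, 2.0)], 6: [(7, 1.0)]}


def closure_unweighted(state, eps_arcs):
    """Set of states reachable over epsilon arcs; the weights are thrown away."""
    seen = {state}
    stack = [state]
    while stack:
        current = stack.pop()
        for target, _weight in eps_arcs.get(current, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def closure_weighted(state, eps_arcs):
    """Cheapest epsilon-path weight to each reachable state (Dijkstra)."""
    best = {state: 0.0}
    heap = [(0.0, state)]
    while heap:
        dist, current = heapq.heappop(heap)
        if dist > best.get(current, float("inf")):
            continue  # stale queue entry
        for target, weight in eps_arcs.get(current, []):
            candidate = dist + weight
            if candidate < best.get(target, float("inf")):
                best[target] = candidate
                heapq.heappush(heap, (candidate, target))
    return best


print(closure_unweighted(4, EPSILON_ARCS))  # {4, 7}
print(closure_weighted(4, EPSILON_ARCS))    # {4: 0.0, 7: 2.0}
```

With the unweighted version, states 4 and 6 both reach final state 7 with no record of the 2.0 or 1.0 on the epsilon arcs, so any later step that merges states based on that closure has nothing left to preserve.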
@mr-martian, it seems at some point you attempted to fix it, but then had to revert. Any idea on what needs to be done?
The issue is that FST minimization was written for unweighted automata, and when weight support was added, closure() and/or minimize() were updated incorrectly. My first attempt at fixing it failed, so someone who knows FST algorithms better than me needs to go through that code.
Given the following paradigms and entries:
lt-proc seems to ignore the weights for the entries:
The expected result would be:
However, the weights work fine when they are used inside a paradigm: