apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

Properly disjunct multiple FSTs encoded in the same at&t file #60

Closed AMR-KELEG closed 5 years ago

AMR-KELEG commented 5 years ago

References #59. Fixes #56.

AMR-KELEG commented 5 years ago

Should negative state ids be allowed?

unhammer commented 5 years ago

The type at least is signed https://github.com/apertium/lttoolbox/blob/master/lttoolbox/transducer.h#L46 – I don't know if there is any code that depends on it being positive. OTOH, I've never seen negative state ids. How does it fail on negative state ids? (If it crashes or hangs, that's better than giving a wrong result …)


Why does it say "final@inconditional" when it compiles inputs that have no punctuation characters?

$ lttoolbox/lt-comp lr /tmp/simple.att /tmp/simple2.bin
Warning: Multiple fsts in '/tmp/simple.att' will be disjuncted.
main@standard 5 4
final@inconditional 3 2

Also, it no longer prints the disjuncted ones:

$ lt-print /tmp/simple2.bin
Error: empty set of final states

(which is weird, it does analyse the inputs – did it add an extra transducer without final states?) Fortunately printing works for the plain .att's.

On the plus side, there's no noticeable speed difference even with two passes over the file (~1.85s vs ~1.75s on a 13M .att file).

AMR-KELEG commented 5 years ago

The patch just adds a new initial state. I commented out the line that terminates lt-print, and here is the full output:

$ cat sample.att 
0       1       i       i
1       2       s       s
2       3       n       n
3       4       '       '
4       5       t       t
5       1.00
--
0       1       w       w
1       2       e       e
2       3       '       '
3       4       l       l
4       5       l       l
5       2.00

$ lt-print sample.bin
Error: empty set of final states
0   1   ε   ε   0.000000    
0   2   ε   ε   0.000000    
--
0   1   ε   ε   0.000000    
0   7   ε   ε   0.000000    
1   2   i   i   0.000000    
2   3   s   s   0.000000    
3   4   n   n   0.000000    
4   5   '   '   0.000000    
5   6   t   t   0.000000    
6   13  ε   ε   1.000000    
7   8   w   w   0.000000    
8   9   e   e   0.000000    
9   10  '   '   0.000000    
10  11  l   l   0.000000    
11  12  l   l   0.000000    
12  13  ε   ε   2.000000    
13  0.000000

I am not sure why the final@inconditional transducer is extracted! Additionally, the two epsilon transitions in it are strange and don't represent actual transitions. It seems to be a bug in the lt-comp command; I believe my patch isn't the source of it.
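
For readers following along, here is a rough Python sketch of the idea of disjuncting by adding a new initial state. It is not the lttoolbox C++ code and not the patch itself. The names `parse_att` and `disjunct` are made up for illustration, and it assumes whitespace-separated columns, `--` as the separator between FSTs, and that the first state mentioned in each FST is its initial state. Unlike the lt-print output above, it keeps each sub-FST's own final states instead of merging them into one.

```python
EPS = "ε"

SAMPLE = """\
0 1 i i
1 2 s s
2 3 n n
3 4 ' '
4 5 t t
5 1.00
--
0 1 w w
1 2 e e
2 3 ' '
3 4 l l
4 5 l l
5 2.00
"""

def parse_att(text):
    """Split AT&T text into one dict per FST, using '--' lines as separators."""
    fsts = []
    current = None
    for line in text.splitlines():
        cols = line.split()
        if not cols:
            continue
        if cols == ["--"]:
            current = None  # the next non-empty line starts a new FST
            continue
        if current is None:
            # Assumption: the first state mentioned is the initial state.
            current = {"start": int(cols[0]), "trans": [], "finals": {}}
            fsts.append(current)
        if len(cols) <= 2:
            # Final-state line: state id and an optional weight.
            weight = float(cols[1]) if len(cols) == 2 else 0.0
            current["finals"][int(cols[0])] = weight
        else:
            # Transition line: src dst input output [weight].
            weight = float(cols[4]) if len(cols) > 4 else 0.0
            current["trans"].append(
                (int(cols[0]), int(cols[1]), cols[2], cols[3], weight))
    return fsts

def disjunct(fsts):
    """Union the FSTs under a fresh initial state 0 with epsilon transitions."""
    new_trans, new_finals = [], {}
    next_id = 1  # 0 is reserved for the new shared initial state
    for fst in fsts:
        ids = {}  # old state id -> new state id, separate for every FST

        def rename(state):
            nonlocal next_id
            if state not in ids:
                ids[state] = next_id
                next_id += 1
            return ids[state]

        # Epsilon transition from the new initial state into this FST.
        new_trans.append((0, rename(fst["start"]), EPS, EPS, 0.0))
        for src, dst, sym_in, sym_out, weight in fst["trans"]:
            new_trans.append((rename(src), rename(dst), sym_in, sym_out, weight))
        for state, weight in fst["finals"].items():
            new_finals[rename(state)] = weight
    return new_trans, new_finals

if __name__ == "__main__":
    transitions, finals = disjunct(parse_att(SAMPLE))
    for src, dst, sym_in, sym_out, weight in transitions:
        print(src, dst, sym_in, sym_out, weight, sep="\t")
    for state, weight in finals.items():
        print(state, weight, sep="\t")
```

Because the renaming goes through a per-FST dictionary rather than an arithmetic offset, this sketch does not care whether the original ids are contiguous or even positive. Whether that fits how lt-comp stores states internally is a separate question.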

AMR-KELEG commented 5 years ago

Additionally, the patch will fail for negative state ids. I can open another issue so that we don't forget this current limitation and fix it later.
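
Until that is fixed, one option in the spirit of "crashing is better than a wrong result" would be to reject such input up front. The following is only a hypothetical Python sketch of such a check, not something lt-comp does; `check_state_ids` is an invented name, and it assumes whitespace-separated columns and `--` separator lines.

```python
def check_state_ids(att_text):
    """Raise ValueError on the first negative state id found in AT&T text."""
    for lineno, line in enumerate(att_text.splitlines(), start=1):
        cols = line.split()
        if not cols or cols == ["--"]:
            continue
        # Transition lines carry two state ids; final-state lines carry one.
        id_cols = cols[:2] if len(cols) > 2 else cols[:1]
        for col in id_cols:
            if int(col) < 0:
                raise ValueError(f"line {lineno}: negative state id {col}")

if __name__ == "__main__":
    check_state_ids("0 1 a a\n1 0.5\n")  # passes silently
    try:
        check_state_ids("0 -1 a a\n-1 0.5\n")
    except ValueError as err:
        print("rejected:", err)
```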