apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

hardcoded sanity-max state size for case-insensitive matching #168

Closed by unhammer 2 years ago

unhammer commented 2 years ago

As suggested in https://github.com/apertium/lttoolbox/issues/167#issuecomment-1276703418 , stop doing case-insensitive matching once the number of State sequences gets too high.

The limit is currently hardcoded at 65536, which is quite high but at least within what most modern machines can deal with.
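
To illustrate why a cap helps, here is a toy model (not the lttoolbox code) of how the number of live State sequences can grow. It assumes that, with case-insensitive matching on, each uppercase input letter is followed both as itself and as its lowercase form, and that a pattern like [a-zA-Z]+ accepts both, so the count doubles per letter. The function name live_states, the max_states parameter and the strict < comparison are made up for this sketch; only the value 65536 comes from this patch.

```python
from typing import Optional

# Value taken from the PR text; the constant name is hypothetical.
MAX_CASE_INSENSITIVE_STATES = 65536


def live_states(word: str, max_states: Optional[int] = None) -> int:
    """Count live state sequences after reading `word` in a toy model.

    Assumption: every uppercase letter can be matched as itself or as its
    lowercase form, and the pattern accepts both, so each such letter
    doubles the number of live sequences. Once the count has reached
    `max_states`, the case-insensitive alternative is no longer tried and
    the count stops growing. With max_states=None there is no cap.
    """
    count = 1
    for ch in word:
        still_case_insensitive = max_states is None or count < max_states
        if ch.isupper() and still_case_insensitive:
            count *= 2
    return count


if __name__ == "__main__":
    word = "BADGERBADGERBADGERBAD"
    print(live_states(word))                               # no cap
    print(live_states(word, MAX_CASE_INSENSITIVE_STATES))  # capped
```

Under these assumptions, the 21-letter test word grows to 2**21 sequences without a cap and stops at 65536 with it. The real processors track actual State objects rather than a counter, so this only models the growth, not the memory figures below.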

Also, delete FSTProcessor.current_state: confusingly, all the processors (except transliteration) make a local State that is also called current_state.

$ cat a.dix 
<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
<alphabet/>
<sdefs>
  <sdef n="guess"       c="Guesser"/>
</sdefs>

<pardefs>
<pardef n="a-zA-Z+">
  <e><re>[a-zA-Z]+</re></e>
</pardef>
</pardefs>

<section id="regex" type="standard">
<e><par n="a-zA-Z+"/><p><l/><r><s n="guess"/></r></p></e>
</section>
</dictionary>

$ lt-comp lr a.dix a.bin
regex@standard 3 105

$ echo 'BADGERBADGERBADGERBAD' | \time lt-proc -w a.bin |head -c50 # BEFORE patch
^BADGERBADGERBADGERBAD/bAdGeRbAdGeRbAdGeRBAD<guessCommand terminated by signal 13
5.31user 1.08system 0:06.39elapsed 100%CPU (0avgtext+0avgdata 3815888maxresident)k
0inputs+0outputs (0major+1053716minor)pagefaults 0swaps

$ echo 'BADGERBADGERBADGERAD' | \time lttoolbox/lt-proc -w a.bin |head -c50 # AFTER patch
^BADGERBADGERBADGERAD/bAdGeRbAdGeRBADGERAD<guess>/Command terminated by signal 13
0.16user 0.03system 0:00.20elapsed 102%CPU (0avgtext+0avgdata 65248maxresident)k
0inputs+0outputs (0major+17537minor)pagefaults 0swaps

$ lt-comp rl a.dix g.bin
regex@standard 3 105

$ echo '^BADGERBADGERBADGERBAD<guess>$' | \time lt-proc -g g.bin # BEFORE patch
BADGERBADGERBADGERBAD
3.21user 0.74system 0:03.96elapsed 99%CPU (0avgtext+0avgdata 2544316maxresident)k
0inputs+0outputs (0major+704011minor)pagefaults 0swaps

$ echo '^BADGERBADGERBADGERBAD<guess>$' | \time lttoolbox/lt-proc -g g.bin # AFTER patch
BADGERBADGERBADGERBAD
0.02user 0.01system 0:00.02elapsed 118%CPU (0avgtext+0avgdata 8344maxresident)k
0inputs+0outputs (0major+2079minor)pagefaults 0swaps

# b.bin from issue #167 
$ echo '^BADGERBADGERBADGERBAD<guess>$' | \time lt-proc -b b.bin # BEFORE patch
^BADGERBADGERBADGERBAD<guess>/BADGERBADGERBADGERBAD<guess>$
3.96user 0.91system 0:04.87elapsed 99%CPU (0avgtext+0avgdata 2544488maxresident)k
0inputs+0outputs (0major+704013minor)pagefaults 0swaps

$ echo '^BADGERBADGERBADGERBAD<guess>$' | \time lttoolbox/lt-proc -b b.bin # AFTER patch
^BADGERBADGERBADGERBAD<guess>/BADGERBADGERBADGERBAD<guess>$
0.21user 0.02system 0:00.23elapsed 100%CPU (0avgtext+0avgdata 82504maxresident)k
0inputs+0outputs (0major+21817minor)pagefaults 0swaps

Tested both nob→dan (uses lt-proc -g) and nob→nno (uses lt-proc -g -b and lt-proc -p): no diffs on 40k lines of corpus, except that nob→dan actually manages to get through its corpus now :)