Closed unhammer closed 2 years ago
As suggested in https://github.com/apertium/lttoolbox/issues/167#issuecomment-1276703418 , stop doing the case-insensitive matching when we've got a high number of State sequences.
Currently 65536, quite high but at least within what most modern machines can deal with.
Also, delete FSTProcessor.current_state since confusingly all the processors (except transliteration) make a local State called current_state
$ cat a.dix <?xml version="1.0" encoding="UTF-8"?> <dictionary> <alphabet/> <sdefs> <sdef n="guess" c="Guesser"/> </sdefs> <pardefs> <pardef n="a-zA-Z+"> <e><re>[a-zA-Z]+</re></e> </pardef> </pardefs> <section id="regex" type="standard"> <e><par n="a-zA-Z+"/><p><l/><r><s n="guess"/></r></p></e> </section> </dictionary> $ lt-comp lr a.dix a.bin regex@standard 3 105 $ echo 'BADGERBADGERBADGERBAD' | \time lt-proc -w a.bin |head -c50 # BEFORE patch ^BADGERBADGERBADGERBAD/bAdGeRbAdGeRbAdGeRBAD<guessCommand terminated by signal 13 5.31user 1.08system 0:06.39elapsed 100%CPU (0avgtext+0avgdata 3815888maxresident)k 0inputs+0outputs (0major+1053716minor)pagefaults 0swaps $ echo 'BADGERBADGERBADGERAD' | \time lttoolbox/lt-proc -w a.bin |head -c50 # AFTER patch ^BADGERBADGERBADGERAD/bAdGeRbAdGeRBADGERAD<guess>/Command terminated by signal 13 0.16user 0.03system 0:00.20elapsed 102%CPU (0avgtext+0avgdata 65248maxresident)k 0inputs+0outputs (0major+17537minor)pagefaults 0swaps $ lt-comp rl a.dix g.bin regex@standard 3 105 $ echo '^BADGERBADGERBADGERBAD<guess>$' | \time lt-proc -g g.bin # BEFORE patch BADGERBADGERBADGERBAD 3.21user 0.74system 0:03.96elapsed 99%CPU (0avgtext+0avgdata 2544316maxresident)k 0inputs+0outputs (0major+704011minor)pagefaults 0swaps $ echo '^BADGERBADGERBADGERBAD<guess>$' | \time lttoolbox/lt-proc -g g.bin # AFTER patch BADGERBADGERBADGERBAD 0.02user 0.01system 0:00.02elapsed 118%CPU (0avgtext+0avgdata 8344maxresident)k 0inputs+0outputs (0major+2079minor)pagefaults 0swaps # b.bin from issue #167 $ echo '^BADGERBADGERBADGERBAD<guess>$' | \time lt-proc -b b.bin # BEFORE patch ^BADGERBADGERBADGERBAD<guess>/BADGERBADGERBADGERBAD<guess>$ 3.96user 0.91system 0:04.87elapsed 99%CPU (0avgtext+0avgdata 2544488maxresident)k 0inputs+0outputs (0major+704013minor)pagefaults 0swaps $ echo '^BADGERBADGERBADGERBAD<guess>$' | \time lttoolbox/lt-proc -b b.bin # AFTER patch ^BADGERBADGERBADGERBAD<guess>/BADGERBADGERBADGERBAD<guess>$ 0.21user 0.02system 0:00.23elapsed 100%CPU (0avgtext+0avgdata 82504maxresident)k 0inputs+0outputs (0major+21817minor)pagefaults 0swaps
Tested both nob→dan (uses lt-proc -g) and nob→nno (uses lt-proc -g -b and lt-proc -p), no diffs on 40k lines of corpus, except nob→dan actually manages to get through its corpus now :)
As suggested in https://github.com/apertium/lttoolbox/issues/167#issuecomment-1276703418 , stop doing the case-insensitive matching when we've got a high number of State sequences.
Currently 65536, quite high but at least within what most modern machines can deal with.
Also, delete FSTProcessor.current_state since confusingly all the processors (except transliteration) make a local State called current_state
Tested both nob→dan (uses lt-proc -g) and nob→nno (uses lt-proc -g -b and lt-proc -p), no diffs on 40k lines of corpus, except nob→dan actually manages to get through its corpus now :)