apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

lt-trim: new option --match-section #165

unhammer closed 2 years ago

unhammer commented 2 years ago

The option may be given multiple times. Any analyser section whose name (id@type) is given this way will only be trimmed against bidix sections with the same name. This is useful for regex sections, which tend to have a very different structure from regular entries (few states with lots of transitions and loops), and that difference leads to a slowdown when intersecting.
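The pairing rule described above can be sketched roughly as follows. This is only an illustration in Python, not the lt-trim implementation: `plan_trimming` is a hypothetical helper, the bidix section names in the example are made up, and the treatment of unmatched sections (trimmed against all bidix sections that were not named) is an assumption.

```python
from typing import Dict, Iterable, List


def plan_trimming(
    analyser_sections: Iterable[str],
    bidix_sections: Iterable[str],
    matched: Iterable[str],
) -> Dict[str, List[str]]:
    """Map each analyser section name to the bidix sections it is trimmed against."""
    matched_names = set(matched)
    bidix_names = list(bidix_sections)
    # Assumption: bidix sections not named by --match-section form a shared pool.
    shared = [name for name in bidix_names if name not in matched_names]
    plan: Dict[str, List[str]] = {}
    for name in analyser_sections:
        if name in matched_names:
            # A matched section only sees the bidix section with the same name.
            plan[name] = [name] if name in bidix_names else []
        else:
            plan[name] = list(shared)
    return plan


# Analyser section names are taken from the lt-trim output in this PR;
# the bidix section names are hypothetical.
example_plan = plan_trimming(
    ["final@inconditional", "main@standard", "regex@standard"],
    ["main@standard", "regex@standard"],
    ["regex@standard"],
)
```

With this plan the large main@standard section is never intersected with the loop-heavy regex section, which is where the slowdown is said to come from.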

This gives a 4x speedup (60s → 15s) on nob→nno:

BEFORE:

$ \time lttoolbox/lttoolbox/lt-trim apertium-nob/nob.automorf.bin apertium-nno-nob/nob-nno.autobil.bin /tmp/before.bin
final@inconditional 26 76
main@standard 168643 350041
regex@standard 403 7475
58.73user 0.97system 1:00.45elapsed 98%CPU (0avgtext+0avgdata 2280784maxresident)k
0inputs+3288outputs (0major+574892minor)pagefaults 0swaps

AFTER:

$ \time lttoolbox/lttoolbox/lt-trim --match-section=regex@standard apertium-nob/nob.automorf.bin apertium-nno-nob/nob-nno.autobil.bin /tmp/after.bin
Matched sections regex@standard
final@inconditional 26 76
main@standard 168643 350041
regex@standard 389 7405
14.36user 0.24system 0:14.77elapsed 98%CPU (0avgtext+0avgdata 382136maxresident)k
0inputs+3288outputs (0major+102452minor)pagefaults 0swaps

(timings are the same if lt-comp -j was used to make nob.automorf.bin)
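For anyone wanting to check the numbers, here is a small Python sketch that pulls the elapsed time and peak memory out of the two GNU time summary lines above and computes the ratios. The helper name `parse_time_line` is made up for this example, and it assumes the default GNU time output format with elapsed time given as [hours:]minutes:seconds.

```python
import re
from typing import Tuple

# Summary lines copied from the BEFORE and AFTER runs above.
BEFORE = "58.73user 0.97system 1:00.45elapsed 98%CPU (0avgtext+0avgdata 2280784maxresident)k"
AFTER = "14.36user 0.24system 0:14.77elapsed 98%CPU (0avgtext+0avgdata 382136maxresident)k"


def parse_time_line(line: str) -> Tuple[float, int]:
    """Return (elapsed seconds, max resident set size in KB) from a GNU time line."""
    elapsed = re.search(r"((?:\d+:)?\d+:\d+(?:\.\d+)?)elapsed", line)
    maxres = re.search(r"(\d+)maxresident", line)
    if elapsed is None or maxres is None:
        raise ValueError("not a GNU time summary line: " + line)
    seconds = 0.0
    # Fold [hours:]minutes:seconds into plain seconds.
    for part in elapsed.group(1).split(":"):
        seconds = seconds * 60 + float(part)
    return seconds, int(maxres.group(1))


before_seconds, before_kb = parse_time_line(BEFORE)
after_seconds, after_kb = parse_time_line(AFTER)
speedup = before_seconds / after_seconds
memory_ratio = before_kb / after_kb
print(f"elapsed: {speedup:.2f}x faster, peak memory: {memory_ratio:.2f}x smaller")
```

Besides the roughly 4x speedup in wall-clock time, the same two lines show peak resident memory dropping from 2280784 KB to 382136 KB.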

mr-martian commented 2 years ago

I originally designed cli.h so that situations like this could easily be handled as -s regex1 -s regex2, and I'm not sure whether this method of supporting both multiple arguments and comma separation is good or bad.
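To make the two styles concrete, here is a rough Python sketch using argparse as a stand-in. It is not cli.h and not the lt-trim code: `parse_match_sections` is a hypothetical helper, and whether the merged option still accepts comma-separated values is not stated in this thread.

```python
import argparse
from typing import List, Sequence


def parse_match_sections(argv: Sequence[str]) -> List[str]:
    """Collect section names from repeated -s/--match-section options."""
    parser = argparse.ArgumentParser(prog="lt-trim-sketch")
    # action="append" gathers every occurrence of the option into one list.
    parser.add_argument("-s", "--match-section", action="append", metavar="NAME")
    args = parser.parse_args(list(argv))
    names: List[str] = []
    for value in args.match_section or []:
        # Optional second style: also split each value on commas.
        names.extend(part for part in value.split(",") if part)
    return names


repeated = parse_match_sections(["-s", "regex1", "-s", "regex2"])
combined = parse_match_sections(["--match-section=regex1,regex2"])
```

Both calls yield the same list here. Dropping the comma split leaves only the repeated-option style, which keeps parsing unambiguous if a section name could ever contain a comma.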

unhammer commented 2 years ago

Oh! I didn't know you could do that, that's much nicer!

unhammer commented 2 years ago

force-updated with nicer cli