apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

Unidirectional Compilation #155

Closed mr-martian closed 2 years ago

mr-martian commented 2 years ago

The goal of this PR is to make it so that, in place of

lt-comp lr spa.dix spa.automorf.bin # compile left-to-right and general paths
lt-comp rl spa.dix spa.autogen.bin  # compile right-to-left and general paths

we can instead write

lt-comp u spa.dix .deps/spa.dix.bin               # compile all paths, marking ones that have restrictions
lt-restrict lr .deps/spa.dix.bin spa.automorf.bin # remove right-to-left paths
lt-restrict rl .deps/spa.dix.bin .deps/spa.RL.bin # remove left-to-right paths
lt-invert .deps/spa.RL.bin spa.autogen.bin        # invert

Why, you might ask, would we want to replace 2 commands with 4 (or 3, if I make lt-restrict invert the fst when the direction is rl)? Well, if LT_RELEASE is unset or set to no, lt-restrict will not minimize the transducer. Minimization, even after recent optimizations, is still by far the biggest piece of the process, so skipping it significantly cuts overall compile time, especially for languages like -oci, where the dictionary is compiled 6 times.
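For reference, the four-step sequence can be sketched as a dry run that only prints the commands for a given language code. This is a minimal sketch: the function name plan_unidirectional is hypothetical, the file layout is simply copied from the spa example in the description, and nothing here calls lttoolbox itself.

```shell
#!/bin/sh
# Dry-run sketch: print the build commands from the PR description for one
# dictionary. Nothing is executed, so lttoolbox does not need to be installed.
# plan_unidirectional is a hypothetical helper name, not part of lttoolbox.
plan_unidirectional() {
    code=$1      # language code, e.g. spa
    deps=.deps   # intermediate directory, as in the PR description

    # Compile once, keeping every path and marking the restricted ones.
    echo "lt-comp u $code.dix $deps/$code.dix.bin"
    # Analyser: remove the right-to-left paths.
    echo "lt-restrict lr $deps/$code.dix.bin $code.automorf.bin"
    # Generator: remove the left-to-right paths, then invert.
    echo "lt-restrict rl $deps/$code.dix.bin $deps/$code.RL.bin"
    echo "lt-invert $deps/$code.RL.bin $code.autogen.bin"
}

plan_unidirectional spa
```

The only expensive step is the first one, so each additional output derived from the same .deps/spa.dix.bin costs one lt-restrict call rather than a full recompilation.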

This PR is a draft because in order for this to be fully usable, I need to also write a tool to apply an ACX file to an already-compiled transducer.

Oh, and I wrote a wrapper around getopt because I was tired of typing the same boilerplate over and over again.

mr-martian commented 2 years ago
I tested this on -oci and got

setup               invocations of lt-comp  time
current             6                       ~7:30
merging variants    1                       ~14:00
merging directions  3                       ~6:00

I also observed a slowdown at runtime. If it is due to the different fst structure, it would roughly cancel out the benefits for any workflow that runs a large corpus through the pipeline after each recompilation.

It would probably also be worth checking whether a language with less divergence between variants would have as much of a slowdown from merging them.

And I should add tests.

mr-martian commented 2 years ago
I tested this on -cat and got

setup                  invocations of lt-comp  time
current                4                       2:07
merging variants       1                       1:23
merging, release mode  1                       2:01

So it seems that the usefulness of this will need to be determined on a language-by-language basis.
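To make the language-by-language difference concrete, the timings reported above can be turned into relative changes against the current setup. This is a rough sketch: the helper names to_seconds and pct_change are hypothetical, and the -oci figures are approximate in the original, so the percentages are only indicative.

```shell
#!/bin/sh
# to_seconds: convert an m:ss timing into whole seconds.
# awk is used so that values like "07" are read as decimal, not octal.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 60 + $2 }'
}

# pct_change BASE NEW: signed percentage change of NEW relative to BASE.
pct_change() {
    awk -v base="$(to_seconds "$1")" -v new="$(to_seconds "$2")" \
        'BEGIN { printf "%+.0f%%\n", (new - base) / base * 100 }'
}

# Timings copied from the two comments (the -oci values are approximate).
printf '%-32s %s\n' "-oci merging variants:"      "$(pct_change 7:30 14:00)"  # +87%
printf '%-32s %s\n' "-oci merging directions:"    "$(pct_change 7:30 6:00)"   # -20%
printf '%-32s %s\n' "-cat merging variants:"      "$(pct_change 2:07 1:23)"   # -35%
printf '%-32s %s\n' "-cat merging, release mode:" "$(pct_change 2:07 2:01)"   # -5%
```

On these numbers, merging variants nearly doubles compile time for -oci but cuts it by about a third for -cat, which matches the conclusion that the benefit depends on the language.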

I also made lt-restrict rl invert the transducers, and made lt-comp accept more than one variant or alt value, separated by spaces (replacing apertium-genvdix).