Closed mr-martian closed 2 years ago
I tested this on -oci and got

| setup | invocations of `lt-comp` | time |
|---|---|---|
| current | 6 | ~7:30 |
| merging variants | 1 | ~14:00 |
| merging directions | 3 | ~6:00 |

I also observed a slowdown in runtime which, if it's due to the different FST structure, would roughly cancel out the benefits if your workflow involves running a large corpus through the pipeline after each recompilation.
It would probably also be worth checking whether a language with less divergence between variants would have as much of a slowdown from merging them.
And I should add tests.
I tested this on -cat and got

| setup | invocations of `lt-comp` | time |
|---|---|---|
| current | 4 | 2:07 |
| merging variants | 1 | 1:23 |
| merging, release mode | 1 | 2:01 |

So it seems that the usefulness of this will need to be determined on a language-by-language basis.
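To make the comparison between the two sets of timings easier to read, here is a small Python sketch that expresses each setup's compile time as a fraction of the "current" setup. The numbers are copied from the tables above (the -oci ones are approximate); the helper names `to_seconds` and `relative_times` are made up for this illustration and are not part of any tool.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string such as '~7:30' to seconds."""
    minutes, seconds = mmss.lstrip("~").split(":")
    return int(minutes) * 60 + int(seconds)


# Timings as reported in the comments above.
timings = {
    "-oci": {
        "current": "~7:30",
        "merging variants": "~14:00",
        "merging directions": "~6:00",
    },
    "-cat": {
        "current": "2:07",
        "merging variants": "1:23",
        "merging, release mode": "2:01",
    },
}


def relative_times(rows):
    """Return each setup's time divided by the 'current' time, rounded to 2 places."""
    baseline = to_seconds(rows["current"])
    return {setup: round(to_seconds(t) / baseline, 2) for setup, t in rows.items()}


for lang, rows in timings.items():
    print(lang, relative_times(rows))
```

On these figures, merging variants takes about 1.87 times as long as the current setup for -oci but only about 0.65 times as long for -cat, which is the language-by-language split described above.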
I also made `lt-restrict rl` invert the transducers and `lt-comp` accept more than one variant or alt value separated by spaces (replacing `apertium-genvdix`).
The goal of this PR is to make it so that in place of
we can instead write
Why, you might ask, would we want to replace 2 commands with 4 (or 3, if I make `lt-restrict` invert the fst when the direction is `rl`)? Well, if `LT_RELEASE` is unset or is set to `no`, `lt-restrict` will not minimize the transducer (which, even after recent optimizations, is still by far the biggest piece of the process), significantly cutting down on overall compile time, especially for languages like -oci where the dictionary is getting compiled 6 times.

This PR is a draft because in order for this to be fully usable, I need to also write a tool to apply an ACX file to an already-compiled transducer.
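For illustration, the `LT_RELEASE` check described above could look roughly like the following Python sketch. This is not the tool's actual implementation. The function name `should_minimize` is hypothetical, and treating every value other than unset or `no` as "release mode" is an assumption, since only the unset and `no` cases are described here.

```python
import os


def should_minimize(env=None):
    """Decide whether to run minimization, based on LT_RELEASE.

    Assumption: any value other than unset or "no" enables minimization.
    """
    if env is None:
        env = os.environ
    value = env.get("LT_RELEASE")
    return value is not None and value != "no"


print(should_minimize({}))                     # LT_RELEASE unset
print(should_minimize({"LT_RELEASE": "no"}))   # explicitly disabled
print(should_minimize({"LT_RELEASE": "yes"}))  # assumed release mode
```

The first two calls print `False` (the fast development build that skips minimization) and the third prints `True`.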
Oh, and I wrote a wrapper around `getopt` because I was tired of typing the same boilerplate over and over again.