apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

lt-trim trims valid analyses #156

Closed marcriera closed 2 years ago

marcriera commented 2 years ago

lt-trim seems to trim valid analyses containing +.

For the text I+D, spa.automorf.bin correctly returns ^I+D/I+D<n><acr><f><sg>$. However, spa-cat.automorf.bin doesn't return any analysis despite the bidix containing a valid entry.

I've tried expanding the pair's analysis transducer, and the entry seems to be missing. Running the pair pipeline with spa.automorf.bin instead of the trimmed spa-cat.automorf.bin produces a valid translation.

mr-martian commented 2 years ago

So the issue here is that, as far as I can tell, no code currently distinguishes between + as a normal character and + as a compound separator (same issue as in https://github.com/apertium/apertium/issues/171).

The solution should probably be to only treat + as a compound separator if the preceding symbol is a tag.

That doesn't cover the case where the second element of a compound has a lemma beginning with +, in which case ... please don't.

mr-martian commented 2 years ago

On second thought, lemma beginning with a + would be ok because then the compound would have ++.
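The tag-based rule proposed above can be illustrated with a small string-level sketch. This is not the lttoolbox code, which operates on transducers rather than strings; the function name `split_compound` is hypothetical, and it assumes every tag ends in `>` and that lemmas contain no unescaped `>`.

```python
def split_compound(analysis):
    """Split an analysis on '+' only where the '+' directly follows a tag.

    Hypothetical helper for illustration; assumes tags end in '>' and
    lemmas never contain an unescaped '>'.
    """
    parts = []
    start = 0
    for i, ch in enumerate(analysis):
        # A '+' is a compound separator only if the previous symbol closed a tag.
        if ch == "+" and i > 0 and analysis[i - 1] == ">":
            parts.append(analysis[start:i])
            start = i + 1
    parts.append(analysis[start:])
    return parts


# '+' inside a lemma is kept as an ordinary character.
print(split_compound("I+D<n><acr><f><sg>"))
# '+' after a tag separates two compound elements.
print(split_compound("de<pr>+el<det><def><m><sg>"))
# A second element whose lemma begins with '+' shows up as '++'.
print(split_compound("a<n>++b<n>"))
```

In the `++` case, only the first `+` follows a tag, so the second one stays attached to the following lemma, which matches the observation in the comment above.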

mr-martian commented 2 years ago

Or, in this specific case, we can do the even dumber thing: when we see a +, run the code paths for both word boundaries and normal symbols. This appears to work fine (no regressions on oci-fra).
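A rough sketch of the "try both readings" idea, again on plain strings rather than on the transducers lt-trim actually works with. Everything here is an assumption made for illustration: the names `unit_matches` and `kept`, and the model of the bidix as a set of lemma-plus-tags prefixes after which only further tags may follow.

```python
import re

# Zero or more tags, then end of string.
TAGS_ONLY = re.compile(r"(?:<[^<>]+>)*\Z")


def unit_matches(unit, bidix):
    """True if `unit` is a bidix prefix followed only by tags (toy model)."""
    return any(
        unit.startswith(entry) and TAGS_ONLY.match(unit[len(entry):])
        for entry in bidix
    )


def kept(analysis, bidix):
    """Decide whether a toy trimmer would keep `analysis`.

    Each '+' is tried both ways: as an ordinary symbol (the whole string
    is matched as one unit) and as a word boundary (the part before the
    '+' is one unit and the rest is checked recursively).
    """
    # Reading 1: every '+' in this span is an ordinary character.
    if unit_matches(analysis, bidix):
        return True
    # Reading 2: some '+' is a boundary between two units.
    for i, ch in enumerate(analysis):
        if ch == "+" and unit_matches(analysis[:i], bidix) and kept(analysis[i + 1:], bidix):
            return True
    return False


bidix = {"I+D<n>", "de<pr>", "el<det>"}
print(kept("I+D<n><acr><f><sg>", bidix))          # literal '+' reading succeeds
print(kept("de<pr>+el<det><def><m><sg>", bidix))  # boundary reading succeeds
```

Trying both readings avoids having to decide up front what a given + means, at the cost of exploring more paths.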

marcriera commented 2 years ago

It works now without issues. Thanks for the quick fix!