apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

Support compound analyses where left part has spaces #139

unhammer closed 2 years ago

unhammer commented 2 years ago

If "kake" and "formel 1-" are both in the dix, we can now analyse "formel 1-kake" as a compound. All the spaces have to be in a single left part, so "kakeformel 1" isn't supported, nor is "formel 1-formel 1".

This only takes effect when lt-proc is run with the -e option.
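The restriction above can be illustrated with a small toy model. This is only a sketch of the rule as described, not the lttoolbox implementation: it assumes two-part compounds, and the names `LEFT_PARTS`, `RIGHT_PARTS` and `analyse_compound` are hypothetical, as are the lexicon entries.

```python
# Toy lexicon: hypothetical entries, not taken from any real .dix file.
LEFT_PARTS = {"formel 1-", "kake"}   # entries allowed as a compound left part
RIGHT_PARTS = {"kake", "formel 1"}   # entries allowed as a compound right part


def analyse_compound(surface):
    """Return all (left, right) splits where only the left part may contain spaces.

    Assumption: compounds have exactly two parts. The real analyser works on
    a finite-state transducer, not on Python sets.
    """
    splits = []
    for i in range(1, len(surface)):
        left, right = surface[:i], surface[i:]
        if " " in right:
            # The right part may never contain a space.
            continue
        if left in LEFT_PARTS and right in RIGHT_PARTS:
            splits.append((left, right))
    return splits


print(analyse_compound("formel 1-kake"))      # [('formel 1-', 'kake')]
print(analyse_compound("kakeformel 1"))       # []
print(analyse_compound("formel 1-formel 1"))  # []
```

The last two inputs get no analysis because the space would have to be in the right part, which matches the unsupported cases listed above.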

This closes #138


This makes analysis slightly slower when compounding is in effect (and your FST has multiwords with compounding), but lt-proc is far from being the bottleneck. For nob-dan the difference is negligible: 13.8s vs 13.6s on 50k lines. For nob-nno it is 19.9s vs 17.9s on 50k lines.
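To put those timings in relative terms, here is a short calculation. It assumes the first figure in each pair is the run with this change and the second is the baseline; the helper name `relative_slowdown` is made up for this sketch.

```python
def relative_slowdown(with_change_s, baseline_s):
    """Percentage increase in run time relative to the baseline."""
    return (with_change_s - baseline_s) / baseline_s * 100


# Timings in seconds on 50k lines, as quoted above: (with change, baseline).
timings = {"nob-dan": (13.8, 13.6), "nob-nno": (19.9, 17.9)}

for pair, (with_change, baseline) in timings.items():
    print(f"{pair}: {relative_slowdown(with_change, baseline):.1f}% slower")
# nob-dan: 1.5% slower
# nob-nno: 11.2% slower
```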

For Norwegian, these types of compounds nearly always have a dash, but it was simpler to implement without that requirement, so .dix writers get to (have to) decide whether words with spaces should be able to act as cp-L without a dash.

Caveat: If the dix already has words with spaces that allow cp-L but shouldn't, they'll now start compounding. That should be fixed in the dix anyway, but it will alter translations. I think I maintain most of the .dix files that use compounding, though :)

unhammer commented 2 years ago

(Discovered https://github.com/apertium/lttoolbox/issues/140 while checking for regressions, but that's a separate issue; I think this can be merged regardless.)

unhammer commented 2 years ago

Merged as https://github.com/apertium/lttoolbox/commit/97e29776d71b535a9b9f74ef5cdf5f31220436b3