apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

soft hyphens not always ignored #140

Closed by unhammer 2 years ago

unhammer commented 2 years ago

https://github.com/apertium/lttoolbox/blob/44c8c94261bb006443f1c9260b114dbf537376b6/lttoolbox/fst_processor.cc#L260-L265 is a feature that skips soft hyphens (and optionally other characters) during analysis, but it fails in some cases, in particular with compounds that come right after backtracking from multiword-expression (MWE) analyses.

E.g. ^riv/rive<vblex><inf>/rive<n><m><sg><ind>/rive<n><f><sg><ind>$ ^se/se<vblex><imp>/se<vblex><inf>$­^teranlegget/*teranlegget$

should be

^rive/rive<vblex><inf>/rive<n><m><sg><ind>/rive<n><f><sg><ind>$ ^seteranlegget/seter<n><m><sg><ind><cmp>+anlegg<n><nt><sg><def>/seter<n><f><sg><ind><cmp>+anlegg<n><nt><sg><def>$

There may be other cases as well, but it definitely happens when we've read a prefix of a word-with-spaces (e.g. rive bort is in the dix) and then have to backtrack (input_buffer.back(int)).
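
To make the expected behaviour concrete, here is a minimal sketch of a buffer that skips soft hyphens at read time, so that anything re-read after a rewind is skipped again. This is not the lttoolbox code and it does not claim to reproduce the actual cause of the bug: `SketchBuffer`, `next`, `seek`, `pos` and `demo_backtrack` are hypothetical names made up for illustration, and the real `input_buffer` may work quite differently.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// U+00AD SOFT HYPHEN, the character the analyser is supposed to ignore.
constexpr char32_t SOFT_HYPHEN = U'\u00AD';

// Hypothetical stand-in for the processor's input buffer (an assumption,
// not the real lttoolbox class).
class SketchBuffer {
public:
  explicit SketchBuffer(std::u32string text) : text_(std::move(text)) {}

  // Return the next character that is not a soft hyphen, or U'\0' at end.
  // Skipping happens on every read, so it also applies after a rewind.
  char32_t next() {
    while (pos_ < text_.size()) {
      char32_t c = text_[pos_++];
      if (c != SOFT_HYPHEN) {
        return c;
      }
    }
    return U'\0';
  }

  // Raw position in the underlying text (soft hyphens are counted).
  std::size_t pos() const { return pos_; }

  // Rewind (or jump) to a previously saved raw position.
  void seek(std::size_t p) { pos_ = p > text_.size() ? text_.size() : p; }

private:
  std::u32string text_;
  std::size_t pos_ = 0;
};

// Simulate a failed word-with-spaces lookahead after "rive", then backtrack
// and return what the reader sees from the saved position onwards.
std::u32string demo_backtrack() {
  SketchBuffer buf(U"rive se\u00ADteranlegget");
  for (int i = 0; i < 4; ++i) {
    buf.next();  // consume "rive"
  }
  const std::size_t mark = buf.pos();  // remember where the lookahead starts
  for (int i = 0; i < 3; ++i) {
    buf.next();  // look ahead over " se", hoping for a longer match
  }
  assert(buf.pos() == mark + 3);
  buf.seek(mark);  // lookahead failed: backtrack to the saved position

  std::u32string rest;
  for (char32_t c = buf.next(); c != U'\0'; c = buf.next()) {
    rest.push_back(c);
  }
  return rest;
}
```

In this sketch `demo_backtrack()` returns `U" seteranlegget"`: the soft hyphen is dropped even though it is only reached after the rewind, which is the behaviour the expected analysis above relies on.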