Closed: unhammer closed this issue 2 years ago
The code at https://github.com/apertium/lttoolbox/blob/44c8c94261bb006443f1c9260b114dbf537376b6/lttoolbox/fst_processor.cc#L260-L265 implements a feature to skip soft hyphens (and optionally other characters) when analysing, but it fails in some cases, in particular with compounds that follow backtracking from multiword-expression (MWE) analyses.
E.g. ^riv/rive<vblex><inf>/rive<n><m><sg><ind>/rive<n><f><sg><ind>$ ^se/se<vblex><imp>/se<vblex><inf>$^teranlegget/*teranlegget$
should be
^rive/rive<vblex><inf>/rive<n><m><sg><ind>/rive<n><f><sg><ind>$ ^seteranlegget/seter<n><m><sg><ind><cmp>+anlegg<n><nt><sg><def>/seter<n><f><sg><ind><cmp>+anlegg<n><nt><sg><def>$
There may be other cases as well, but it definitely happens when we've read a prefix of a word-with-spaces (e.g. rive bort is in the dix), and then have to backtrack (input_buffer.back(int)).
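To make the suspected interaction concrete, here is a minimal toy model in C++. It is not lttoolbox's code: ToyReader, skip_on_replay and replay_after_backtrack are hypothetical names, and the sketch assumes that ignored characters are stored in the replay buffer and that the ignore check runs only on fresh reads from the stream. Under those assumptions, a soft hyphen that was skipped the first time is handed to the analyser once we backtrack and replay from the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy model only: ToyReader is hypothetical, not lttoolbox's Buffer or FSTProcessor.
struct ToyReader {
  std::u32string source;        // stands in for the input stream
  std::size_t source_pos = 0;   // next unread position in the stream
  std::u32string buffer;        // every character consumed so far, ignored ones included
  std::size_t buffer_pos = 0;   // replay cursor into buffer
  bool skip_on_replay = false;  // apply the ignore check to replayed characters too?

  static bool is_ignored(char32_t c) { return c == U'\u00AD'; }  // soft hyphen

  // Rewind the replay cursor by n buffer positions
  // (assumed to be analogous to input_buffer.back(int)).
  void back(std::size_t n) {
    assert(n <= buffer_pos);
    buffer_pos -= n;
  }

  // Return the next symbol for the analyser, or 0 at end of input.
  char32_t next() {
    for (;;) {
      const bool replay = buffer_pos < buffer.size();
      char32_t c;
      if (replay) {
        c = buffer[buffer_pos++];            // replayed after backtracking
      } else {
        if (source_pos >= source.size()) return 0;
        c = source[source_pos++];            // fresh read from the stream
        buffer.push_back(c);
        buffer_pos = buffer.size();
      }
      // Fresh reads always skip ignored characters; replays only if asked to.
      if (is_ignored(c) && (!replay || skip_on_replay)) continue;
      return c;
    }
  }
};

// Read four symbols, backtrack to the start, then read four symbols again.
std::u32string replay_after_backtrack(bool skip_on_replay) {
  ToyReader r;
  r.source = U"se\u00ADteranlegget";
  r.skip_on_replay = skip_on_replay;
  for (int i = 0; i < 4; ++i) r.next();  // the soft hyphen is skipped on this pass
  r.back(r.buffer_pos);                  // backtrack to the beginning
  std::u32string out;
  for (int i = 0; i < 4; ++i) out.push_back(r.next());
  return out;
}
```

In this toy, replay_after_backtrack(false) returns the soft hyphen as the third symbol, which would break a compound match, while replay_after_backtrack(true) returns only letters. Whether this is the actual cause in fst_processor.cc would need to be checked against the linked lines.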