apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

`lt-proc -g -b` should output @ symbol when there are unconsumed tags #182

Open unhammer opened 7 months ago


With a regular bidix, `lt-proc -b` just copies over unconsumed tags, and that is fine:

$ echo '^kake<n><m><unconsumed>$' |lt-proc -b nob-nno.autobil.bin
^kake<n><m><unconsumed>/kake<n><f><unconsumed>$

With regular generation (`lt-proc -g`), unconsumed tags lead to #-marks:

$ echo '^kake<n><f><sg><ind><unconsumed>$' |lt-proc -g nob-nno.autogen.bin
#kake

$ echo '^kake<n><f><sg><ind><unconsumed>$' |lt-proc --debugged-gen nob-nno.autogen.bin
#kake\<n\>\<f\>\<sg\>\<ind\>

But when using lt-proc in bilingual mode on a generator (`-g -b`), the unconsumed tag is copied to the output without any debug symbol:

$ echo '^kake<n><f><sg><ind><unconsumed>$' |lt-proc -g -b nob-nno.autogen.bin
^kake<n><f><sg><ind><unconsumed>/kake<unconsumed>$

(Completely unmatched words, by contrast, do get an @.)

This can lead to hard-to-debug issues when there is only a partial match: after the subsequent cg-proc step we just see the lemma as if it were the surface form, with no hint that it was not found in the generator.

Ideally, when the -b switch is given after -g (or -d), we would get an @ whenever there are unconsumed input tags. Note that we don't want an @ merely because there are tags in the output, e.g.

$ echo '^lykke<n><f><sg><ind>$' |lt-proc -g -b nob-nno.autogen.bin
^lykke<n><f><sg><ind>/lykke/lukke<v:lykke_lukke.vok-y2u>$

is still correct: here the whole input is consumed and nothing is left over, even though the output still contains a tag. But we want

$ echo '^kake<n><f><sg><ind><unconsumed>$' |lt-proc -g -b nob-nno.autogen.bin
^kake<n><f><sg><ind><unconsumed>/@kake$

and perhaps

$ echo '^kake<n><f><sg><ind><unconsumed>$' |lt-proc -d -b nob-nno.autogen.bin
^kake<n><f><sg><ind><unconsumed>/@kake\<unconsumed\>$

(though the details of -g vs -d are less important than just having the @ in there)
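
To make the requested behaviour concrete, here is a rough Python sketch that post-processes the current `lt-proc -g -b` output. It is not how lttoolbox works internally and is not the proposed fix, which would belong in the transducer itself, where unconsumed input is known exactly. The function names `split_tags` and `mark_unconsumed` are hypothetical, and the rule "trailing output tags identical to the trailing input tags count as unconsumed" is an assumed heuristic that only approximates that information. It also ignores escaped characters and tags containing `/`.

```python
import re

# One lexical unit as printed in bilingual mode: ^input/output1/output2$
LU = re.compile(r"\^([^/$]*)/([^$]*)\$")
TAG = re.compile(r"<[^<>]*>")


def split_tags(reading):
    """Split a reading such as 'kake<n><f>' into ('kake', ['<n>', '<f>'])."""
    first = reading.find("<")
    if first == -1:
        return reading, []
    return reading[:first], TAG.findall(reading[first:])


def mark_unconsumed(line, debug=False):
    """Prefix '@' to outputs whose tags look like copied-over input tags.

    debug=False mimics the wished-for -g -b output (tags dropped),
    debug=True mimics the wished-for -d -b output (tags kept, escaped).
    """
    def fix(match):
        source = match.group(1)
        targets = match.group(2).split("/")
        _, source_tags = split_tags(source)
        fixed = []
        for target in targets:
            lemma, tags = split_tags(target)
            # Assumed heuristic: the output tags are a suffix of the input tags.
            leftover = bool(tags) and source_tags[-len(tags):] == tags
            if leftover and not target.startswith("@"):
                if debug:
                    escaped = "".join(
                        t.replace("<", "\\<").replace(">", "\\>") for t in tags
                    )
                    target = "@" + lemma + escaped
                else:
                    target = "@" + lemma
            fixed.append(target)
        return "^" + source + "/" + "/".join(fixed) + "$"

    return LU.sub(fix, line)


if __name__ == "__main__":
    print(mark_unconsumed("^kake<n><f><sg><ind><unconsumed>/kake<unconsumed>$"))
    print(mark_unconsumed("^lykke<n><f><sg><ind>/lykke/lukke<v:lykke_lukke.vok-y2u>$"))
```

On the examples from this issue, the `kake` unit gets an @ while the `lykke` unit is left alone, because its output tag `<v:lykke_lukke.vok-y2u>` does not occur at the end of the input. The heuristic would misfire if a generator legitimately emitted a tag identical to the last input tag, which is one more reason to do this inside lt-proc rather than afterwards.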