apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

garden-path mwe's cause superblanks to be moved #47

Closed unhammer closed 5 years ago

unhammer commented 5 years ago

If you have the multiword entry legge# opp til in your monolingual analyser and try to analyse the input

legge opp<br/>blah

in HTML format, lt-proc will shift the <br/> into the middle of the analysis:

$ echo 'legge opp<br/>blah' |apertium-deshtml 
legge opp[<br\/>]blah.[][
]

↑ here the superblank is still at the end, after opp

$ echo 'legge opp<br/>blah' |apertium-deshtml |lt-proc -we ../apertium-nno-nob/nob-nno.automorf.bin
^legge/legge<vblex><inf>$[<br\/>]^opp/opp<pr>/opp<adv>/oppe<vblex><imp>$ ^blah/*blah$^./.<sent><clb>$[][
]

but ↑ here it's in the middle of the multiword.

From the code, it seems like what happens is that we

  1. read until legge; we've now seen a non-alphabetic character after a final state, so the index last=6 and lf=/legge<vblex><inf>.
  2. read further until legge opp[<br\/>], where we still don't know whether til will follow, so [<br\/>] ends up in blankqueue
  3. see b, meaning we can't go any further in that MWE, so we have to skip back to the last full analysis
  4. call printWord with the surface form legge
  5. call printSpace, which completely flushes blankqueue if it is non-empty, and otherwise outputs a space.
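
The steps above can be sketched as a small toy model. This is not lttoolbox code: it is a minimal Python simulation under the assumptions that the analyser knows only the one multiword from this report, that a superblank read during lookahead is queued and then seen by the matcher as a plain space, and that printing a separator flushes the whole queue. The names analyse, MULTIWORD and buffered are made up for illustration; only blankqueue, printWord and printSpace come from the description above, and analyses are omitted from the output.

```python
from collections import deque

MULTIWORD = ["legge", "opp", "til"]  # toy stand-in for the entry `legge# opp til`


def analyse(words, seps):
    """Toy model: words[i] is followed by seps[i] (" " or a "[...]" superblank)."""
    buffered = list(seps)   # separators as the matcher sees them (assumption)
    blankqueue = deque()    # superblanks waiting to be printed
    out = []
    i = 0
    while i < len(words):
        # Steps 1-3: read ahead while the input is still a prefix of the multiword.
        n = 0
        while (n < len(MULTIWORD) and i + n < len(words)
               and words[i + n] == MULTIWORD[n]):
            n += 1
            k = i + n - 1   # separator following the word just matched
            if (n < len(MULTIWORD) and k < len(buffered)
                    and buffered[k].startswith("[")):
                blankqueue.append(buffered[k])
                buffered[k] = " "   # from now on the matcher only sees a space
        # Full multiword match, or back off to the last complete single word.
        length = n if n == len(MULTIWORD) else 1
        # Step 4: the printWord analogue (surface form only).
        out.append("^" + " ".join(words[i:i + length]) + "$")
        i += length
        # Step 5: the printSpace analogue: flush the whole queue, else a space.
        if i - 1 < len(buffered):
            if buffered[i - 1].startswith("["):
                blankqueue.append(buffered[i - 1])
            if blankqueue:
                out.extend(blankqueue)
                blankqueue.clear()
            else:
                out.append(buffered[i - 1])
    out.extend(blankqueue)  # anything still queued goes at the very end
    return "".join(out)


# The superblank was read between opp and blah, but is printed after legge.
print(analyse(["legge", "opp", "blah"], [" ", r"[<br\/>]"]))
```

In this model the superblank comes out right after legge, because it was already sitting in the queue when the first separator was printed, and the position where it was actually read gets a plain space instead. That is the same shape as the lt-proc output in the report.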