apertium / apertium-separable

Module for reordering separable/discontiguous multiwords.
https://wiki.apertium.org/wiki/Apertium_separable
GNU General Public License v3.0

rule-initial <w/> can make other rules match #37

Closed: unhammer closed this issue 3 years ago

unhammer commented 3 years ago

It seems like a <w/> at the start of a rule can make the analyser move its position into a lexical unit even if the rule doesn't end up fully matching, allowing other rules to match from that point on.

apertium-nno-nob.nob-nno.lsx:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary type="separable">

  <alphabet></alphabet>

  <sdefs>
    <sdef n="adj"/>
  </sdefs>

  <pardefs>
    <pardef n="meh">
      <e><i>meh<s n="adj"/><t/><j/></i></e>
    </pardef>
  </pardefs>

  <section id="main" type="standard">

    <e c="override below rule if adj before">
      <i><w/>stuffnotininput<s n="adj"/><t/><j/></i>
      <i>DROP<s n="adj"/><t/><j/></i>
    </e>

    <e c="drop DROP and LEFT→RIGHT">
      <p><l>DROP<t/><j/></l> <r></r></p>
      <p><l>LEFT</l>           <r>RIGHT</r></p> <i><t/><j/></i>
    </e>
  </section>

</dictionary>

$ lsx-comp lr apertium-nno-nob.nob-nno.lsx nob-nno.autoseq.bin
main@standard 39 44

$ echo '^keptDROP<adj><sg>$ ^LEFT<n><sg>$' | lsx-proc nob-nno.autoseq.bin
^keptRIGHT<n><sg>$

Neither entry should have matched here. Yet it seems like we got a partial match on the first entry and then backtracked only as far as the point where the second entry could start matching, instead of backtracking all the way out of the word (to before the ^).
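Since the expected result is simply the input passed through unchanged, the report can be turned into a small regression check. This is only a sketch: it assumes `lsx-proc` is on the PATH and that `nob-nno.autoseq.bin` was compiled as shown above; the helper names `run_lsx_proc` and `is_untouched` are made up for illustration.

```python
import subprocess


def run_lsx_proc(stream, binary="nob-nno.autoseq.bin"):
    """Pipe an Apertium stream through lsx-proc, like the echo | lsx-proc call above.

    Assumes lsx-proc is installed and `binary` exists in the working directory.
    """
    result = subprocess.run(
        ["lsx-proc", binary],
        input=stream + "\n",  # echo also appends a newline
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def is_untouched(stream, run=run_lsx_proc):
    """Return True if the processor hands the stream back unchanged.

    `run` is injectable so the comparison can be exercised without lsx-proc.
    Leading and trailing whitespace is ignored.
    """
    return run(stream).strip() == stream.strip()
```

Calling `is_untouched("^keptDROP<adj><sg>$ ^LEFT<n><sg>$")` should return `True` once the bug is fixed; with the behaviour reported here it returns `False`, because the output is `^keptRIGHT<n><sg>$`.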

(thanks @victoria-tro for reporting)

unhammer commented 3 years ago

A more minimal example:

<?xml version="1.0" encoding="UTF-8"?>
<dictionary type="separable">

  <alphabet></alphabet>

  <sdefs>
    <sdef n="w"/>
    <sdef n="adj"/>
  </sdefs>

  <pardefs>
    <pardef n="meh">
      <e><i>meh<s n="adj"/><t/><j/></i></e>
    </pardef>
  </pardefs>

  <section id="main" type="standard">

    <e c="override below rule if w before">
      <i><w/><s n="w"/><t/><j/></i>
      <i>D<s n="adj"/><t/><j/></i>
    </e>

    <e c="drop D and L→R">
      <p><l>D<t/><j/></l> <r></r></p>
      <p><l>L</l>           <r>R</r></p> <i><t/><j/></i>
    </e>
  </section>

</dictionary>

So we can view the FST:

$ lsx-comp lr apertium-nno-nob.nob-nno.lsx nob-nno.autoseq.bin
main@standard 14 19
$ lt-print nob-nno.autoseq.bin  > seq.att
$ printf 'read att seq.att\nview\n' | foma

[image: seq (the FST as rendered by foma)]

The problem seems to be that the paths for the two rules get merged at the beginning of the FST. Why does that happen?

The path for the second rule should just be D (no optional side-tracking into ANY_CHAR).
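To make the suspected difference concrete, here is a toy NFA in Python. It is purely illustrative: the state numbers, the `<ANY_CHAR>` label, and both transition tables are assumptions made up for this sketch, not the actual output of `lsx-comp`. State 3 stands for "we are on the second rule's path, having just read D".

```python
# Toy model only: states and labels are hypothetical, not read from seq.att.
ANY_CHAR = "<ANY_CHAR>"


def reachable(transitions, start, symbols):
    """Return the set of states reachable after consuming every symbol."""
    current = {start}
    for sym in symbols:
        nxt = set()
        for state in current:
            for label, target in transitions.get(state, []):
                # ANY_CHAR matches any single letter; other labels match literally.
                if label == sym or (label == ANY_CHAR and len(sym) == 1):
                    nxt.add(target)
        current = nxt
    return current


# Expected shape: the second rule's D is only available at the very start.
separate = {
    0: [(ANY_CHAR, 1), ("D", 3)],
    1: [(ANY_CHAR, 1), ("<w>", 2)],
}

# Suspected shape: the <w/> loop and the second rule share a state,
# so D can also be taken after some characters were already consumed.
merged = {
    0: [(ANY_CHAR, 1), ("D", 3)],
    1: [(ANY_CHAR, 1), ("<w>", 2), ("D", 3)],
}

print(3 in reachable(separate, 0, list("keptD")))  # False
print(3 in reachable(merged, 0, list("keptD")))    # True
```

In the `separate` table, D after `kept` can only be read as yet another ANY_CHAR, so the second rule never starts mid-word. In the `merged` table the same input reaches state 3, which is the kind of mid-word match seen in the first comment.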