Give chunks both a source and target side?

ftyers commented 4 years ago

It occurs to me that it would often be helpful to have both a source and a target side of chunks, e.g.

^aj<SV><tv><s_pl3><o_sg3>/querer<SV><tv><pri><p3><pl>{
    ^aj<V><tv><s_pl3><o_sg3>/querer<V><tv><3><4><5>{
        ^$
        ^$
        ^querer<vblex><3><4><5>$
    }$
    ^kamisaj<SV><tv><s_pl3><o_sg3>/matar<SV><tv><pri><p3><pl>{
        ^kamisaj<V><tv><s_pl3><o_sg3>/matar<V><tv><pri><p3><pl>{
            ^$
            ^$
            ^matar<vblex><3><4><5>$
        }$
        ^ri<SD>/el<SD><m><sg>{
            ^ant<SD>/ant<SD><2><3>{
                ^Lázaro<np><ant><m><sg>$
            }$
        }$
    }$
}$

This would allow us to check the original agreement (for determining complement type) but also link tags to the translated agreement.

I'd say that matching should probably be done only on the TL, but the SL would be there as essentially a place to filter source-language information up the tree.
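
To make the proposal concrete, here is a minimal Python sketch of a node that carries both sides and serialises to the `^source/target{...}$` shape used in the example above. The `Chunk` class and its field names are hypothetical, chosen only for illustration; they are not apertium-recursive's internal representation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Chunk:
    """Hypothetical two-sided node (illustration only)."""
    target: str                      # e.g. "el<SD><m><sg>"
    source: Optional[str] = None     # e.g. "ri<SD>"; None if no source side was kept
    children: List["Chunk"] = field(default_factory=list)

    def render(self) -> str:
        """Serialise as ^source/target{children}$; the braces are omitted for a leaf."""
        head = f"{self.source}/{self.target}" if self.source is not None else self.target
        if not self.children:
            return f"^{head}$"
        inner = "".join(child.render() for child in self.children)
        return f"^{head}{{{inner}}}$"


# The innermost determiner phrase from the example above.
leaf = Chunk(target="Lázaro<np><ant><m><sg>")
ant = Chunk(source="ant<SD>", target="ant<SD><2><3>", children=[leaf])
det = Chunk(source="ri<SD>", target="el<SD><m><sg>", children=[ant])
print(det.render())
```

Keeping the source as a separate field means the source-language agreement tags stay available higher up the tree without taking part in target generation.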

marcriera commented 4 years ago

This would also be helpful to match lexical units from the source side beyond the first level of chunking, which is not currently possible.

mr-martian commented 4 years ago

There are two ways I can approach this, neither of which is particularly difficult from an implementation perspective.

The first is to just add an instruction for setting the source-side of a chunk and then continue with the current approach of having patterns match source of LUs and target of chunks.

The other thing that I could do is change the pattern matcher so that it applies to both source and target. The downside of this is that it would be backwards-incompatible and any existing rtx files would have to be recompiled.

In either case the hardest part from my perspective would be determining the proper syntax for referring to these things in rules.

ftyers commented 4 years ago

I'm not sure I like the idea of referring to the source side in the matching rules. @MarcRiera, do you have a use case for that?

Given that we will be able to use <assert> to pick out the subset that we want a rule to apply to, is it really necessary?

I think of the SL side as basically providing "static" information that we need to maintain for checking grammaticality of the source parse, but that we don't need for target generation.

In terms of the format for specifying it, how about something like:

  <rule comment="n" firstChunk="N">
    <pattern>
      <pattern-item n="n"/>
    </pattern>
    <action>
      <call-macro n="f_set_chunk_name1"><with-param pos="1"/></call-macro>
      <call-macro n="f_concord1"><with-param pos="1"/></call-macro>
      <call-macro n="f_set_chunk_poss1"><with-param pos="1"/></call-macro>
      <out>
        <chunk>
          <source>
            <lit-tag v="N"/><clip pos="1" side="sl" part="whole"/>
          </source>
          <target>
            <lit-tag v="N"/>
            <var n="chunkGenero"/><var n="chunkNumero"/><var n="chunkPoss"/>
          </target>
          <contents>
            <lu><clip pos="1" side="tl" part="whole"/></lu>
          </contents>
        </chunk>
      </out>
    </action>
  </rule>

~I'm not sure I like the <label> part completely, that might need some iteration, but it's a first pass.~ Maybe something like this with <source> and <target> explicitly.

marcriera commented 4 years ago

@ftyers I was thinking specifically of lexically based conditions, such as the footwear example in the formalism:

footwear = boot sock shoe sandal;
((1.number = du) and (1.lem/tl in_caseless footwear))

The problem is that the nature of recursive transfer promotes the use of much shorter rules than shallow transfer, yet checking if something is in "footwear" can only be done when first converting lexical units to chunks.

I agree with the source side being "static". @mr-martian I would prefer a solution like the first option you mention, where a chunk could also inherit the source side, but only allow checking it in conditions when explicitly mentioning /sl. Mixing source and target for pattern matching sounds potentially troublesome.
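
As a rough illustration of why keeping the source side helps here, the Python sketch below evaluates a footwear-style condition against an explicitly named side. Everything in it is an assumption made for the example: nodes are modelled as plain dicts with `sl` and `tl` attribute maps, `clip` and `dual_footwear` are hypothetical helpers, and reading `number` from the same side as the lemma is a simplification rather than the formalism's actual semantics.

```python
FOOTWEAR = {"boot", "sock", "shoe", "sandal"}


def clip(node, attr, side="tl"):
    """Return an attribute from the requested side, or "" if that side was not kept."""
    return node.get(side, {}).get(attr, "")


def in_caseless(value, words):
    """Case-insensitive membership test, loosely mirroring in_caseless."""
    return value.lower() in {w.lower() for w in words}


def dual_footwear(node, side="tl"):
    # Rough analogue of ((1.number = du) and (1.lem/tl in_caseless footwear)),
    # with the side made an explicit argument (an assumption of this sketch).
    return clip(node, "number", side) == "du" and in_caseless(clip(node, "lem", side), FOOTWEAR)


# A lexical unit still has its lemma, so the check works at the first level.
lexical_unit = {"tl": {"lem": "Boot", "number": "du"}}

# A chunk with only a target side exposes its chunk name, not the lemma.
target_only_chunk = {"tl": {"lem": "N", "number": "du"}}

# A chunk that inherited the source side can still be checked when /sl is named.
two_sided_chunk = {"sl": {"lem": "boot", "number": "du"}, "tl": {"lem": "N", "number": "du"}}
```

With these inputs, `dual_footwear(lexical_unit)` and `dual_footwear(two_sided_chunk, side="sl")` both hold, while `dual_footwear(target_only_chunk)` does not, which is the limitation described above.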

mr-martian commented 4 years ago

Matching both sides wouldn't change what rules can do, but it would make lexicalized weights more powerful. https://wiki.apertium.org/wiki/Apertium-recursive#Lexicalized_Weights

I can definitely change <chunk> to contain source? target contents, i.e. an optional <source> followed by <target> and <contents>.

@MarcRiera, do you have any opinions on how to edit the source side in non-xml? Maybe $whole/sl=[1.lem/sl]@N.[1.tags/sl] for the equivalent of the xml example above?

Matching is currently done on source if it exists, otherwise target. Assuming no one has been creating empty chunks, I think it should work to change that to match on target if it has children and source if it doesn't.
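
The difference between the two selection rules can be sketched in Python as follows. The function names and the representation (strings for sides, a list for children) are hypothetical, and falling back to the target when a childless node has no source is an assumption of this sketch, not something stated above.

```python
from typing import Optional, Sequence


def current_match_side(source: Optional[str], target: str) -> str:
    """Current behaviour as described: match on source if it exists, otherwise target."""
    return source if source is not None else target


def proposed_match_side(source: Optional[str], target: str, children: Sequence) -> str:
    """Proposed behaviour: target for nodes with children, source for nodes without."""
    if children:
        return target
    # Assumption: a childless node without a source side falls back to its target.
    return source if source is not None else target


# A lexical unit (no children): both rules pick the source side.
lu_current = current_match_side("ri<det>", "el<det>")
lu_proposed = proposed_match_side("ri<det>", "el<det>", [])

# A chunk whose source side has been set: the rules now disagree.
chunk_current = current_match_side("ri<SD>", "el<SD><m><sg>")
chunk_proposed = proposed_match_side("ri<SD>", "el<SD><m><sg>", ["child"])
```

For chunks that never had a source side, both functions return the target, which is why the change should only be visible once rules start setting the source side (or for empty chunks).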

marcriera commented 4 years ago

@mr-martian Indeed, lexicalized weights would be more powerful, but there is the slight possibility of completely unrelated lexical units with the same spelling in source and target that could get mixed up if we decide to match both source and target in the same way.

Your non-XML proposal looks good to me.

mr-martian commented 4 years ago

As of 292de2b and fcb5809, you can now set the source side of chunks. For full symmetry, you can also set the reference side.

@ftyers The syntax you suggested is now the only syntax for chunks.

<chunk>
  <source> <!-- optional -->
    <!-- clip, var, or anything else that evaluates to a string -->
  </source>
  <target>
    <!-- clip, var, or anything else that evaluates to a string -->
  </target>
  <reference> <!-- optional -->
    <!-- clip, var, or anything else that evaluates to a string -->
  </reference>
  <contents>
    <!-- lu, mlu, chunk, b -->
  </contents>
</chunk>

@MarcRiera Parsing ended up being a little tricky, so currently the syntax is /sl={lu} and the cleanest way I've come up with to write what I had above is

whatever: tags;
...
[ /sl=*(whatever)[lem=1.lem/sl, tags=1.tags/sl]]

Note that /sl=1 will compile, but it means "set the surface of this chunk to the final output of 1". If you don't care about the lemma you can do /sl=whatever@[1.tags/sl] because the @ syntax doesn't allow non-literal lemmas.

In any event, I think I'll put this in a separate issue.

ftyers commented 4 years ago

I've been thinking about it, and I think it makes more sense from a formalism point of view to only ever do pattern matching (for both terminals and chunks) on the source side. e.g. <def-cat>s should apply to the source side for both terminals and non-terminals.

However,

<assert> can refer to any of the sides;
the target side is the one that numbered variables (e.g. <3>) are linked to.
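
As an illustration of the second point, the Python sketch below resolves numbered tags in a child against the tags on the target side of its parent chunk, using the strings from the example at the top of this issue. The helper names are hypothetical, and the positional lookup (`<3>` means the parent's third tag) is my reading of the linked-tag convention, not code taken from apertium-recursive.

```python
import re


def tags_of(side: str):
    """Return the tag names of one side, e.g. "el<SD><m><sg>" -> ["SD", "m", "sg"]."""
    return re.findall(r"<([^<>]+)>", side)


def resolve_numbered_tags(child: str, parent_target: str) -> str:
    """Replace <n> in the child with the n-th tag (1-based) of the parent's target side."""
    parent_tags = tags_of(parent_target)

    def substitute(match):
        index = int(match.group(1))
        if 1 <= index <= len(parent_tags):
            return f"<{parent_tags[index - 1]}>"
        return match.group(0)  # leave out-of-range references untouched

    # Only purely numeric tags are treated as links; <p3> is an ordinary tag.
    return re.sub(r"<(\d+)>", substitute, child)


verb = resolve_numbered_tags("querer<vblex><3><4><5>", "querer<SV><tv><pri><p3><pl>")
det = resolve_numbered_tags("ant<SD><2><3>", "el<SD><m><sg>")
print(verb)
print(det)
```

Because the lookup only reads the parent's target side, the source side can stay fixed for matching and assertions without affecting how agreement is passed down.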

mr-martian commented 4 years ago

I think you're right, and making that change would be fairly straightforward.

This would be a slightly backwards-incompatible change in the compiled files (they explicitly clip from target), so I'm going to hold off until I've finished up the other stuff on the todo list for the next release.

ftyers commented 4 years ago

Ok, I'll hold off on rewriting the quc-spa rules for now, but I'll be thinking about them :)

mr-martian commented 4 years ago

Turns out it's far less straightforward than I thought, and there are several questions connected to it, so I'm closing this as done and opening a new issue for surface as matching.

apertium / apertium-recursive

Give chunks both a source and target side? #59