apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

Carefulcase eats words it can't generate #35

Open unhammer opened 6 years ago

unhammer commented 6 years ago

If the dictionary has

<?xml version="1.0" encoding="UTF-8"?>
<dictionary>
 <alphabet/>
 <sdefs>
   <sdef n="n"/>
   <sdef n="m"/>
   <sdef n="pl"/>
   <sdef n="def"/>
 </sdefs>
 <section id="main" type="standard">

<e><p><l>kakene</l><r>kake<s n="n"/><s n="m"/><s n="pl"/><s n="def"/></r></p></e>

<e><p><l>pc-ane</l><r>pc<s n="n"/><s n="m"/><s n="pl"/><s n="def"/></r></p></e>
<e><p><l>PC-ane</l><r>PC<s n="n"/><s n="m"/><s n="pl"/><s n="def"/></r></p></e>

 </section>
</dictionary>

then we get

$ echo '^kake<n><m><pl><def>$ ^KAKE<n><m><pl><def>$ ^kake<n><m><pl><def>$'|lt-proc -C nob.autogen.bin 
kakene  kakene

I would like it to fall back to "normal" generation for words whose exact case it can't find, i.e.

$ echo '^kake<n><m><pl><def>$ ^KAKE<n><m><pl><def>$ ^kake<n><m><pl><def>$'|lt-proc -C nob.autogen.bin 
kakene KAKENE kakene

while still retaining the -C functionality for words it can find exact matches for:

$ echo '^PC<n><m><pl><def>$ ^pc<n><m><pl><def>$' | lt-proc -C nob.autogen.bin
PC-ane pc-ane
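
The requested behaviour can be sketched as a two-step lookup. This is only a toy Python model, not lttoolbox code: the `GENERATOR` dict stands in for the compiled nob.autogen.bin, and `generate_carefulcase` and `apply_case` are hypothetical names made up for the illustration.

```python
# Toy model of the requested fallback; not how lt-proc is implemented.
# GENERATOR stands in for the compiled dictionary (analysis -> surface form).
GENERATOR = {
    "kake<n><m><pl><def>": "kakene",
    "pc<n><m><pl><def>": "pc-ane",
    "PC<n><m><pl><def>": "PC-ane",
}


def apply_case(lemma: str, surface: str) -> str:
    """Copy the case pattern of the input lemma onto the generated form."""
    if len(lemma) > 1 and lemma.isupper():
        return surface.upper()
    if lemma[:1].isupper():
        return surface[:1].upper() + surface[1:]
    return surface


def generate_carefulcase(lexical_unit: str) -> str:
    """Prefer an exact-case match, else fall back to normal generation."""
    if lexical_unit in GENERATOR:
        return GENERATOR[lexical_unit]
    lemma, sep, tags = lexical_unit.partition("<")
    fallback = GENERATOR.get(lemma.lower() + sep + tags)
    if fallback is None:
        return "#" + lemma  # mark the word as not generated
    return apply_case(lemma, fallback)
```

With this model, `generate_carefulcase("KAKE<n><m><pl><def>")` falls back to the lowercase entry and re-applies the uppercase pattern, while the `PC` and `pc` entries are still matched exactly.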

jimregan commented 6 years ago

I lost my laptop three weeks ago, so it'll be a while before I can look at this.

unhammer commented 6 years ago

ouch :((

unhammer commented 5 years ago

I added some tests in fd6e6dc. The problem shows up when we start generating ^KAKE<n><f><pl><def>$ and there is a possible path that starts with ^K but only leads to other analyses (e.g. ^KK<np>$). We then end up with #KAKE, where we should have gone back and tried the lowercased analysis.

If there are no such garden paths, ^KAKE<n><f><pl><def>$ does give a result; see the difference between the two test dix files added in https://github.com/apertium/lttoolbox/commit/fd6e6dca7562200e182d77b65bc759380d95df08#diff-839e968af7bf80a08ea4d97247cbe7fdR1
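
To make the garden-path problem concrete, here is a toy Python model. It is a sketch under assumptions, not lttoolbox's actual data structures: the transducer is reduced to a plain trie, and `build_trie`, `greedy` and `backtracking` are hypothetical names. `greedy` commits to the exact-case arc whenever one exists; `backtracking` retries in lowercase when that arc dead-ends.

```python
# Toy model of the garden-path problem; not lttoolbox's real transducer.
# Entries map a symbol sequence (lemma characters + tags) to a surface form.
ENTRIES = {
    ("k", "a", "k", "e", "<n>", "<f>", "<pl>", "<def>"): "kakene",
    ("K", "K", "<np>"): "KK",
}


def build_trie(entries):
    root = {}
    for symbols, surface in entries.items():
        node = root
        for sym in symbols:
            node = node.setdefault(sym, {})
        node["$"] = surface  # "$" marks a final state
    return root


def greedy(trie, symbols):
    """Commit to the exact-case arc whenever one exists (the failing strategy)."""
    node = trie
    for sym in symbols:
        if sym in node:
            node = node[sym]
        elif sym.lower() in node:
            node = node[sym.lower()]
        else:
            return None
    return node.get("$")


def backtracking(trie, symbols):
    """Try the exact-case arc first, but retry in lowercase if it dead-ends."""
    if not symbols:
        return trie.get("$")
    head, rest = symbols[0], symbols[1:]
    # dict.fromkeys de-duplicates while keeping the exact-case arc first.
    for candidate in dict.fromkeys((head, head.lower())):
        if candidate in trie:
            result = backtracking(trie[candidate], rest)
            if result is not None:
                return result
    return None
```

In this model the `KK<np>` entry makes `greedy` follow the `K` arc and fail on `KAKE`, whereas `backtracking` recovers the lowercase path. Restoring the uppercase pattern on the result is left out to keep the sketch short.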

unhammer commented 1 year ago

@mr-martian Do you think this is solvable? I'd love to have a solution for this (but in bilingual mode, lt-proc -b), so that I can e.g. have a dix with

<e>       <re>[a-zA-Z]+</re><p><l></l><r><s n="np"/></r></p></e>
<e>       <i>med</i>        <p><l></l><r><s n="pr"/></r></p></e>

and get

$ echo '^Med<pr>$ ^AbCd<np>$' |lt-proc -C -b nob-nno.autogen.bin
^Med<pr>/Med$ ^AbCd<np>/AbCd$

Currently, we can get either the one or the other:

$ echo '^Med<pr>$ ^AbCd<np>$' |lt-proc  -C tmp.bin # eats Med
 AbCd

$ echo '^Med<pr>$ ^AbCd<np>$' |lt-proc  -b tmp.bin # includes extra "Abcd"
^Med<pr>/Med$ ^AbCd<np>/AbCd/Abcd$

$ echo '^Med<pr>$ ^AbCd<np>$' |lt-proc  -c -g tmp.bin # fails to generate Med since lemma is lowercase
#Med AbCd
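
The wished-for `-C -b` behaviour could be modelled roughly as below. This is a hedged Python sketch, not a proposal for the actual implementation: `ENTRIES` is a hypothetical stand-in for the two dix entries above, and `matches`, `restore_case` and `translate` are made-up names. An exact-case match wins outright; the lowercased lookup is only tried when no exact match exists.

```python
import re

# Hypothetical stand-ins for the two dix entries: (lemma pattern, tags).
ENTRIES = [
    (re.compile(r"[a-zA-Z]+"), "<np>"),
    (re.compile(r"med"), "<pr>"),
]


def matches(lemma, tags):
    """True if some entry has these tags and its pattern covers the lemma."""
    return any(entry_tags == tags and pattern.fullmatch(lemma)
               for pattern, entry_tags in ENTRIES)


def restore_case(source, target):
    """Copy the case pattern of the source lemma onto the target."""
    if len(source) > 1 and source.isupper():
        return target.upper()
    if source[:1].isupper():
        return target[:1].upper() + target[1:]
    return target


def translate(lexical_unit):
    body = lexical_unit.strip("^$")
    lemma, sep, rest = body.partition("<")
    tags = sep + rest
    if matches(lemma, tags):
        target = lemma  # exact case found: no lowercased alternative added
    elif matches(lemma.lower(), tags):
        target = restore_case(lemma, lemma.lower())
    else:
        return "^" + body + "$"  # unknown-word marking is out of scope here
    return "^" + body + "/" + target + "$"
```

Here `translate("^Med<pr>$")` takes the lowercase fallback and `translate("^AbCd<np>$")` takes the exact match, so neither word is eaten and no extra `Abcd` is produced.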

Possibly related to https://github.com/apertium/lttoolbox/issues/167