apertium / lttoolbox

Finite state compiler, processor and helper tools used by apertium
http://wiki.apertium.org/wiki/Lttoolbox
GNU General Public License v2.0

letter case issue in post-generation #123

Closed: hectoralos closed this issue 2 years ago

hectoralos commented 3 years ago

In Gascon Occitan there is the (enunciative) adverb "e". The problem is that (1) it must be dropped if the following word begins with a vowel, and (2) it often occurs at the beginning of a sentence, and is therefore in upper case. Since the rule depends on the letter that follows, it seems this can only be resolved in post-generation. So I tried adding a rule like:

        <e> 
                <p> 
                        <l><a/>e<b/>a</l>
                        <r>a</r>
                </p>
        </e>

But it doesn't work:

$ echo '~e aga' | lt-proc -p oci.autoprepgen.bin
a\/aga
$ echo '~E aga' | lt-proc -p oci.autoprepgen.bin
a\/aga

I have tried several hacks, but they all have problems. Workarounds may exist, but because there is no way to specify whether the case should be taken from the first or the second element of the expression, generating the correct capitalisation seems unsolvable.

Any ideas?

mr-martian commented 3 years ago

<e>
  <p>
    <l><a/>e<b/>a</l>
    <r>a</r>
  </p>
</e>
<e>
  <p>
    <l><a/>E<b/>a</l>
    <r>A</r>
  </p>
</e>

$ echo '~e aga' | lt-proc -p blah.bin
a\/aga
$ echo '~E aga' | lt-proc -p blah.bin
A\/aga
$ echo '~E Aga ~e Aga' | lt-proc -p blah.bin
A\/Aga A\/Aga

How's that?

hectoralos commented 2 years ago

Sorry, I missed your comment. The main problem persists: a\/aga. I can't understand this doubled result.

mr-martian commented 2 years ago

https://github.com/apertium/lttoolbox/blob/e7418efeae8ac7e234ba39d1336b98d7650c5140/lttoolbox/fst_processor.cc#L1949-L1964

It looks to me like the issue is in these lines, but I have absolutely no idea what the point of this section is.

mr-martian commented 2 years ago

I've figured out what's causing this, but not why or how to fix it.

Whoever wrote the postgenerator seems to have assumed that rules would be reading the end of one word and the beginning of the next but only modifying the first word. So postgen hops backward to have the matched rule only affect the previous word.

What went wrong here is that it does that hopping back by assuming that you only want to look at the first letter of the next word, so it hops back 2 characters to land just before the space. However, here we're deleting the preceding word so hopping back 2 puts us on / due to how the internals of lt-proc work.

A temporary workaround for this would be to add <re>[a-zA-Z]</re> at the end of the rule so that output will be 2 characters instead of 1.
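
A sketch of what that workaround might look like, applied to the rule from the original report. This is untested: placing `<re>` after `<p>` inside the entry follows the usual dix entry layout, and the character class is the one suggested above, so it only covers unaccented ASCII letters.

```xml
<!-- Sketch only: the rule from the original report with the suggested <re> appended. -->
<e>
  <p>
    <l><a/>e<b/>a</l>
    <r>a</r>
  </p>
  <!-- Consumes one more letter of the following word, so the rule's output
       is 2 characters long instead of 1. -->
  <re>[a-zA-Z]</re>
</e>
```

A word whose second letter falls outside `[a-zA-Z]` (an accented vowel, for example) would presumably not match this entry, so the character class may need widening for real data.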

hectoralos commented 2 years ago

This change seems to cause a lot of problems, as I see in Beta. If you translate "la fin, La fin. Vers la fin" from French into Occitan, you get "la fin, Detla fin. Cap a detla fin". On my system (where Apertium has not been updated) I get:

$ echo "la fin, La fin. Vers la fin" | apertium -d . fra-oci-dgen
~detla fin, ~Detla fin. Cap ~a ~detla fin
$ echo "la fin, La fin. Vers la fin" | apertium -d . fra-oci
la fin, La fin. Cap a la fin

unhammer commented 2 years ago

So this seems like two regressions,

  1. it's no longer casefolding:
    $ cat ost.dix
<?xml version="1.0" encoding="utf-8"?>
<dictionary>
  <alphabet/>

  <pardefs>
    <pardef n="quotes_empty">
      <e><i></i></e>
      <e><i>"</i></e>
    </pardef>
  </pardefs>

  <section id="main" type="standard">

    <e>
      <par n="quotes_empty"/>
      <p>
        <l><a/>detla</l>
        <r>la</r>
      </p>
    </e>

    <e>
      <p>
        <l><a/>a<b/><a/></l>
        <r>a<b/></r>
      </p>
    </e>

  </section>

</dictionary>
$ lt-comp lr ost.dix ost.bin
main@standard 8 8
$ echo ~detla |lt-proc -p ost.bin
la
$ echo ~Detla |lt-proc -p ost.bin
Detla
  2. perhaps this is the rerunning you mentioned @mr-martian where it used to backtrack to j-2:
    $ echo ~a ~detla |lt-proc -p ost.bin
    a detla

unhammer commented 2 years ago

(If these are difficult, perhaps we should revert #144 until they are fixed?)

mr-martian commented 2 years ago

Added case back in 5d490e7 - that was just me forgetting.

As for the backtracking, I can put that back in, but especially in the example you give it really seems to me like a bug, since what's being backtracked over is not a common suffix so you're just discarding part of the output.

mr-martian commented 2 years ago

@hectoralos what is the rule that should be controlling ~a in your example?

hectoralos commented 2 years ago

~a is a preposition that becomes e.g. al and als when it comes before ~detlo or ~detlos and merges with them:

$ echo "au cheval, aux chevaux" | apertium -d . fra-oci-dgen
~a ~detlo caval, ~a ~detlos cavals
$ echo "au cheval, aux chevaux" | apertium -d . fra-oci
al caval, als cavals

It is quite similar for all Romance languages I have in mind.

I don't know if this is what you are asking...

mr-martian commented 2 years ago

I mean I want to see what you wrote in the postgen file.

hectoralos commented 2 years ago

I didn't write this, but the relevant lines for Languedocian Occitan seem to be these:

        <e>
                <p>
                        <l><a/>a<b/><a/>detlo<b/></l>
                        <r>al<b/></r>
                </p>
        </e>
        <e>
                <p>
                        <l><a/>a<b/><a/>detlos<b/></l>
                        <r>als<b/></r>
                </p>
        </e>
        <e>
                <p>
                        <l><a/>a<b/><a/>detlo<b/></l>
                        <r>a<b/>l'</r>
                </p>
                <par n="vocal_general"/>
        </e>

unhammer commented 2 years ago

so

        <e>
                <p>
                        <l><a/>a<b/><a/>detla<b/></l>
                        <r>al<b/></r>
                </p>
        </e>

is missing @hectoralos (maybe that <r> should be something else?).

hectoralos commented 2 years ago

It shouldn't be this way, but I don't think this is the point. With the old version, the fra-oci translator works for both masculine and feminine. This is an input example:

le lit, Le lit. Vers le lit. Maison du lit.
l'ami, L'ami. Vers l'ami. Maison de l'ami.
la fin, La fin. Vers la fin. Maison de la fin.
l'amie, L'amie. Vers l'amie. Maison de l'amie.

les lits, Les lits. Vers les lits. Maison des lits.
les amis, Les amis. Vers les amis. Maison des amis.
les fins, Les fins. Vers les fins. Maison des fins.
les amies, Les amies. Vers les amies. Maison des amies.

These are the outputs on my PC:

$ cat zzz | apertium -d . fra-oci-dgen
~detlo lièch, ~Detlo lièch. Cap ~a ~detlo lièch. Ostal ~de ~detlo lièch.
~detlo amic, ~Detlo amic. Cap ~a ~detlo amic. Ostal ~de ~detlo amic.
~detla fin, ~Detla fin. Cap ~a ~detla fin. Ostal ~de ~detla fin.
~detla amiga, ~Detla amiga. Cap ~a ~detla amiga. Ostal ~de ~detla amiga.

~detlos lièchs, ~Detlos lièchs. Cap ~a ~detlos lièchs. Ostal ~de ~detlos lièchs.
~detlos amics, ~Detlos amics. Cap ~a ~detlos amics. Ostal ~de ~detlos amics.
~detlas fins, ~Detlas fins. Cap ~a ~detlas fins. Ostal ~de ~detlas fins.
~detlas amigas, ~Detlas amigas. Cap ~a ~detlas amigas. Ostal ~de ~detlas amigas.
$ cat zzz | apertium -d . fra-oci
lo lièch, Lo lièch. Cap al lièch. Ostal del lièch.
l'amic, L'amic. Cap a l'amic. Ostal de l'amic.
la fin, La fin. Cap a la fin. Ostal de la fin.
l'amiga, L'amiga. Cap a l'amiga. Ostal de l'amiga.

los lièchs, Los lièchs. Cap als lièchs. Ostal dels lièchs.
los amics, Los amics. Cap als amics. Ostal dels amics.
las fins, Las fins. Cap a las fins. Ostal de las fins.
las amigas, Las amigas. Cap a las amigas. Ostal de las amigas.

Here is the output from Beta:

lo lièch, Detlo lièch. Cap al lièch. Ostal del lièch.
l'amic, Detlo amic. Cap a l'amic. Ostal de l'amic.
la fin, Detla fin. Cap a detla fin. Ostal de la fin.
l'amiga, Detla amiga. Cap a detla amiga. Ostal de l'amiga.

los lièchs, Detlos lièchs. Cap als lièchs. Ostal dels lièchs.
los amics, Detlos amics. Cap als amics. Ostal dels amics.
las fins, Detlas fins. Cap a detlas fins. Ostal de las fins.
las amigas, Detlas amigas. Cap a detlas amigas. Ostal de las amigas.

I can imagine that, for some reason, explicit rules for ~a ~detla were not needed before (I can't find them in the code), but now they are. If that is the problem, I can accept it and add a few rules. What is certainly not working now is the capitalisation: Detlo, Detla, Detlos, Detlas. Once it's fixed, we'll see if the other problems remain.

mr-martian commented 2 years ago

I ran it locally and got

lo lièch, Lo lièch. Cap al lièch. Ostal del lièch.
l'amic, L'amic. Cap a l'amic. Ostal de l'amic.
la fin, La fin. Cap a detla fin. Ostal de la fin.
l'amiga, L'amiga. Cap a detla amiga. Ostal de l'amiga.

los lièchs, Los lièchs. Cap als lièchs. Ostal dels lièchs.
los amics, Los amics. Cap als amics. Ostal dels amics.
las fins, Las fins. Cap a detlas fins. Ostal de las fins.
las amigas, Las amigas. Cap a detlas amigas. Ostal de las amigas.

The only difference I see is a detla and a detlas. The current output of Beta is the same as this, so I think my case commit hadn't reached Beta when you tried.

I'll see if I can figure out what's going wrong with ~a ~detlas.

hectoralos commented 2 years ago

So, seemingly Beta had not yet been updated with the last change (by the way, the oci-fra data hasn't been updated there for several days). If the only remaining problem is ~detla and ~detlas, I'll modify the post-oci.dix file, since it is strange that I couldn't find anything for ~a ~detla and ~a ~detlas in it.

hectoralos commented 2 years ago

Unfortunately, I updated my system and I get tons of detlo, detla, etc. in the output. Seemingly the patch is today's. The result is horrible (((

As for the seemingly missing rules for ~a ~detla and ~a ~detlas, I added them and found them to be counterproductive. The problem arises with unknown proper names like La BlaBlaBla. Without the rule we get a La BlaBlaBla; with the rule, a la BlaBlaBla. This is not a terribly important problem, but it seems to be the reason there were no explicit rules: previously ~a and ~detla could be resolved independently, and now it seems they cannot when one follows the other.
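
An explicit rule of the kind described above might look like the following sketch. It is modelled on the existing ~a ~detlo entries quoted earlier in the thread and is hypothetical: it is not the actual text that was added to post-oci.dix, and the `<r>` side is an assumption.

```xml
<!-- Hypothetical rule, modelled on the ~a ~detlo entries; not taken from post-oci.dix. -->
<e>
  <p>
    <l><a/>a<b/><a/>detla<b/></l>
    <r>a<b/>la<b/></r>
  </p>
</e>
```

Because the whole match is rewritten as one unit, the capitalisation of ~Detla is not carried over to the output, which is consistent with the "a la" result reported for proper names.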

mr-martian commented 2 years ago

Ok, I'm traveling relatively soon and might not be able to properly solve this at present, so I'm just going to revert all postgen-related changes and try again in a couple weeks.