apertium / apertium-kaz

Apertium linguistic data for Kazakh
https://apertium.github.io/apertium-kaz/
GNU General Public License v3.0

kaz-morph is broken #20

Closed: jonorthwash closed this issue 3 years ago

jonorthwash commented 4 years ago

What's broken

The kaz-morph mode uses lt-proc and kaz.automorf.bin.

However, it only returns this:

Error: Invalid dictionary (hint: the left side of an entry is empty)

Why it's broken

This appears to be because of some ~empty paths, e.g.

0   2   @0@ <ltr>   0.000000
2   83630   @0@ @0@ 0.000000
2   848 @0@ @0@ 0.000000
848 0.000000
83630   0.000000
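
One way to check an ATT-format dump for this kind of path is sketched below in Python. It is only an illustration, not part of lt-proc or HFST: the function name is made up here, and it assumes that the third column of each arc line is the "left side" the error message refers to, and that `@0@` marks epsilon as in the excerpt above.

```python
from collections import defaultdict, deque

EPSILON = "@0@"  # epsilon symbol as printed in the excerpt

def accepts_empty_left_side(att_text, start="0"):
    """Return True if a final state is reachable using only arcs whose
    left-side symbol (assumed: third column) is epsilon."""
    eps_arcs = defaultdict(list)
    finals = set()
    for line in att_text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            # Arc line: source, target, left, right[, weight]
            source, target, left = fields[:3]
            if left == EPSILON:
                eps_arcs[source].append(target)
        elif fields:
            # Final-state line: state[, weight]
            finals.add(fields[0])
    # Breadth-first search over epsilon-only arcs.
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        if state in finals:
            return True
        for target in eps_arcs[state]:
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return False

EXCERPT = """\
0   2   @0@ <ltr>   0.000000
2   83630   @0@ @0@ 0.000000
2   848 @0@ @0@ 0.000000
848 0.000000
83630   0.000000
"""
print(accepts_empty_left_side(EXCERPT))  # True for the excerpt above
```

Splitting on whitespace is a simplification: it would misread a dump in which a symbol is itself a literal space.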

These empty paths appear to be due to the guesser being intersected with kaz@Cyrl-kaz@Arab.hfst. Relevant excerpts below:

apertium-kaz.kaz.lexc:

LEXICON LTR

%<ltr%>: # ;

LEXICON Guesser

<( а | ә | б | в | г | ғ | д | е | ё | ж | з | и | і | й | к | қ | л | м | н |
   ң | о | ө | п | р | с | т | у | ұ | ү | ф | х | һ | ц | ч | ш | щ | ь | ы |
   ъ | э | ю | я )> LTR ;

apertium-kaz.Cyrl-Arab.twol:

 ь:0
 ы:ى
 ъ:0
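
To make the hypothesis concrete, here is a small Python sketch (not Apertium code) that applies the three Cyrl-Arab pairs quoted above to the guesser alphabet and lists the letters that would be deleted outright. The mappings for all other letters are not shown in the excerpt, so they are deliberately left out; the function name is hypothetical.

```python
# Guesser alphabet copied from the LEXICON Guesser excerpt.
GUESSER_LETTERS = (
    "а ә б в г ғ д е ё ж з и і й к қ л м н "
    "ң о ө п р с т у ұ ү ф х һ ц ч ш щ ь ы "
    "ъ э ю я"
).split()

# Only the three pairs quoted from apertium-kaz.Cyrl-Arab.twol.
# A "0" on the right-hand side of a twol pair means deletion,
# represented here as the empty string.
CYRL_TO_ARAB_EXCERPT = {"ь": "", "ы": "ى", "ъ": ""}

def letters_that_vanish(letters, mapping):
    """Return the letters whose converted form is the empty string."""
    return [c for c in letters if mapping.get(c) == ""]

print(letters_that_vanish(GUESSER_LETTERS, CYRL_TO_ARAB_EXCERPT))
# ['ь', 'ъ']
```

If the guesser really is composed with these pairs, a single-letter guess for ь or ъ would end up with nothing on the converted side, which matches the kind of empty path shown earlier.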

What we should do about it

Ideally, I think we need to find some way to not intersect the guesser part of the transducer with Cyrl-Arab. Alternatively, we could tweak the lexical conversion to not allow paths that would just be 0 (though I'm not positive how to do that upon first contemplation).

(Thanks to @mr-martian for helping me figure out why lt-proc was failing.) @IlnarSelimcan @ftyers

IlnarSelimcan commented 4 years ago

I think first of all we should revert the master branch to 40f49a0da2b4b8fc4214d053f5035ee697a1fe9e .

Do you mind if I do:

git checkout 40f49a0da2b4b8fc4214d053f5035ee697a1fe9e .
git commit

(see https://stackoverflow.com/questions/4114095/how-do-i-revert-a-git-repository-to-a-previous-commit) and then apply Tino's changes?

I had assumed that the fail was because of the recent commits regarding the multiscript transducers, but even the snapshot aab4808acc1dd3625747d2722bf78fee959afd83 gives this error.

jonorthwash commented 4 years ago

I had assumed that the fail was because of the recent commits regarding the multiscript transducers

Oh! I thought so too. I guess we need to debug further...

ftyers commented 3 years ago

I went through all the commits, from the latest back to the earliest:

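# Walk every commit reachable from HEAD, newest first. git rev-list prints
# bare hashes, so commit messages that contain the word "commit" cannot be
# mistaken for a hash the way they can with `git log | grep commit`.
for i in $(git rev-list HEAD); do
    echo "$i" | tee -a log    # show the hash and record it in the log
    git checkout "$i"
    make
    echo "сәлем" | apertium -d . kaz-morph >> log 2>&1
done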

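Seems like the offending commit was in 2018. In the log, 1ca61f1 gives the error, while the next commit listed, 638b3ed, still produces an analysis: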

1ca61f138745784ee4a5e6493009ed5075a63e25
Error: Invalid dictionary (hint: the left side of an entry is empty)
638b3ed5efb8a2508635689ddb13f2b6dc719641
^сәлем/сәлем<ij>/сәлем<n><nom>/сәлем<n><attr>/сәлем<n><nom>+е<cop><aor><p3><pl>/сәлем<n><nom>+е<cop><aor><p3><sg>$^./.<sent>$
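
A log in this shape (a 40-character hash line followed by whatever kaz-morph printed) can also be scanned mechanically for the point where the error first appears. The Python sketch below is only an illustration with hypothetical function names. It assumes the log lists commits newest first, as the loop above writes them, and it treats any output without the error string as working, which would be wrong for commits where make itself failed.

```python
import re

HASH = re.compile(r"[0-9a-f]{40}")
ERROR_MARK = "Error: Invalid dictionary"

def parse_log(text):
    """Split the log into (hash, output) pairs, in file order."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if HASH.fullmatch(line):
            entries.append((line, []))
        elif entries and line:
            entries[-1][1].append(line)
    return [(h, "\n".join(out)) for h, out in entries]

def first_broken_commit(text):
    """Return (broken, working) for the first adjacent pair where the
    newer commit shows the error and the next-older one does not."""
    entries = parse_log(text)
    for (newer, newer_out), (older, older_out) in zip(entries, entries[1:]):
        if ERROR_MARK in newer_out and ERROR_MARK not in older_out:
            return newer, older
    return None

SAMPLE = """\
1ca61f138745784ee4a5e6493009ed5075a63e25
Error: Invalid dictionary (hint: the left side of an entry is empty)
638b3ed5efb8a2508635689ddb13f2b6dc719641
^сәлем/сәлем<ij>/сәлем<n><nom>$
"""
print(first_broken_commit(SAMPLE))
```

The sample output line is shortened from the real analysis for readability.
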
$ git diff 638b3ed5efb8a2508635689ddb13f2b6dc719641 1ca61f138745784ee4a5e6493009ed5075a63e25
diff --git a/apertium-kaz.kaz.lexc b/apertium-kaz.kaz.lexc
index 87bdcaf..46db838 100644
--- a/apertium-kaz.kaz.lexc
+++ b/apertium-kaz.kaz.lexc
@@ -2228,37 +2228,11 @@ InterrogativePronouns ;
 алтау:алтау NUM-COLL ;
 жетеу:жетеу NUM-COLL ;

-
-C NUM-ROMAN ; ! ""
-D NUM-ROMAN ; ! ""
-DC NUM-ROMAN ; ! ""
-I NUM-ROMAN ; ! ""
-II NUM-ROMAN ; ! ""
-III NUM-ROMAN ; ! ""
-IV NUM-ROMAN ; ! ""
-V NUM-ROMAN ; ! ""
-VI NUM-ROMAN ; ! ""
-VII NUM-ROMAN ; ! ""
-VIII NUM-ROMAN ; ! ""
-IX NUM-ROMAN ; ! ""
-L NUM-ROMAN ; ! ""
-M NUM-ROMAN ; ! ""
-X NUM-ROMAN ; ! ""
-XI NUM-ROMAN ; ! ""
-XII NUM-ROMAN ; ! ""
-XIII NUM-ROMAN ; ! ""
-XIV NUM-ROMAN ; ! ""
-XV NUM-ROMAN ; ! ""
-XVI NUM-ROMAN ; ! ""
-XVII NUM-ROMAN ; ! ""
-XVIII NUM-ROMAN ; ! ""
-XIX NUM-ROMAN ; ! ""
-XX NUM-ROMAN ; ! ""
-XXI NUM-ROMAN ; ! ""
-
 неше:неше NUM-ITG ;
 қанша:қанша NUM-ITG ;

+<(M | D | C | L | X | V | I)+> NUM-ROMAN ; ! ""
+

 !=============!
  LEXICON Nouns
@@ -7873,6 +7847,7 @@ iш% жүргізетін:iш% жүргізетін N1 ; !
 сәбіз:сәбіз N1 ; ! "carrots"
 сәжде:сәжде N1 ; ! ""
 сәйгүлік:сәйгүлік N1 ; !"Use/MT"
+сәл:сәл N1 ; ! "~ біраз"
 сәлем:сәлем N1 ; ! "peace,greeting"
 сәлемдеме:сәлемдеме N1 ; ! ""
 сәлемдесу:сәлемдесу N1 ; ! ""
@@ -33029,6 +33004,7 @@ retroactive:retroactive A1 ; !"Use/MT"
 қолма%-қолсыз:қолма%-қолсыз ADV ; ! "" ! Use/MT
 қып:қып ADV ; ! ""
 міне:міне ADV ; ! ""
+мінекі:мінекі ADV ; ! ""
 мүлдем:мүлдем ADV ; ! ""
 мүлде:мүлде ADV ; ! ""
 мүмкін% емес:мүмкін% емес ADV ; !"Use/MT" ! невозможно
@@ -33211,8 +33187,6 @@ retroactive:retroactive A1 ; !"Use/MT"
 шын% жүректен:шын% жүректен ADV ; ! 
 кілең:кілең ADV ; ! 

-
-
 ! Adverbs wich take -KI and become an attr/adj
 ! ============================================

@@ -37517,6 +37491,8 @@ retroactive:retroactive A1 ; !"Use/MT"
 ысыл:ысыл V-IV ; ! "to acquire a habit"
 ысыр:ысыр V-TV ; ! "to move"
 ышқын:ышқын V-IV ; ! "to be in an agonizing condition"
+сабақта:сабақта V-TV ; ! "go on (talking)"
+есіне:есіне V-TV ; ! "yawn"

 эвакуацияла:эвакуацияла V-TV ; ! "to evacuate"
 экстрадицияла:экстрадицияла V-TV ; ! ""
@@ -39355,6 +39331,7 @@ retroactive:retroactive A1 ; !"Use/MT"
 мә:мә INTERJ ; ! "here you go / take"
 мейлі:мейлі INTERJ ; ! "okay/fine"
 міне:міне INTERJ ; ! "here is, вот"
+мінекі:мінекі INTERJ ; ! "here is, вот"
 мына:мына INTERJ ; ! "here is, вот"
 ойбай:ойбай INTERJ ; ! "" 
 қайырлы% таң:қайырлы% таң INTERJ ; ! "good morning"

IlnarSelimcan commented 3 years ago

This should be fixed/worked around now: https://github.com/apertium/apertium-kaz/pull/23