apertium / apertium-kaz

Apertium linguistic data for Kazakh
https://apertium.github.io/apertium-kaz/
GNU General Public License v3.0

transducer no longer meets Apertium Turkic standards #15

Open jonorthwash opened 5 years ago

jonorthwash commented 5 years ago

The issue with the reorganisation of the lexicon in de4c77a16e22d71e19abcaded39a834e5467089f is that different parts of speech are all lumped together.

Every single other Turkic transducer uses the lexicon names Nouns, Adjectives, Verbs, ProperNouns, etc. This is standardised for several reasons, one of which is so that we have an easy way to count the number of stems of a particular type. E.g., note that the countstems script was broken by your changes.
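(For illustration, a count of that kind can be as simple as the following sketch, which depends on the standard lexicon names. This is a hypothetical one-liner, not the actual countstems script:

# hypothetical sketch: count entry lines inside LEXICON Nouns
awk '/^LEXICON /{lex=$2} lex=="Nouns" && /;/ {n++} END {print n+0}' apertium-kaz.kaz.lexc

Once everything is lumped into one lexicon, there is nothing left for such a query to key on.)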

@IlnarSelimcan, could you justify why you did this reorganisation? Also, in principle this sort of major restructuring should be done in consultation with and by consensus among everyone it affects—that is, everyone who's committed to this repo, or at least the apertium-turkic mailing list.

IlnarSelimcan commented 5 years ago

First of all, apologies for having broken the old workflows.

Apertium-uzb and apertium-kaa are also affected by this.

If we decide to restore the old organisation, apertium-kaz can simply be reverted to https://github.com/apertium/apertium-kaz/commit/d9ee49dd5824ff34a181955a8c20e59faaf3de77. All subsequent changes were also made on https://raw.githubusercontent.com/taruen/apertiumpp/master/apertiumpp-kaz/lexicon.rkt (stems from which I plan to merge back into apertium-kaz in some sensible way, once I finish proofreading them against the explanatory dictionary).

Apertium-uzb and apertium-kaa had that organisation before GSoC, but committers didn't seem to be careful enough to avoid putting adjectives into LEXICON Nouns, nouns into LEXICON Adjectives, etc.

In short, the reasons I reduced the lexicons to Common, Proper, Punctuation and Abbreviations were: 1) people didn't seem to respect the separation into POS-based lexicons anyway; 2) duplications, duplications, duplications: the same word added as both N1 and N5 (OK, same lexicon), as both CS and CC, to both Adjectives and Nouns. A plain wordlist, kept in alphabetical order, makes such duplications jump out at you immediately.
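(To illustrate: with everything in one list, duplicated stems can even be surfaced mechanically. A rough sketch, not project tooling, and it ignores lexc escape sequences:

# hypothetical sketch: stems that occur on more than one entry line
grep -v -e '^LEXICON' -e '^!' -e '^$' apertium-kaz.kaz.lexc | sed 's/[: ].*//' | sort | uniq -d

though the point of a sorted list is that the duplicates are visible with no tooling at all.)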

Iirc, even the creators of .lexc admit (in the FSM book) that some more computationally-processable format should be used for storing the lexicons (from which the .lexc files are then derived). Either lexc2dix should be polished up so that we can easily query lexicons (to count stems, etc.), or we should just write lexicons in some other format. I see that as a real problem, but that's only my opinion.
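(As a sketch of what "easily query" could look like: entry lines can be flattened into a stem/continuation/lexicon table. This is an illustration of the idea, not a description of what lexc2dix does:

# hypothetical sketch: lexc entries as a tab-separated, sortable table
awk -v OFS='\t' '
    /^LEXICON/     { lex = $2; next }                   # remember the current lexicon
    /;/ && NF >= 3 { stem = $1; sub(/:.*/, "", stem)    # keep the upper side of the pair
                     print stem, $2, lex }
' apertium-kaz.kaz.lexc | sort

Counting stems of a given type is then just a matter of filtering on the continuation or lexicon column.)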

IlnarSelimcan commented 5 years ago

Last time I looked at it, lexc2dix was making some errors which I don't recall anymore.

mansayk commented 5 years ago

Hello to everyone!

If you don't mind, I would like to say a couple of words here.

My part in apertium-tat during the last few years has not been so big, and mostly consists of improving twol rules and working with the lexc file.

In my opinion, the classic organisation of the lexc file really does have the shortcomings Ilnar described.

It is really difficult to improve this dictionary when, for example, you can't see whether the word you are editing is present in other parts or not. Jumping all the time through the file is not an option here, and search doesn't help that much either, unfortunately. It slows you down significantly.

The situation gets drastically worse as the dictionary grows.

The new organisation will at least keep them all (nouns, adjectives, adverbs...) next to each other alphabetically, and solve one of the biggest problems I have run into here.

Actually, I don't know what advantages the classic organisation gives us; if you don't mind, maybe it is time to consider some changes?

With best wishes, Mansur

jonorthwash commented 5 years ago

@mansayk, thank you for sharing your view on this—it's very helpful.

I'd just like to clarify one point. You say:

Jumping all the time through the file is not an option here, and search doesn't help that much either, unfortunately. It slows you down significantly.

I'm not sure I understand what the problem is. Could you provide more information on what you're having trouble with?

IlnarSelimcan commented 5 years ago

When checking a lexc file for miscategorized stems (maybe having an alphabetically sorted reference dictionary at hand, maybe not, but especially if you do), you must see all occurrences of the particular stem in the lexc file (to see the continuations they have). That implies manual search.

That would mean typing Control-S in emacs, then typing in the word you're looking for, and jumping through the file. Or selecting the word and then searching: https://stackoverflow.com/questions/202803/searching-for-marked-selected-text-in-emacs

(I doubt that it's any faster in Vi(m) :P )

Imo that's significantly slower than going through an alphabetically sorted list and just deleting the lines where the stem has the wrong continuation lexicon.
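(Outside the editor, the lookup itself can of course be scripted, e.g. with a hypothetical one-liner like

grep -n '^алдында[: ]' apertium-kaz.kaz.lexc

which shows every entry line for that stem together with its continuation class, but in the editor you are still jumping between the matches by hand.)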

@mansayk, did you mean that?

ftyers commented 5 years ago

22:41 <spectie> actually, ilnar and mansur have a point here
22:41 <spectie> for the kind of work they are doing their system is better 
22:41 <spectie> how about a compromise like
22:41 <spectie> Open (N, V, Adv, A)
22:42 <spectie> then Closed
22:42 <spectie> and within Closed 
22:42 <spectie> Pronouns ; Determiners ; ...
22:42 <spectie> and then have separate lexicons for each of the closed categories 
22:42 <spectie> i think i would be happy with that 
22:42 <spectie> also, "weird irregular stuff" usually happens in closed categories 

IlnarSelimcan commented 5 years ago

Another thing is that categories are not independent of each other, so to speak. Sure, some stems can belong to several categories at once, but there are also cases where belonging to one category excludes belonging to another.

In my worldview at least, "foo A1" makes "foo ADV" redundant, as "hargle CC" would make "hargle CS" redundant (or incorrect).

Yet another issue is improperly lexicalised wordforms. Seeing "алдында ADV" right after "ал{д} N1" should make any conscientious lexicographer think.

IlnarSelimcan commented 5 years ago

I think I like what Fran suggested. Indeed pronouns especially tend to have lots of hardcoded entries anyway, hence it makes sense to keep them and other closed categories separate.

mansayk commented 5 years ago

Hi!

Jonathan, let's imagine the following situations:

  1. I need to add a new word to the lexc file:

    • I use my corpus to construct a frequency list.
    • I use apertium to mark up all the words in that list.
    • I remove the words that were successfully recognized and tagged by apertium.
    • I take the unrecognized words from the top of the list (the most frequent ones) one by one and insert them into the lexc file. Usually I open them as two tabs in vim, or just split the screen (vsplit). (A rough sketch of this pipeline appears after this list.)
    • Ilnar and I try to keep all the lists in the lexc file alphabetically sorted, though it is not easy because there are so many of them.
    • So I take a word and place it in, say, the adj n1 section and go on happily, and later I accidentally notice that the word was already present in the adj n3 section, or even in adv. Why didn't apertium tag it if it was already in the dict? There are several reasons, and one of the most frequent is mistakes in the twol rules; that's why I put all my attention into improving them over the last year. Let me know if there are still any problems in the twol rules. If all the POS categories were together in the same list, I would have seen that word with all its other POS tags in the first place. Why didn't I use search first? Mostly I do, but with short words it produces many wrong matches, and I have to use regex syntax and additional symbols...
  2. I need to change the POS tag of a word already in the lexc file:

    • I use search to locate the word. If the word is present in several sections, it is not convenient to jump through the whole file instead of seeing all its entries in one place. And we cannot just edit the tag of the miscategorized word: we also need to move it manually to the corresponding section, and to the right place among the other sorted words. What if I forget to do that last part? Judging by the many cases I have seen of words with one tag sitting in the wrong section, it is quite a common problem.
  3. Let's remember koguzhan's last commits to the Tatar lexc file. There were many duplications, I think mostly because of the lexc file's current messy structure. I made the same mistakes when I started contributing to Apertium several years ago. We could have avoided that if we had a single big sorted list.

  4. I have tried several times to go through the whole dictionary and check it word by word, because there are many miscategorized words in it. But when I see some word, it occurs to me that it should also be present in some other section(s). I set a vim mark on the current line and go searching here and there. Again jumps, and it is pretty distracting. Instead, I could just see all the tags of the word in one place if the whole list were sorted as one piece.
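A rough sketch of the pipeline from point 1, for concreteness (file names are placeholders; kaz.automorf.bin is the analyser the apertium-kaz build produces, and unknown forms come out of lt-proc as ^form/*form$):

# hypothetical sketch: frequency-ranked list of forms the analyser rejects
apertium-destxt < corpus.txt |
lt-proc kaz.automorf.bin |         # morphological analysis
sed 's/\$[^^]*\^/$\n^/g' |         # one ^...$ unit per line (GNU sed)
grep '/\*' |                       # keep the unanalysed units
sed 's/^\^\([^/]*\)\/.*/\1/' |     # strip each down to its surface form
sort | uniq -c | sort -rn > unknowns.txt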

These are only a couple of the problems that came to my mind in the middle of the night... Making all those manipulations for a single word might not sound so terrible, but we deal with hundreds and thousands of them, where it becomes messy and time-consuming and leads to new mistakes... I hope you agree with me that the file structure should help us avoid those mistakes, save us time, and be easy to use.

Please do not take this post as complaining. It is just some notes from the experience of a non-professional apertiumer. Thank you!

Best, Mansur

jonorthwash commented 5 years ago

Okay, I have a better sense now of what the reasoning is. These are valid reasons, and I've experienced these issues myself. I like Fran's proposal—to keep "open" and "closed" categories separate. I would argue that closed categories should be broken down much the way we had them—or we could include conjunctions and the like with the open categories so they're near adverbs. Pronouns and determiners should definitely go together. Numbers should probably remain separate.

In any case, I'm okay lumping various categories together for the reasons stated, but I also think there are certain ways that we should keep things separate. Does this make sense? Is my general philosophy towards it compatible with everyone else's?

ftyers commented 5 years ago

I was thinking something like:

LEXICON Root

Open ;
Closed ; 
Proper ; 
Punctuation ; 
Numerals ;

LEXICON Open

bar:bar N1 ; ! ""
foo:foo N1 ; ! ""
foo:foo V-TV ;  ! ""

LEXICON Closed 

Pronouns ;
Determiners ;
Conjunctions ;
Postpositions ; 

LEXICON Pronouns 

blah:blah PRON-PERS ; ! ""

LEXICON Proper 

LEXICON Punctuation

LEXICON Numerals 

mansayk commented 5 years ago

Hi!

I would suggest placing LEXICON Open at the very end of the file, so it is easier to find where it ends when we sort it.

Or maybe use some kind of @import TO LEXICON Open FROM FILE...

I am not sure about the second option. It doesn't seem like a good choice here, because it has its own shortcomings, and we try to keep the lexc file in one piece.

I just want to say that the Open and Proper categories are going to be huge, and to sort them we need to find the beginning and scroll down to the last line of the category without interfering with another one.

Maybe we just need some anchors there, and a universal bash script (with a different LC_COLLATE parameter for each language) that sorts all the categories in the lexc file when we run it?..
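A rough sketch of what I mean (the locale name is a placeholder, and blank or comment lines inside a block would get reordered along with the entries):

# hypothetical sketch: sort the body of every LEXICON block, keeping headers in place
awk -v OFS='\t' '
    /^LEXICON/ { b++; print b, 0, $0; next }   # a header sorts first in its block
               { print b, 1, $0 }              # entries sort after it, alphabetically
' apertium-kaz.kaz.lexc |
LC_COLLATE=kk_KZ.UTF-8 sort -t$'\t' -k1,1n -k2,2n -k3 |
cut -f3- > apertium-kaz.kaz.lexc.sorted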

Best, Mansur

jonorthwash commented 5 years ago

I would suggest placing LEXICON Open at the very end of the file, so it is easier to find where it ends when we sort it.

I'm used to having Punctuation and Numerals (and Guesser) at the end of the file, but it doesn't much matter. I think the reason these are normally at the end is that they're kind of "afterthoughts": once you have them set up, you're not going to touch them much. The latter is probably true of pronouns and determiners too, though, and those are usually at the beginning.

In any case, finding the end of a lexicon isn't difficult with vim: you just enter visual mode (v) at the top of the lexicon, and search (/) for LEXICON, and then go up one line to exclude that. Then sort (:sort). (I'm not at a computer now, so I might be misremembering a detail or two, but it's still doable.)
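An ex-command version of the same thing might look like this (again from memory, so treat it as a sketch; Nouns is just an example):

:/^LEXICON Nouns/+1;/^LEXICON/-1 sort

The ; makes the second search start from the first match, so the range covers exactly the lines between the two LEXICON headers.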

But I certainly don't mind having Open and Proper at the end of the file. It makes sense if they're the main lexicons that are going to change after some level of development. The main issue is that you can't have them both as the last lexicon of the file...

I propose a couple adjustments to @ftyers's proposal: