fasiha / clj-kuromoji-jmdictfurigana

Kuromoji is great. JmdictFurigana is great. Wouldn’t it be great if they got together?
The Unlicense

Potential form #1

Open siikamiika opened 7 years ago

siikamiika commented 7 years ago

Hey!

I noticed that you were working on a similar project and thought I'd let you know that there is a "problem" with UniDic's inflection information fields. When a godan verb is inflected to the potential form, they call it "下一段". While the potential form technically is an ichidan verb, this is confusing, at least when you try to look up the lemma in a dictionary.

Here's what I used to find out whether it is potential form or not when I translated the fields to English: https://github.com/siikamiika/unidic-mecab-translate/blob/master/translate_lex.py#L295

And here are some related experiments: https://github.com/siikamiika/scripts/tree/master/unidic-mecab-experiments

Hope this helped!

fasiha commented 7 years ago

😍 THANK YOU so much for seeking out my little repo and giving this valuable advice! I’m a huge beginner with Japanese (we’ll see how quickly I can learn 😁), and I can only understand the outlines of your explanation, but I will revisit it till I get it.

Poking around your GitHub: I love your projects, especially siikamiika/mecab-translate! It sounds like exactly what I need, a UniDic-based version of Jisho.org (which, as you probably know, is powered by https://github.com/Kimtaro/ve), in that it combines morphemes and points them to JMdict entries, even if they’re highly-inflected verbs. I am following!

siikamiika commented 7 years ago

> I can only understand the outlines of your explanation, but I will revisit it till I get it.

Then perhaps I should provide an example.

; f[0]:  pos1 動詞                   名詞
; f[1]:  pos2 一般                   普通名詞
; f[2]:  pos3 *                     サ変可能
; f[3]:  pos4 *                     *
; f[4]:  cType 五段-ラ行              *
; f[5]:  cForm 連用形-促音便          *
; f[6]:  lForm サエギル                 テスト
; f[7]:  lemma 遮る                    テスト-test
; f[8]:  orth 遮っ                      
; f[9]:  pron サエギッ                   
; f[10]: orthBase 遮る                
; f[11]: pronBase サエギル               
; f[12]: goshu 和
; f[13]: iType *
; f[14]: iForm *
; f[15]: fType *
; f[16]: fForm *
[siikamiika@espowered unidic-mecab-2.1.2_src]$ mecab -d .
漢字は書けますか?
漢字  名詞,普通名詞,一般,*,*,*,カンジ,漢字,漢字,カンジ,漢字,カンジ,漢,*,*,*,*
は   助詞,係助詞,*,*,*,*,ハ,は,は,ワ,は,ワ,和,*,*,*,*
書け  動詞,一般,*,*,下一段-カ行,連用形-一般,カク,書く,書け,カケ,書ける,カケル,和,*,*,*,*
ます  助動詞,*,*,*,助動詞-マス,終止形-一般,マス,ます,ます,マス,ます,マス,和,*,*,*,*
か   助詞,終助詞,*,*,*,*,カ,か,か,カ,か,カ,和,*,*,*,*
?   補助記号,句点,*,*,*,*,,?,?,,?,,記号,*,*,*,*

If you look at the line starting with 書け, you can see that the lemma (f[7]) is a godan verb but cType (f[4]) is 下一段-カ行. Fine. What if that meant the potential form?

[siikamiika@espowered unidic-mecab-2.1.2_src]$ mecab -d .
漢字はかけますか?
漢字  名詞,普通名詞,一般,*,*,*,カンジ,漢字,漢字,カンジ,漢字,カンジ,漢,*,*,*,*
は   助詞,係助詞,*,*,*,*,ハ,は,は,ワ,は,ワ,和,*,*,*,*
かけ  動詞,非自立可能,*,*,下一段-カ行,連用形-一般,カケル,掛ける,かけ,カケ,かける,カケル,和,カ濁,基本形,*,*
ます  助動詞,*,*,*,助動詞-マス,終止形-一般,マス,ます,ます,マス,ます,マス,和,*,*,*,*
か   助詞,終助詞,*,*,*,*,カ,か,か,カ,か,カ,和,*,*,*,*
?   補助記号,句点,*,*,*,*,,?,?,,?,,記号,*,*,*,*

Here we can see that 掛ける has the same reading as 書く's potential form (UniDic decided that かけます in kana more often means 掛ける than 書く), and that their cType is exactly the same. Okay, what if we looked up orthBase (f[10]) in the dictionary instead of lemma? It seems to be consistent with the given cType. Well, it turns out that there is no entry in JMdict for 書く's potential form. In addition, orthBase follows the orthography of the input, as can be seen in the 掛ける example, where it is all in kana.

This is probably fine with Japanese people because their intuition tells them whether a verb is in the potential form or not, and I don't think that UniDic was designed to be used with an English dictionary. Anyway, for us learners, this works better:

[siikamiika@espowered unidic-mecab-translate]$ mecab -d .
漢字は書けますか?
漢字  noun,common,ordinary,*,*,*,カンジ,漢字,漢字,カンジ,漢字,カンジ,漢,*,*,*,*
は   particle,binding,*,*,*,*,ハ,は,は,ワ,は,ワ,和,*,*,*,*
書け  verb,ordinary,*,*,potential,continuative,カク,書く,書け,カケ,書ける,カケル,和,*,*,*,*
ます  aux-verb,*,*,*,aux|masu,terminal,マス,ます,ます,マス,ます,マス,和,*,*,*,*
か   particle,final,*,*,*,*,カ,か,か,カ,か,カ,和,*,*,*,*
?   symbol,period,*,*,*,*,,?,?,,?,,記号,*,*,*,*

> I love your projects, especially siikamiika/mecab-translate!

Thanks! I love that project as well. When I started it about a year ago, I knew very little Japanese, but using it my reading skills have improved considerably.

About Kimtaro/ve, you're exactly right. The project even had ve's name in it (yes, I suck at naming projects).

About the project though, I am not experienced in professional web development and while I can find out how to do something in html/js/css with the help of Google, I must have reinvented the wheel poorly in many parts of the project. I'd appreciate some hacking if you happen to find the project useful. I haven't set a license, but feel free to copy parts of the code to your own projects.

fasiha commented 7 years ago

OH, now I see! That IS super-confusing for beginners and it’s awesome you found a good way to detect this situation.

I see that if I ask for the top five tokenizations (mecab -N5), the potential of 書く does show up at #4:

a b  c d e f g h
1 かけ カケ カケル 掛ける 動詞-非自立可能 下一段-カ行 連用形-一般
2 かけ カケ カケル 駆ける 動詞-一般 下一段-カ行 連用形-一般
3 かけ カケ カケル 掛ける 動詞-非自立可能 下一段-カ行 連用形-一般
4 かけ カケ カク 書く 動詞-一般 下一段-カ行 連用形-一般
5 かけ カケ カケル 掛ける 動詞-非自立可能 下一段-カ行 連用形-一般

I’d love to try unidic-mecab-translate and get the style of output you showed there with the translated inflection types/etc., and if it’s possible, can you give me a two-second summary of how to get that output with unidic-mecab-translate? Thank you!

(I ask because although I use some translations I found for these UniDic terms, it would be really nice to see potential marked. What I see in my app is a little more helpful than MeCab/Kuromoji’s raw output, but not by much!)

siikamiika commented 7 years ago

> can you give me a two-second summary of how to get that output with unidic-mecab-translate?

Do you mean how to use it with MeCab? You can download and extract the latest release from the project page and use the dictionary with mecab -d path/to/unidic-mecab-translate.

If you mean how to detect the potential form in general, referring to the CSV explanations in my previous comment, you

  1. Check that cType starts with 下一段 (it is an ichidan verb that ends in -eru and not -iru)
  2. Check that the last two characters of lForm are not equal to the last two characters of pronBase (there is a function that removes chouon (long vowel marks) because pronBase uses them but lForm does not)
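
The two checks above can be sketched in Python roughly as follows. This is a minimal sketch and not the code from translate_lex.py: the names `is_potential` and `strip_chouon` are hypothetical, the field positions are taken from the f[0]..f[16] legend earlier in the thread, and simply dropping ー is an assumption about how chouon is normalized. It also splits features naively on commas, which is fine for the raw UniDic lines shown in this thread.

```python
# Field positions follow the f[0]..f[16] legend earlier in this thread.
C_TYPE, L_FORM, PRON_BASE = 4, 6, 11
CHOUON = "ー"


def strip_chouon(kana):
    """Drop long vowel marks (assumption: the linked script may normalize differently)."""
    return kana.replace(CHOUON, "")


def is_potential(mecab_line):
    """Guess whether one line of UniDic MeCab output is a godan verb in potential form."""
    parts = mecab_line.split(None, 1)  # surface form, then the feature string
    if len(parts) != 2:  # "EOS" or a blank line
        return False
    features = parts[1].split(",")
    if len(features) <= PRON_BASE:  # too few fields to decide
        return False
    # Check 1: the conjugation type is 下一段.
    if not features[C_TYPE].startswith("下一段"):
        return False
    # Check 2: lForm and pronBase end differently once chouon is removed.
    l_form = strip_chouon(features[L_FORM])
    pron_base = strip_chouon(features[PRON_BASE])
    return l_form[-2:] != pron_base[-2:]
```

Fed the 書け line from the first example it returns True (カク versus カケル), while the かけ line with lemma 掛ける returns False because lForm and pronBase are both カケル.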

When testing unidic-mecab-translate with -N I noticed that the translations don't always show up. It may be due to some fallbacks defined in unk.def/right-id.def/left-id.def/rewrite.def but I don't know how they work. I should probably make the script translate them as well.

EDIT: Done. unidic-mecab-translate release 1.2 works as it should.

Also the translations aren't guaranteed to be accurate because I used JMDict and Google to guess what they could mean without spending too much time wondering what their deeper meaning was. The translations here are probably more accurate but some of them don't exist in the version of UniDic I'm using or at least are not in the list I've autogenerated from lex.csv.

You seem to use kuromoji instead of MeCab, but do you happen to know if -N ALWAYS gives exactly the number of tokenizations you ask for? I could use that for something in mecab-translate, but I fear that if I wait for EOS n times before I stop reading output, there will be a lock-up until some timeout that I specify. At least empty input gives just one EOS. Of course I could use some bindings instead of stdin/stdout, but that would add complexity.

(EDIT: I tried the input "a" with -N20 and got my answer. Seems like my options are limited to running a new mecab process for each request (SLOW), a timeout, or a WebSocket with an ID that is incremented client side and returned with each tokenization)
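
For reference, the first option (one mecab process per request) could look roughly like the sketch below. It is only an illustration: `split_nbest` and `tokenize_nbest` are hypothetical names, and it assumes a `mecab` binary on the PATH that emits UTF-8. Because the process exits when its input ends, the reader never has to guess how many EOS markers to wait for; it simply splits whatever came back.

```python
import subprocess


def split_nbest(output):
    """Split MeCab stdout into tokenizations, one list of lines per EOS marker."""
    results, current = [], []
    for line in output.splitlines():
        if line == "EOS":  # a token line always has a tab and features after the surface
            results.append(current)
            current = []
        elif line:
            current.append(line)
    return results


def tokenize_nbest(text, n, dicdir=None):
    """Run a fresh mecab process for one request and return up to n tokenizations."""
    cmd = ["mecab", f"-N{n}"]
    if dicdir is not None:
        cmd += ["-d", dicdir]
    proc = subprocess.run(
        cmd,
        input=text + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return split_nbest(proc.stdout)
```

The result may contain fewer than n entries, so callers should use `len()` on the returned list instead of assuming n.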

siikamiika commented 7 years ago

I found at least one case where the potential form is incorrectly detected.

original 読んでてさ pos1 pos2 pos3 pos4 cType cForm lForm lemma orth pron orthBase pronBase
読ん verb ordinary * * godan continuative,nasal ヨム 読む 読ん ヨン 読む ヨム * * * *
aux-verb * * * potential continuative テル てる でる デル * * * *
particle conjunctive * * * * * * * *
particle final * * * * * * * *

That's because the lForm is テル and pronBase is デル. This could be detected by comparing the lengths of the strings but then the potential form of 愛る, 愛る, wouldn't be detected. Maybe it's best to add this as a special case.
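
One way the special case could be added is sketched below. This is an assumption about how it might be done, not the project's actual fix: `KNOWN_FALSE_POSITIVES` and `is_potential_fields` are hypothetical names, chouon is simply dropped, and the cType value 下一段-タ行 used for てる in the usage note is assumed, since the table above only shows the translated value.

```python
# (lForm, pronBase) pairs that pass both checks but are not potential forms.
KNOWN_FALSE_POSITIVES = {("テル", "デル")}


def is_potential_fields(c_type, l_form, pron_base):
    """Potential-form check with an explicit exclusion list for known false positives."""
    if not c_type.startswith("下一段"):
        return False
    if (l_form, pron_base) in KNOWN_FALSE_POSITIVES:
        return False
    # Compare the last two characters after dropping long vowel marks.
    return l_form.replace("ー", "")[-2:] != pron_base.replace("ー", "")[-2:]
```

With this, `is_potential_fields("下一段-タ行", "テル", "デル")` is False, while `is_potential_fields("下一段-カ行", "カク", "カケル")` is still True.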

(Text from here.)

aehlke commented 5 months ago

For the record, I found this by investigating a similar "problem" with ipadic