Open siikamiika opened 7 years ago
😍 THANK YOU so much for seeking out my little repo and giving this valuable advice! I’m a huge beginner with Japanese (we’ll see how quickly I can learn 😁), and I can only understand the outlines of your explanation, but I will revisit it till I get it.
Poking around your GitHub: I love your projects, especially siikamiika/mecab-translate! It sounds like exactly what I need, a UniDic-based version of Jisho.org (which, as you probably know, is powered by https://github.com/Kimtaro/ve), in that it combines morphemes and points them to JMdict entries, even if they’re highly inflected verbs. I am following!
> I can only understand the outlines of your explanation, but I will revisit it till I get it.
Then perhaps I should provide an example.
; f[0]: pos1 動詞 名詞
; f[1]: pos2 一般 普通名詞
; f[2]: pos3 * サ変可能
; f[3]: pos4 * *
; f[4]: cType 五段-ラ行 *
; f[5]: cForm 連用形-促音便 *
; f[6]: lForm サエギル テスト
; f[7]: lemma 遮る テスト-test
; f[8]: orth 遮っ
; f[9]: pron サエギッ
; f[10]: orthBase 遮る
; f[11]: pronBase サエギル
; f[12]: goshu 和
; f[13]: iType *
; f[14]: iForm *
; f[15]: fType *
; f[16]: fForm *
[siikamiika@espowered unidic-mecab-2.1.2_src]$ mecab -d .
漢字は書けますか?
漢字 名詞,普通名詞,一般,*,*,*,カンジ,漢字,漢字,カンジ,漢字,カンジ,漢,*,*,*,*
は 助詞,係助詞,*,*,*,*,ハ,は,は,ワ,は,ワ,和,*,*,*,*
書け 動詞,一般,*,*,下一段-カ行,連用形-一般,カク,書く,書け,カケ,書ける,カケル,和,*,*,*,*
ます 助動詞,*,*,*,助動詞-マス,終止形-一般,マス,ます,ます,マス,ます,マス,和,*,*,*,*
か 助詞,終助詞,*,*,*,*,カ,か,か,カ,か,カ,和,*,*,*,*
? 補助記号,句点,*,*,*,*,,?,?,,?,,記号,*,*,*,*
If you look at the line starting with 書け, you can see that the `lemma` (f[7]) is a godan verb but `cType` (f[4]) is 下一段-カ行. Fine. What if that meant the potential form?
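To make the field positions concrete, here is a minimal Python sketch that splits one MeCab output line into the named fields listed above. The names `UNIDIC_FIELDS` and `parse_line` are made up for illustration, and the sketch assumes the default output layout (surface, whitespace, then 17 comma-separated features).

```python
import csv

# Field names in the order f[0] .. f[16] from the legend above.
UNIDIC_FIELDS = [
    "pos1", "pos2", "pos3", "pos4", "cType", "cForm", "lForm", "lemma",
    "orth", "pron", "orthBase", "pronBase", "goshu",
    "iType", "iForm", "fType", "fForm",
]


def parse_line(line):
    """Split one MeCab output line into (surface, {field name: value})."""
    # The surface and the feature string are separated by whitespace
    # (a tab in MeCab's default output format).
    surface, features = line.rstrip("\n").split(None, 1)
    # csv.reader is used instead of str.split so quoted commas survive.
    values = next(csv.reader([features]))
    return surface, dict(zip(UNIDIC_FIELDS, values))


line = "書け 動詞,一般,*,*,下一段-カ行,連用形-一般,カク,書く,書け,カケ,書ける,カケル,和,*,*,*,*"
surface, fields = parse_line(line)
print(surface, fields["lemma"], fields["cType"], fields["orthBase"])
```

With the 書け line from the output above, this makes the mismatch easy to see: `lemma` is the godan verb 書く while `cType` says 下一段-カ行.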
[siikamiika@espowered unidic-mecab-2.1.2_src]$ mecab -d .
漢字はかけますか?
漢字 名詞,普通名詞,一般,*,*,*,カンジ,漢字,漢字,カンジ,漢字,カンジ,漢,*,*,*,*
は 助詞,係助詞,*,*,*,*,ハ,は,は,ワ,は,ワ,和,*,*,*,*
かけ 動詞,非自立可能,*,*,下一段-カ行,連用形-一般,カケル,掛ける,かけ,カケ,かける,カケル,和,カ濁,基本形,*,*
ます 助動詞,*,*,*,助動詞-マス,終止形-一般,マス,ます,ます,マス,ます,マス,和,*,*,*,*
か 助詞,終助詞,*,*,*,*,カ,か,か,カ,か,カ,和,*,*,*,*
? 補助記号,句点,*,*,*,*,,?,?,,?,,記号,*,*,*,*
Here we can see that 掛ける has the same reading as 書く's potential form (UniDic decided that かけます in kana more often means 掛ける than 書く) and their `cType` is exactly the same. Okay, what if we looked up `orthBase` (f[10]) in the dictionary instead of `lemma`? It seems to be consistent with the given `cType`. Well, it turns out that there is no entry in JMDict for 書く's potential form. In addition, `orthBase` follows the orthography of the input, as can be seen in the 掛ける example, where it is all in kana.
This is probably fine for Japanese people because their intuition tells them whether a verb is in the potential form or not, and I don't think that UniDic was designed to be used with an English dictionary. Anyway, for us learners, this works better:
[siikamiika@espowered unidic-mecab-translate]$ mecab -d .
漢字は書けますか?
漢字 noun,common,ordinary,*,*,*,カンジ,漢字,漢字,カンジ,漢字,カンジ,漢,*,*,*,*
は particle,binding,*,*,*,*,ハ,は,は,ワ,は,ワ,和,*,*,*,*
書け verb,ordinary,*,*,potential,continuative,カク,書く,書け,カケ,書ける,カケル,和,*,*,*,*
ます aux-verb,*,*,*,aux|masu,terminal,マス,ます,ます,マス,ます,マス,和,*,*,*,*
か particle,final,*,*,*,*,カ,か,か,カ,か,カ,和,*,*,*,*
? symbol,period,*,*,*,*,,?,?,,?,,記号,*,*,*,*
> I love your projects, especially siikamiika/mecab-translate!
Thanks! I love that project as well. When I started it about a year ago, I knew very little Japanese, but using it my reading skills have improved considerably.
About Kimtaro/ve, you're exactly right. The project even had ve's name in it (yes, I suck at naming projects).
About the project though, I am not experienced in professional web development and while I can find out how to do something in html/js/css with the help of Google, I must have reinvented the wheel poorly in many parts of the project. I'd appreciate some hacking if you happen to find the project useful. I haven't set a license, but feel free to copy parts of the code to your own projects.
OH, now I see! That IS super-confusing for beginners and it’s awesome you found a good way to detect this situation.
I see that if I ask for the top five tokenizations (`mecab -N5`), the potential of 書く does show up at #4:
a | b | c | d | e | f | g | h |
---|---|---|---|---|---|---|---|
1 | かけ | カケ | カケル | 掛ける | 動詞-非自立可能 | 下一段-カ行 | 連用形-一般 |
2 | かけ | カケ | カケル | 駆ける | 動詞-一般 | 下一段-カ行 | 連用形-一般 |
3 | かけ | カケ | カケル | 掛ける | 動詞-非自立可能 | 下一段-カ行 | 連用形-一般 |
4 | かけ | カケ | カク | 書く | 動詞-一般 | 下一段-カ行 | 連用形-一般 |
5 | かけ | カケ | カケル | 掛ける | 動詞-非自立可能 | 下一段-カ行 | 連用形-一般 |
I’d love to try unidic-mecab-translate and get the style of output you showed there with the translated inflection types/etc., and if it’s possible, can you give me a two-second summary of how to get that output with unidic-mecab-translate? Thank you!
(I ask because although I use some translations I found for these UniDic terms, it would be really nice to see `potential` marked. This is what I see in my app, which is a little more helpful than MeCab/Kuromoji’s raw output but not by much:
{ literal = "漢字", literalPronunciation = "カンジ", writtenForm = "漢字", writtenBaseForm = "漢字", lemma = "漢字", lemmaReading = "カンジ", lemmaPronunciation = "カンジ", partOfSpeech = ["noun","common","general"], conjugation = ["uninflected"], conjugationType = [], position = 0, languageType = "漢", furigana = Just ([Furigana "漢" "かん" AutoFurigana,Furigana "字" "じ" AutoFurigana]) }
{ literal = "は", literalPronunciation = "ワ", writtenForm = "は", writtenBaseForm = "は", lemma = "は", lemmaReading = "ハ", lemmaPronunciation = "ワ", partOfSpeech = ["particle","binding"], conjugation = ["uninflected"], conjugationType = [], position = 2, languageType = "和", furigana = Nothing }
{ literal = "かけ", literalPronunciation = "カケ", writtenForm = "かけ", writtenBaseForm = "かける", lemma = "掛ける", lemmaReading = "カケル", lemmaPronunciation = "カケル", partOfSpeech = ["verb","bound"], conjugation = ["continuative","general"], conjugationType = ["shimoichidan-verb-e-row","ka-column"], position = 3, languageType = "和", furigana = Nothing }
{ literal = "ます", literalPronunciation = "マス", writtenForm = "ます", writtenBaseForm = "ます", lemma = "ます", lemmaReading = "マス", lemmaPronunciation = "マス", partOfSpeech = ["auxiliary-verb"], conjugation = ["conclusive","general"], conjugationType = ["auxiliary","masu"], position = 5, languageType = "和", furigana = Nothing }
{ literal = "か", literalPronunciation = "カ", writtenForm = "か", writtenBaseForm = "か", lemma = "か", lemmaReading = "カ", lemmaPronunciation = "カ", partOfSpeech = ["particle","phrase-final"], conjugation = ["uninflected"], conjugationType = [], position = 7, languageType = "和", furigana = Nothing }
{ literal = "?", literalPronunciation = "", writtenForm = "?", writtenBaseForm = "?", lemma = "?", lemmaReading = "", lemmaPronunciation = "", partOfSpeech = ["supplementary-symbol","period"], conjugation = ["uninflected"], conjugationType = [], position = 8, languageType = "記号", furigana = Nothing }
)
> can you give me a two-second summary of how to get that output with unidic-mecab-translate?
Do you mean how to use it with MeCab? You can download and extract the latest release from the project page and use the dictionary with `mecab -d path/to/unidic-mecab-translate`.
If you mean how to detect the potential form in general, referring to the CSV explanations in my previous comment, you can check that:
- `cType` starts with 下一段 (is an ichidan verb that ends in -eru and not -iru)
- `lForm`'s last 2 characters are not equal to `pronBase`'s last 2 characters (there is a function that removes chouon (long vowel marks) because `pronBase` uses them but `lForm` does not)

When testing unidic-mecab-translate with `-N` I noticed that the translations don't always show up. It may be due to some fallbacks defined in unk.def/right-id.def/left-id.def/rewrite.def but I don't know how they work. I should probably make the script translate them as well.
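The two conditions for detecting the potential form (`cType` starts with 下一段, and the last two characters of `lForm` and `pronBase` differ once chouon are removed) can be sketched in Python roughly as follows. This is an illustration written for this thread, not a copy of translate_lex.py; the function names `strip_chouon` and `looks_potential` are hypothetical.

```python
def strip_chouon(kana):
    """Remove long vowel marks (ー), which pronBase uses but lForm does not."""
    return kana.replace("ー", "")


def looks_potential(c_type, l_form, pron_base):
    """Guess whether a UniDic entry is the potential form of a godan verb.

    c_type, l_form and pron_base are the raw (untranslated) UniDic fields
    f[4], f[6] and f[11].
    """
    # Condition 1: UniDic labels the entry as an ichidan (-eru) verb.
    if not c_type.startswith("下一段"):
        return False
    # Condition 2: the lemma reading and the base pronunciation end differently.
    return l_form[-2:] != strip_chouon(pron_base)[-2:]


# 書け (potential of 書く): lForm カク, pronBase カケル
print(looks_potential("下一段-カ行", "カク", "カケル"))
# かけ (plain 掛ける): lForm カケル, pronBase カケル
print(looks_potential("下一段-カ行", "カケル", "カケル"))
```

The example values are taken from the 書け and かけ lines in the MeCab output earlier in this thread.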
EDIT: Done. unidic-mecab-translate release 1.2 works as it should.
Also the translations aren't guaranteed to be accurate because I used JMDict and Google to guess what they could mean without spending too much time wondering what their deeper meaning was. The translations here are probably more accurate but some of them don't exist in the version of UniDic I'm using or at least are not in the list I've autogenerated from lex.csv.
You seem to use kuromoji instead of MeCab, but do you happen to know if `-N` ALWAYS gives exactly the amount of tokenizations you ask for? I could use that for something in mecab-translate, but I fear that if I expect `EOS` n times before stopping to wait for output, there will be a lock-up until some timeout that I specify. At least empty input gives just one `EOS`. Of course I could use some bindings instead of stdin/stdout, but that would add complexity.
(EDIT: I tried to input `a` with `-N20` and got my answer. Seems like my options are limited to running a new mecab for each request (SLOW), a timeout, or a WebSocket with an ID that is incremented client side and returned back with each tokenization.)
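As a rough illustration of the first option (a new mecab process per request), the Python sketch below reads the complete output of one run and splits it on `EOS` lines, so it never has to guess how many analyses `-N` will actually return. The helper names `split_nbest` and `tokenize_nbest` are hypothetical, and the sketch assumes `mecab` is on the PATH and that the dictionary is UTF-8.

```python
import subprocess


def split_nbest(output):
    """Split MeCab N-best output into analyses, one list of lines per EOS."""
    analyses, current = [], []
    for line in output.splitlines():
        if line == "EOS":
            # EOS closes the current analysis, even if it has no tokens.
            analyses.append(current)
            current = []
        else:
            current.append(line)
    return analyses


def tokenize_nbest(text, n, dicdir):
    """Run a fresh mecab process for one request (simple, but slow)."""
    result = subprocess.run(
        ["mecab", f"-N{n}", "-d", dicdir],
        input=text + "\n",
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    # The process has exited, so however many analyses came back is final.
    return split_nbest(result.stdout)
```

Because the process exits after one input line, there is nothing to wait for and no timeout is needed. The cost is the startup time of loading the dictionary on every request.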
I found at least one case where the potential form is incorrectly detected.
original 読んでてさ | pos1 | pos2 | pos3 | pos4 | cType | cForm | lForm | lemma | orth | pron | orthBase | pronBase | goshu | iType | iForm | fType | fForm |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
読ん | verb | ordinary | * | * | godan | continuative,nasal | ヨム | 読む | 読ん | ヨン | 読む | ヨム | 和 | * | * | * | * |
で | aux-verb | * | * | * | potential | continuative | テル | てる | で | デ | でる | デル | 和 | * | * | * | * |
て | particle | conjunctive | * | * | * | * | テ | て | て | テ | て | テ | 和 | * | * | * | * |
さ | particle | final | * | * | * | * | サ | さ | さ | サ | さ | サ | 和 | * | * | * | * |
That's because the `lForm` is テル and `pronBase` is デル. This could be detected by comparing the lengths of the strings, but then the potential form of 愛する, 愛せる, wouldn't be detected. Maybe it's best to add this as a special case.
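The trade-off can be checked with a small Python sketch that runs both comparisons on three readings. The 書ける and でる values come from the output in this thread; the readings for 愛せる (`lForm` アイスル, `pronBase` アイセル) are an assumption, and all function and variable names are hypothetical.

```python
def suffix_differs(l_form, pron_base):
    """Current check: the last two characters differ."""
    return l_form[-2:] != pron_base[-2:]


def length_differs(l_form, pron_base):
    """Alternative check: the strings have different lengths."""
    return len(l_form) != len(pron_base)


# Known (lForm, pronBase) pairs that the suffix check gets wrong.
KNOWN_FALSE_POSITIVES = {("テル", "デル")}


def is_potential(l_form, pron_base):
    """Suffix check plus a special case for known false positives."""
    if (l_form, pron_base) in KNOWN_FALSE_POSITIVES:
        return False
    return suffix_differs(l_form, pron_base)


CASES = {
    "書ける": ("カク", "カケル"),        # potential of 書く
    "でる": ("テル", "デル"),            # auxiliary てる, not a potential
    "愛せる": ("アイスル", "アイセル"),  # potential of 愛する (assumed readings)
}

for word, (l_form, pron_base) in CASES.items():
    print(word,
          suffix_differs(l_form, pron_base),
          length_differs(l_form, pron_base),
          is_potential(l_form, pron_base))
```

The suffix check flags all three, including でる. The length check rejects でる but also misses 愛せる. Only the suffix check with the special case separates them.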
(Text from here.)
For the record, I found this by investigating a similar "problem" with ipadic
Hey!
I noticed that you were working on a similar project and thought I'd let you know that there is a "problem" with UniDic's inflection information fields. When a godan verb is inflected to the potential form, UniDic calls it "下一段". While the potential form technically is an ichidan verb, this is confusing, at least when you try to look up the lemma in a dictionary.
Here's what I used to find out whether it is potential form or not when I translated the fields to English: https://github.com/siikamiika/unidic-mecab-translate/blob/master/translate_lex.py#L295
And here are some related experiments: https://github.com/siikamiika/scripts/tree/master/unidic-mecab-experiments
Hope this helped!