fasiha / curtiz-japanese-nlp

Use Japanese NLP tools to annotate Curtiz (version 2) Markdown files
The Unlicense

bug? No hits, only kanjidic, for example phrase on Mac OS #9

Closed louismollick closed 11 months ago

louismollick commented 11 months ago

Hi!

First off, thank you very much for all your work on this package! I'm trying to make an app which provides word-by-word translation for Japanese sentences, and most existing solutions don't offer this "all-in-one" functionality (tokenization + dict) -- except for ichiran, which I'm currently running as a Docker instance. That's not ideal, since I'd prefer a Node.js package instead.

So needless to say, I'm interested in getting this package to work for me!

Now for the "bug": I only get results in kanjidic (empty array in furigana / hits) for the example phrase へましたらリーダーに切られるだけ. I get this same output for both:


curtiz.json

```
[
  {
    "furigana": [],
    "hits": [],
    "kanjidic": {
      "切": {
        "nanori": [ "きつ", "きり", "ぎり" ],
        "readings": [ "セツ", "サイ", "き.る", "-き.る", "き.り", "-き.り", "-ぎ.り", "き.れる", "-き.れる", "き.れ", "-き.れ", "-ぎ.れ" ],
        "meanings": [ "cut", "cutoff", "be sharp" ],
        "literal": "切",
        "dependencies": [
          {
            "node": "七",
            "nodeMapped": {
              "nanori": [ "し", "しっ", "な", "ひち" ],
              "readings": [ "シチ", "なな", "なな.つ", "なの" ],
              "meanings": [ "seven" ],
              "literal": "七"
            },
            "children": []
          },
          {
            "node": "刀",
            "nodeMapped": {
              "nanori": [ "き", "ち", "と", "わき" ],
              "readings": [ "トウ", "かたな", "そり" ],
              "meanings": [ "sword", "saber", "knife" ],
              "literal": "刀"
            },
            "children": []
          }
        ]
      }
    },
    "bunsetsus": [
      { "morphemes": [], "idx": 0, "parent": -1 },
      { "morphemes": [], "idx": 0, "parent": -1 }
    ]
  }
]
```
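
For anyone hitting the same symptom, here is a minimal sketch of a sanity check for this kind of output. `looksUnparsed` is a hypothetical helper, not part of curtiz-japanese-nlp, and it assumes only the field names visible in the JSON above (`furigana`, `hits`, `bunsetsus`, `morphemes`).

```javascript
// Hypothetical helper, not part of curtiz-japanese-nlp.
// Returns true when an analysis entry carries no morphological data at all,
// which is the symptom described in this issue (only kanjidic is populated).
function looksUnparsed(entry) {
  const noFurigana = !entry.furigana || entry.furigana.length === 0;
  const noHits = !entry.hits || entry.hits.length === 0;
  const bunsetsus = entry.bunsetsus || [];
  const noMorphemes = bunsetsus.every(
    (b) => !b.morphemes || b.morphemes.length === 0
  );
  return noFurigana && noHits && noMorphemes;
}

// Trimmed-down copy of the shape reported above.
const sample = {
  furigana: [],
  hits: [],
  kanjidic: { '切': { literal: '切' } },
  bunsetsus: [
    { morphemes: [], idx: 0, parent: -1 },
    { morphemes: [], idx: 0, parent: -1 },
  ],
};

console.log(looksUnparsed(sample)); // prints true
```

A `true` here would suggest the morphological analyzer produced nothing usable, so the analyzer setup is probably a better place to look than the dictionary JSON files.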

I am using the 3 files as mentioned in the README:

And for the dependencies I did:

I am running node v20.3.0 and Mac 12.6.

Things I've tried:

$ jdepp                                   
(input: STDIN [-I 0])
へましたらリーダーに切られるだけ
* 0 1D
へ   助詞,格助詞,*,*,へ,へ,*
* 1 3D
ましたら    動詞,*,子音動詞サ行,タ系条件形,ます,ましたら,代表表記:増す/ます 反義:動詞:減らす/へらす;動詞:減る/へる
* 2 3D
リーダー    名詞,普通名詞,*,*,リーダー,りーだー,代表表記:リーダー/りーだー カテゴリ:人
に   助詞,格助詞,*,*,に,に,連語
* 3 -1D
切ら  動詞,*,子音動詞ラ行,未然形,切る,きら,代表表記:切る/きる 補文ト 付属動詞候補(基本) 自他動詞:自:切れる/きれる
れる  接尾辞,動詞性接尾辞,母音動詞,基本形,れる,れる,代表表記:れる/れる
だけ  助詞,副助詞,*,*,だけ,だけ,*
EOS

Below is an image of my working directory, maybe something looks off?

Screen Shot 2023-12-20 at 7 54 04 PM

Thanks!

fasiha commented 11 months ago

Thanks for checking out my library! I really like Ichiran and I'm working on a way to merge MeCab (via Curtiz) and Ichiran but I do know what you mean about an all-Node solution being ideal 😅

I think what's happening is your mecab isn't using Unidic. Can you share the output of mecab -D? Here's what it prints on mine:

$ mecab -D
filename:   /opt/homebrew/lib/mecab/dic/unidic/sys.dic
version:    102
charset:    utf8
type:   0
size:   756463
left size:  5981
right size: 5981

Notice the filename. When you invoke mecab with no arguments, it finds this dictionary via the dicdir setting in /opt/homebrew/etc/mecabrc, which for me is:

;
; Configuration file of MeCab
;
; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
;
;dicdir =  /opt/homebrew/lib/mecab/dic/ipadic
dicdir = /opt/homebrew/lib/mecab/dic/unidic
; userdic = /home/foo/bar/user.dic

; output-format-type = wakati
; input-buffer-size = 8192

; node-format = %m\n
; bos-format = %S\n
; eos-format = EOS\n
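
As a rough illustration of how that file is read, here is a small sketch that extracts the active `dicdir` from mecabrc-style text. `parseMecabrc` is a made-up name for this example and is not MeCab's own parser. It assumes only what the file above shows: `key = value` lines, with `;` starting a comment.

```javascript
// Minimal sketch (not MeCab's real parser): read key = value pairs from
// mecabrc-style text, skipping blank lines and lines that start with ';'.
function parseMecabrc(text) {
  const settings = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith(';')) continue; // comment or blank
    const eq = line.indexOf('=');
    if (eq === -1) continue; // not a key = value line
    const key = line.slice(0, eq).trim();
    const value = line.slice(eq + 1).trim();
    settings[key] = value;
  }
  return settings;
}

// A few lines copied from the configuration shown above.
const exampleRc = [
  ';dicdir =  /opt/homebrew/lib/mecab/dic/ipadic',
  'dicdir = /opt/homebrew/lib/mecab/dic/unidic',
  '; userdic = /home/foo/bar/user.dic',
].join('\n');

console.log(parseMecabrc(exampleRc).dicdir);
// prints "/opt/homebrew/lib/mecab/dic/unidic"
```

The commented-out IPADIC line is ignored, so only the UniDic directory is reported as active.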

Therefore when I run MeCab on your example, I get very different output:

$ echo へましたらリーダーに切られるだけ | mecab
へま  ヘマ  ヘマ  へま  名詞-普通名詞-形状詞可能
し   シ   スル  為る  動詞-非自立可能    サ行変格    連用形-一般
たら  タラ  タ   た   助動詞 助動詞-タ   仮定形-一般
リーダー    リーダー    リーダー    リーダー-leader 名詞-普通名詞-一般
に   ニ   ニ   に   助詞-格助詞
切ら  キラ  キル  切る  動詞-非自立可能    五段-ラ行   未然形-一般
れる  レル  レル  れる  助動詞 助動詞-レル  連体形-一般
だけ  ダケ  ダケ  だけ  助詞-副助詞
EOS

Notice how Unidic outputs tab-spaced columns whereas your jdepp shows comma-separated values, which reminds me of IPADIC?
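
If it helps to check this programmatically, here is a heuristic sketch that classifies one analyzer output line by its separators. `guessFeatureStyle` is a hypothetical name, the thresholds are arbitrary, and it assumes the real output uses tab characters between columns (the pasted text in this thread may show them as spaces).

```javascript
// Heuristic only: guess the column style of one analyzer output line by
// counting tab and comma separators. Thresholds are arbitrary assumptions.
function guessFeatureStyle(line) {
  const tabs = (line.match(/\t/g) || []).length;
  const commas = (line.match(/,/g) || []).length;
  if (tabs >= 2 && commas === 0) return 'tab-separated';
  if (commas >= 2) return 'comma-separated';
  return 'unknown';
}

// One line in each style, taken from the outputs quoted in this thread.
const unidicStyle = 'に\tニ\tニ\tに\t助詞-格助詞';
const commaStyle = 'に\t助詞,格助詞,*,*,に,に,連語';

console.log(guessFeatureStyle(unidicStyle)); // prints "tab-separated"
console.log(guessFeatureStyle(commaStyle)); // prints "comma-separated"
```

This only looks at separators, so it cannot say which dictionary produced a line; it just flags output that does not have the tab-separated shape shown above.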

Here's what jdepp outputs for me:

$ echo へましたらリーダーに切られるだけ|mecab | jdepp
(input: STDIN [-I 0])
# S-ID: 1; J.DepP
* 0 2D
へま  ヘマ  ヘマ  へま  名詞-普通名詞-形状詞可能
し   シ   スル  為る  動詞-非自立可能    サ行変格    連用形-一般
たら  タラ  タ   た   助動詞 助動詞-タ   仮定形-一般
* 1 2D
リーダー    リーダー    リーダー    リーダー-leader 名詞-普通名詞-一般
に   ニ   ニ   に   助詞-格助詞
* 2 -1D
切ら  キラ  キル  切る  動詞-非自立可能    五段-ラ行   未然形-一般
れる  レル  レル  れる  助動詞 助動詞-レル  連体形-一般
だけ  ダケ  ダケ  だけ  助詞-副助詞
EOS

Note that I built JDepP with Unidic support.
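
To make the structure of that output concrete, here is a minimal sketch that groups it into bunsetsu. `parseJdeppOutput` is a hypothetical function, not the parser used inside Curtiz. It assumes the format shown above: `* <idx> <parent>D` header lines, tab-separated morpheme lines, `#` comment lines, and a closing `EOS`.

```javascript
// Minimal sketch, not the Curtiz implementation: group JDepP output lines
// into bunsetsu using the "* <idx> <parent>D" header lines.
function parseJdeppOutput(text) {
  const bunsetsus = [];
  for (const line of text.split(/\r?\n/)) {
    // Skip blanks, the sentence terminator, comments, and the stdin banner.
    // (A morpheme whose surface form starts with '#' would be skipped too;
    // this sketch ignores that edge case.)
    if (line === '' || line === 'EOS') continue;
    if (line.startsWith('#') || line.startsWith('(input:')) continue;
    const header = line.match(/^\* (\d+) (-?\d+)D/);
    if (header) {
      bunsetsus.push({
        idx: Number(header[1]),
        parent: Number(header[2]), // -1 marks the root bunsetsu
        morphemes: [],
      });
    } else if (bunsetsus.length > 0) {
      // Keep only the surface form, which is the first tab-separated column.
      bunsetsus[bunsetsus.length - 1].morphemes.push(line.split('\t')[0]);
    }
  }
  return bunsetsus;
}

const jdeppSample = [
  '# S-ID: 1; J.DepP',
  '* 0 2D',
  'へま\tヘマ\tヘマ\tへま\t名詞-普通名詞-形状詞可能',
  'し\tシ\tスル\t為る\t動詞-非自立可能\tサ行変格\t連用形-一般',
  'たら\tタラ\tタ\tた\t助動詞\t助動詞-タ\t仮定形-一般',
  '* 1 2D',
  'リーダー\tリーダー\tリーダー\tリーダー-leader\t名詞-普通名詞-一般',
  'に\tニ\tニ\tに\t助詞-格助詞',
  '* 2 -1D',
  '切ら\tキラ\tキル\t切る\t動詞-非自立可能\t五段-ラ行\t未然形-一般',
  'れる\tレル\tレル\tれる\t助動詞\t助動詞-レル\t連体形-一般',
  'だけ\tダケ\tダケ\tだけ\t助詞-副助詞',
  'EOS',
].join('\n');

for (const b of parseJdeppOutput(jdeppSample)) {
  console.log(`${b.idx} -> ${b.parent}: ${b.morphemes.join('')}`);
}
// prints:
// 0 -> 2: へましたら
// 1 -> 2: リーダーに
// 2 -> -1: 切られるだけ
```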

Let me know if the above is helpful!

louismollick commented 11 months ago

So after much agony, I finally installed mecab & jdepp with unidic support, but I'm still having the same issue when running curtiz demo.ts:

[
  {
    furigana: [],
    hits: [],
    kanjidic: {
      '切': ....etc....
   }
]

Yet I've managed to get mecab to use the correct unidic:

mecab -D
filename:       /usr/local/lib/mecab/dic/unidic/sys.dic
version:        102
charset:        utf8
type:   0
size:   756463
left size:      5981
right size:     5981

And here's a log of me building jdepp with unidic support like this: ./configure --with-mecab-dict=UNI && make model && make install

And I am able to get the same result as you for: echo へましたらリーダーに切られるだけ|mecab | jdepp

See this file: jdepp-build.txt

The main "difference"(?) / challenge I faced came after I ran brew install mecab and brew install mecab-unidic and updated mecabrc:

cat /usr/local/etc/mecabrc              
;
; Configuration file of MeCab
;
; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
;
; dicdir =  /usr/local/lib/mecab/dic/ipadic
dicdir = /usr/local/lib/mecab/dic/unidic

; userdic = /home/foo/bar/user.dic

; output-format-type = wakati
; input-buffer-size = 8192

; node-format = %m\n
; bos-format = %S\n
; eos-format = EOS\n

But for some reason, by default which mecab and which mecab-config pointed to /opt/local/bin/mecab and /opt/local/bin/mecab-config, DESPITE mecab-unidic only being installed to /usr/local/lib/mecab/dic/unidic.

So when jdepp was building, it was giving me something like: checking for mecab... /opt/local/bin/mecab no such file or directory: /opt/local/bin/mecab

So my solution was just to delete /opt/local/bin/mecab and /opt/local/bin/mecab-config, and now which mecab prints /usr/local/bin/mecab. After that, jdepp's ./configure was able to succeed.

BTW I think the reason my installation goes to /usr/local is that I'm using an Intel Mac instead of an ARM one: https://apple.stackexchange.com/a/410829

So TLDR curtiz is unfortunately still not working on my end :') Do you think my issues with /usr/local vs /opt/local could be related to this issue?

fasiha commented 11 months ago

Let's make sure the basics work. The following will let you clone this repo, install deps, and then check that the Curtiz MeCab wrapper is working as expected:

# optional, for fresh setup
git clone https://github.com/fasiha/curtiz-japanese-nlp.git
cd curtiz-japanese-nlp
npm install

# actual test
echo へましたらリーダーに切られるだけ | node mecabUnidic.js

This should print out a Markdown table:

# 1 parsing
| Literal  | Pron.    | Lemma Read. | Lemma           | PoS                    | Infl. Type           | Infl.                |
| -------- | -------- | ----------- | --------------- | ---------------------- | -------------------- | -------------------- |
| へま     | ヘマ     | ヘマ        | へま            | noun-common-adjectival |                      |                      |
| し       | シ       | スル        | 為る            | verb-bound             | sahen_verb_irregular | continuative-general |
| たら     | タラ     | タ          | た              | auxiliary_verb         | auxiliary-ta         | conditional-general  |
| リーダー | リーダー | リーダー    | リーダー-leader | noun-common-general    |                      |                      |
| に       | ニ       | ニ          | に              | particle-case          |                      |                      |
| 切ら     | キラ     | キル        | 切る            | verb-bound             | godan_verb-ra_column | irrealis-general     |
| れる     | レル     | レル        | れる            | auxiliary_verb         | auxiliary-reru       | attributive-general  |
| だけ     | ダケ     | ダケ        | だけ            | particle-adverbial     |                      |                      |

If that works, then Curtiz is finding MeCab. Yay!

Next, can you then run this quick demo script to test if it's finding JDepP:

var {mecabJdepp} = require('.');
mecabJdepp('へましたらリーダーに切られるだけ').then(res => console.dir(res, {depth: null}));

(If you set this up as a fresh clone of this repo, then you'll need jmdict-eng-3.1.0.json and JmdictFurigana.json in the current directory. Sorry, I foolishly made those a requirement for running JDepP, but I think you've already downloaded those.)

This should output some JSON, with morphemes and bunsetsus:

[
  {
    morphemes: [
      {
        literal: 'へま',
        pronunciation: 'ヘマ',
        lemmaReading: 'ヘマ',
        lemma: 'へま',
        partOfSpeech: [ 'noun', 'common', 'adjectival' ],
        inflectionType: null,
        inflection: null
      },
      {
        literal: 'し',
        pronunciation: 'シ',
        lemmaReading: 'スル',
        lemma: '為る',
        partOfSpeech: [ 'verb', 'bound' ],
        inflectionType: [ 'sahen_verb_irregular' ],
        inflection: [ 'continuative', 'general' ]
      },
      {
        literal: 'たら',
        pronunciation: 'タラ',
        lemmaReading: 'タ',
        lemma: 'た',
        partOfSpeech: [ 'auxiliary_verb' ],
        inflectionType: [ 'auxiliary', 'ta' ],
        inflection: [ 'conditional', 'general' ]
      },
      {
        literal: 'リーダー',
        pronunciation: 'リーダー',
        lemmaReading: 'リーダー',
        lemma: 'リーダー-leader',
        partOfSpeech: [ 'noun', 'common', 'general' ],
        inflectionType: null,
        inflection: null
      },
      {
        literal: 'に',
        pronunciation: 'ニ',
        lemmaReading: 'ニ',
        lemma: 'に',
        partOfSpeech: [ 'particle', 'case' ],
        inflectionType: null,
        inflection: null
      },
      {
        literal: '切ら',
        pronunciation: 'キラ',
        lemmaReading: 'キル',
        lemma: '切る',
        partOfSpeech: [ 'verb', 'bound' ],
        inflectionType: [ 'godan_verb', 'ra_column' ],
        inflection: [ 'irrealis', 'general' ]
      },
      {
        literal: 'れる',
        pronunciation: 'レル',
        lemmaReading: 'レル',
        lemma: 'れる',
        partOfSpeech: [ 'auxiliary_verb' ],
        inflectionType: [ 'auxiliary', 'reru' ],
        inflection: [ 'attributive', 'general' ]
      },
      {
        literal: 'だけ',
        pronunciation: 'ダケ',
        lemmaReading: 'ダケ',
        lemma: 'だけ',
        partOfSpeech: [ 'particle', 'adverbial' ],
        inflectionType: null,
        inflection: null
      }
    ],
    bunsetsus: [
      {
        morphemes: [
          {
            literal: 'へま',
            pronunciation: 'ヘマ',
            lemmaReading: 'ヘマ',
            lemma: 'へま',
            partOfSpeech: [ 'noun', 'common', 'adjectival' ],
            inflectionType: null,
            inflection: null
          },
          {
            literal: 'し',
            pronunciation: 'シ',
            lemmaReading: 'スル',
            lemma: '為る',
            partOfSpeech: [ 'verb', 'bound' ],
            inflectionType: [ 'sahen_verb_irregular' ],
            inflection: [ 'continuative', 'general' ]
          },
          {
            literal: 'たら',
            pronunciation: 'タラ',
            lemmaReading: 'タ',
            lemma: 'た',
            partOfSpeech: [ 'auxiliary_verb' ],
            inflectionType: [ 'auxiliary', 'ta' ],
            inflection: [ 'conditional', 'general' ]
          }
        ],
        idx: 0,
        parent: 2
      },
      {
        morphemes: [
          {
            literal: 'リーダー',
            pronunciation: 'リーダー',
            lemmaReading: 'リーダー',
            lemma: 'リーダー-leader',
            partOfSpeech: [ 'noun', 'common', 'general' ],
            inflectionType: null,
            inflection: null
          },
          {
            literal: 'に',
            pronunciation: 'ニ',
            lemmaReading: 'ニ',
            lemma: 'に',
            partOfSpeech: [ 'particle', 'case' ],
            inflectionType: null,
            inflection: null
          }
        ],
        idx: 1,
        parent: 2
      },
      {
        morphemes: [
          {
            literal: '切ら',
            pronunciation: 'キラ',
            lemmaReading: 'キル',
            lemma: '切る',
            partOfSpeech: [ 'verb', 'bound' ],
            inflectionType: [ 'godan_verb', 'ra_column' ],
            inflection: [ 'irrealis', 'general' ]
          },
          {
            literal: 'れる',
            pronunciation: 'レル',
            lemmaReading: 'レル',
            lemma: 'れる',
            partOfSpeech: [ 'auxiliary_verb' ],
            inflectionType: [ 'auxiliary', 'reru' ],
            inflection: [ 'attributive', 'general' ]
          },
          {
            literal: 'だけ',
            pronunciation: 'ダケ',
            lemmaReading: 'ダケ',
            lemma: 'だけ',
            partOfSpeech: [ 'particle', 'adverbial' ],
            inflectionType: null,
            inflection: null
          }
        ],
        idx: 2,
        parent: -1
      }
    ]
  }
]

Do you get something like this?
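
If the JSON is hard to eyeball, here is a small sketch that turns the `bunsetsus` array into readable dependency lines. `describeDependencies` is a hypothetical helper that is not exported by this package; it relies only on the `idx`, `parent`, and `morphemes[].literal` fields visible in the output above.

```javascript
// Hypothetical helper: join the surface forms of one bunsetsu.
function bunsetsuText(bunsetsu) {
  return bunsetsu.morphemes.map((m) => m.literal).join('');
}

// Hypothetical helper: describe which bunsetsu each one depends on.
// A parent of -1 (no matching idx) is reported as the root.
function describeDependencies(sentence) {
  const byIdx = new Map(sentence.bunsetsus.map((b) => [b.idx, b]));
  return sentence.bunsetsus.map((b) => {
    const head = byIdx.get(b.parent);
    return head
      ? `${bunsetsuText(b)} -> ${bunsetsuText(head)}`
      : `${bunsetsuText(b)} (root)`;
  });
}

// Same shape as the expected output above, keeping only the literals.
const expectedSentence = {
  bunsetsus: [
    { morphemes: [{ literal: 'へま' }, { literal: 'し' }, { literal: 'たら' }], idx: 0, parent: 2 },
    { morphemes: [{ literal: 'リーダー' }, { literal: 'に' }], idx: 1, parent: 2 },
    { morphemes: [{ literal: '切ら' }, { literal: 'れる' }, { literal: 'だけ' }], idx: 2, parent: -1 },
  ],
};

console.log(describeDependencies(expectedSentence).join('\n'));
// prints:
// へましたら -> 切られるだけ
// リーダーに -> 切られるだけ
// 切られるだけ (root)
```

With a broken setup like the one reported at the top of this issue, every line would come out as an empty string marked `(root)`, which makes the failure easy to spot.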

louismollick commented 11 months ago
Screen Shot 2023-12-23 at 1 38 52 AM

I fixed it!!! I had to modify this line in node_modules: https://github.com/fasiha/curtiz-japanese-nlp/blob/master/mecabUnidic.js#L271 to be '/usr/local/lib/mecab/dic/unidic' instead of '/opt/homebrew/lib/mecab/dic/unidic' :) so it seems like the homebrew Intel vs ARM distinction was significant! (see here: https://apple.stackexchange.com/a/410829)
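
For reference, here is a minimal sketch of how the dictionary directory could be looked up instead of hard-coded. This is not what `mecabUnidic.js` does today; `findUnidicDir` and `CANDIDATE_DICDIRS` are hypothetical names, and the two candidate paths are simply the ones that came up in this thread.

```javascript
const fs = require('fs');

// Candidate locations mentioned in this thread; other setups may differ.
const CANDIDATE_DICDIRS = [
  '/opt/homebrew/lib/mecab/dic/unidic', // Homebrew on Apple Silicon
  '/usr/local/lib/mecab/dic/unidic', // Homebrew on Intel Macs
];

// Return the first candidate that exists, or null if none do.
// `exists` is injectable so the lookup can be exercised without touching disk.
function findUnidicDir(candidates = CANDIDATE_DICDIRS, exists = fs.existsSync) {
  return candidates.find((dir) => exists(dir)) || null;
}

// Simulate an Intel Mac where only /usr/local paths exist.
const fakeIntelMac = (dir) => dir.startsWith('/usr/local/');
console.log(findUnidicDir(CANDIDATE_DICDIRS, fakeIntelMac));
// prints "/usr/local/lib/mecab/dic/unidic"
```

The resulting directory could then be handed to mecab through its `-d` option, so the same code would work under either Homebrew prefix.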

I figured it out after the first command you sent was failing:

➜  curtiz-japanese-nlp git:(master) ✗ echo へましたらリーダーに切られるだけ | node mecabUnidic.js

# 1 parsing
/Users/louismollick/curtiz-japanese-nlp/mecabUnidic.js:407
        if (header.length && header.length !== table[0].length) {
                                                        ^

TypeError: Cannot read properties of undefined (reading 'length')
    at printMarkdownTable (/Users/louismollick/curtiz-japanese-nlp/mecabUnidic.js:407:57)
    at /Users/louismollick/curtiz-japanese-nlp/mecabUnidic.js:451:21
    at Generator.next (<anonymous>)
    at fulfilled (/Users/louismollick/curtiz-japanese-nlp/mecabUnidic.js:6:58)

Node.js v20.3.0

I was able to fix that, and get the same result as you, by making the same change to '/usr/local/lib/mecab/dic/unidic'.
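
The TypeError above comes from reading `table[0].length` when the table has no rows. Here is a sketch of a guard that would surface a clearer message in that situation. `checkTableShape` is a hypothetical function, not the code in `mecabUnidic.js`, and the wording of the error is only a suggestion.

```javascript
// Hypothetical guard, not the actual code in mecabUnidic.js: validate a
// table before printing so an empty result yields a descriptive error
// instead of "Cannot read properties of undefined (reading 'length')".
function checkTableShape(header, table) {
  if (!Array.isArray(table) || table.length === 0) {
    throw new Error(
      'No rows to print: the analyzer returned no output. ' +
        'Check that MeCab can load its dictionary.'
    );
  }
  if (header.length && header.length !== table[0].length) {
    throw new Error(
      `Header has ${header.length} columns but rows have ${table[0].length}`
    );
  }
}

try {
  checkTableShape(['Literal', 'Pron.'], []); // empty table, as in the crash
} catch (err) {
  console.log(err.message);
}
// prints "No rows to print: the analyzer returned no output. Check that MeCab can load its dictionary."
```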

Jdepp node file was able to run completely fine!

Thanks for all your help!