Closed louismollick closed 10 months ago
Thanks for checking out my library! I really like Ichiran and I'm working on a way to merge MeCab (via Curtiz) and Ichiran but I do know what you mean about an all-Node solution being ideal 😅
I think what's happening is that your `mecab` isn't using Unidic. Can you share the output of `mecab -D`? Here's what it prints on mine:
```
$ mecab -D
filename: /opt/homebrew/lib/mecab/dic/unidic/sys.dic
version: 102
charset: utf8
type: 0
size: 756463
left size: 5981
right size: 5981
```
Notice the `filename`. When you invoke `mecab` as is, it loads the dictionary specified in `/opt/homebrew/etc/mecabrc`, which for me is:
```
;
; Configuration file of MeCab
;
; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
;
;dicdir = /opt/homebrew/lib/mecab/dic/ipadic
dicdir = /opt/homebrew/lib/mecab/dic/unidic
; userdic = /home/foo/bar/user.dic
; output-format-type = wakati
; input-buffer-size = 8192
; node-format = %m\n
; bos-format = %S\n
; eos-format = EOS\n
```
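To check which dictionary a given `mecabrc` selects without reading it by eye, a small helper along these lines could work. This is only a sketch: `parseMecabrcDicdir` is a hypothetical name, and it assumes `;` starts a comment line, as in the file above.

```javascript
// Sketch: extract the active `dicdir` from the text of a mecabrc file.
// Assumption: lines starting with ';' are comments, as in the file shown above.
function parseMecabrcDicdir(text) {
  for (const rawLine of text.split('\n')) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith(';')) continue; // skip blanks and comments
    const match = line.match(/^dicdir\s*=\s*(.+)$/);
    if (match) return match[1].trim(); // first uncommented dicdir wins
  }
  return null; // no uncommented dicdir entry found
}

// A few lines copied from the mecabrc above.
const sampleMecabrc = [
  ';dicdir = /opt/homebrew/lib/mecab/dic/ipadic',
  'dicdir = /opt/homebrew/lib/mecab/dic/unidic',
  '; userdic = /home/foo/bar/user.dic',
].join('\n');

console.log(parseMecabrcDicdir(sampleMecabrc)); // /opt/homebrew/lib/mecab/dic/unidic
```

In practice the text would come from something like `fs.readFileSync('/opt/homebrew/etc/mecabrc', 'utf8')`, with the path adjusted for your install.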
Therefore when I run MeCab on your example, I get very different output:
```
$ echo へましたらリーダーに切られるだけ | mecab
へま ヘマ ヘマ へま 名詞-普通名詞-形状詞可能
し シ スル 為る 動詞-非自立可能 サ行変格 連用形-一般
たら タラ タ た 助動詞 助動詞-タ 仮定形-一般
リーダー リーダー リーダー リーダー-leader 名詞-普通名詞-一般
に ニ ニ に 助詞-格助詞
切ら キラ キル 切る 動詞-非自立可能 五段-ラ行 未然形-一般
れる レル レル れる 助動詞 助動詞-レル 連体形-一般
だけ ダケ ダケ だけ 助詞-副助詞
EOS
```
Notice how Unidic outputs tab-separated columns, whereas your `jdepp` output shows comma-separated values, which reminds me of IPADIC.
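As a rough way to tell the two layouts apart programmatically, one could count tab-separated fields on a morpheme line. This is a heuristic sketch only: `guessMecabOutputStyle` is a hypothetical helper, and the IPADIC-style sample line is made up for illustration rather than taken from real output.

```javascript
// Heuristic sketch: classify one line of MeCab output by its separators.
// Assumption: Unidic-style lines (as in this thread) are fully tab-separated,
// while IPADIC-style lines are "surface<TAB>comma,separated,features".
function guessMecabOutputStyle(line) {
  const fields = line.split('\t');
  if (fields.length >= 3) return 'tab-separated';
  if (fields.length === 2 && fields[1].includes(',')) return 'comma-separated';
  return 'unknown'; // e.g. the EOS marker
}

// First line is from the Unidic output above (with real tabs);
// second line is a hypothetical IPADIC-style line.
console.log(guessMecabOutputStyle('へま\tヘマ\tヘマ\tへま\t名詞-普通名詞-形状詞可能'));
console.log(guessMecabOutputStyle('へま\t名詞,一般,*,*,*,*,へま,ヘマ,ヘマ'));
console.log(guessMecabOutputStyle('EOS'));
```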
Here's what `jdepp` outputs for me:
```
$ echo へましたらリーダーに切られるだけ|mecab | jdepp
(input: STDIN [-I 0])
# S-ID: 1; J.DepP
* 0 2D
へま ヘマ ヘマ へま 名詞-普通名詞-形状詞可能
し シ スル 為る 動詞-非自立可能 サ行変格 連用形-一般
たら タラ タ た 助動詞 助動詞-タ 仮定形-一般
* 1 2D
リーダー リーダー リーダー リーダー-leader 名詞-普通名詞-一般
に ニ ニ に 助詞-格助詞
* 2 -1D
切ら キラ キル 切る 動詞-非自立可能 五段-ラ行 未然形-一般
れる レル レル れる 助動詞 助動詞-レル 連体形-一般
だけ ダケ ダケ だけ 助詞-副助詞
EOS
```
Note that I built JDepP with Unidic support.
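For reference, the `* 0 2D` lines in that output are bunsetsu headers: the first number is the bunsetsu index and the second is the index of the bunsetsu it depends on (`-1` for the root). A minimal sketch of grouping the morpheme lines under those headers might look like the following; `parseJdeppOutput` is a hypothetical helper, not necessarily how Curtiz does it.

```javascript
// Sketch: group JDepP output lines into bunsetsus.
// Assumption: headers look like "* <idx> <parent>D" and morpheme lines are
// tab-separated, as in the output shown above.
function parseJdeppOutput(text) {
  const bunsetsus = [];
  for (const line of text.split('\n')) {
    if (line === '' || line === 'EOS' || line.startsWith('# ')) continue;
    const header = line.match(/^\* (\d+) (-?\d+)D/);
    if (header) {
      bunsetsus.push({ idx: Number(header[1]), parent: Number(header[2]), morphemes: [] });
    } else if (bunsetsus.length > 0) {
      // Anything before the first header (e.g. the "(input: ...)" banner) is ignored.
      bunsetsus[bunsetsus.length - 1].morphemes.push(line.split('\t'));
    }
  }
  return bunsetsus;
}

const sampleJdepp = [
  '(input: STDIN [-I 0])',
  '# S-ID: 1; J.DepP',
  '* 0 2D',
  'へま\tヘマ\tヘマ\tへま\t名詞-普通名詞-形状詞可能',
  'し\tシ\tスル\t為る\t動詞-非自立可能\tサ行変格\t連用形-一般',
  'たら\tタラ\tタ\tた\t助動詞\t助動詞-タ\t仮定形-一般',
  '* 1 2D',
  'リーダー\tリーダー\tリーダー\tリーダー-leader\t名詞-普通名詞-一般',
  'に\tニ\tニ\tに\t助詞-格助詞',
  '* 2 -1D',
  '切ら\tキラ\tキル\t切る\t動詞-非自立可能\t五段-ラ行\t未然形-一般',
  'れる\tレル\tレル\tれる\t助動詞\t助動詞-レル\t連体形-一般',
  'だけ\tダケ\tダケ\tだけ\t助詞-副助詞',
  'EOS',
].join('\n');

for (const b of parseJdeppOutput(sampleJdepp)) {
  console.log(`bunsetsu ${b.idx} -> parent ${b.parent}: ${b.morphemes.map((m) => m[0]).join('')}`);
}
```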
Let me know if the above is helpful!
So after much agony, I finally installed mecab & jdepp with unidic support, but I'm still having the same issue when running curtiz `demo.ts`:
```
[
  {
    furigana: [],
    hits: [],
    kanjidic: {
      '切': ....etc....
    }
]
```
Yet I've managed to get mecab to use the correct unidic:
```
mecab -D
filename: /usr/local/lib/mecab/dic/unidic/sys.dic
version: 102
charset: utf8
type: 0
size: 756463
left size: 5981
right size: 5981
```
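If you want to check this from a script rather than by eye, the `key: value` lines printed by `mecab -D` are easy to turn into an object. A small sketch follows; `parseMecabDictInfo` is a hypothetical name, and looking for `unidic` in the path is only a naming heuristic.

```javascript
// Sketch: parse the "key: value" lines printed by `mecab -D`.
function parseMecabDictInfo(text) {
  const info = {};
  for (const line of text.split('\n')) {
    const match = line.match(/^([^:]+):\s*(.*)$/);
    if (match) info[match[1].trim()] = match[2].trim();
  }
  return info;
}

// Output copied from the `mecab -D` run above.
const sampleDictInfo = parseMecabDictInfo([
  'filename: /usr/local/lib/mecab/dic/unidic/sys.dic',
  'version: 102',
  'charset: utf8',
  'type: 0',
  'size: 756463',
  'left size: 5981',
  'right size: 5981',
].join('\n'));

// Heuristic only: this relies on the directory being named "unidic".
console.log(sampleDictInfo.filename.includes('unidic'));
```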
And here's a log of me building jdepp with unidic support like this:
```
./configure --with-mecab-dict=UNI && make model && make install
```
And I am able to get the same result as you for:
```
echo へましたらリーダーに切られるだけ|mecab | jdepp
```
See this file: jdepp-build.txt
The main "difference"(?) / challenge I faced was that I ran `brew install mecab` and `brew install mecab-unidic`, and updated `mecabrc`:
```
cat /usr/local/etc/mecabrc
;
; Configuration file of MeCab
;
; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
;
; dicdir = /usr/local/lib/mecab/dic/ipadic
dicdir = /usr/local/lib/mecab/dic/unidic
; userdic = /home/foo/bar/user.dic
; output-format-type = wakati
; input-buffer-size = 8192
; node-format = %m\n
; bos-format = %S\n
; eos-format = EOS\n
```
But for some reason, by default `which mecab` and `which mecab-config` pointed to `/opt/local/bin/mecab` and `/opt/local/bin/mecab-config`, DESPITE `mecab-unidic` only being installed to `/usr/local/lib/mecab/dic/unidic`.
So when jdepp was building, it was giving me something like:
```
checking for mecab... /opt/local/bin/mecab
no such file or directory: /opt/local/bin/mecab
```
So my solution was just to delete `/opt/local/bin/mecab` and `/opt/local/bin/mecab-config`, and now `which mecab` gives `/usr/local/bin/mecab`. After that, jdepp's `./configure` was able to succeed.
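A shadowing problem like this can be spotted by listing every match on `PATH` in order, not just the first one. The sketch below does that with an injectable existence check so it can be tried without touching the real filesystem; `findAllOnPath` and the fake file list are hypothetical.

```javascript
// Sketch: list every location on a PATH-like string where `command` exists.
// The first hit is the one a plain `which <command>` would report.
// Assumption: POSIX-style PATH (':'-separated), as on macOS.
function findAllOnPath(command, pathVariable, exists) {
  return pathVariable
    .split(':')
    .filter((dir) => dir !== '')
    .map((dir) => `${dir}/${command}`)
    .filter((candidate) => exists(candidate));
}

// Fake filesystem for illustration: two copies of mecab, one under /opt/local.
const fakeFiles = new Set(['/opt/local/bin/mecab', '/usr/local/bin/mecab']);
const hits = findAllOnPath(
  'mecab',
  '/opt/local/bin:/usr/local/bin:/usr/bin',
  (candidate) => fakeFiles.has(candidate)
);
console.log(hits); // the /opt/local copy comes first and shadows /usr/local
```

Against a real system you could pass `process.env.PATH` and `require('fs').existsSync`; note that `existsSync` only checks existence, not whether the file is executable.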
BTW I think the reason why my installation goes to `/usr/local` is because I'm using an Intel Mac instead of an ARM one: https://apple.stackexchange.com/a/410829
So TLDR curtiz is unfortunately still not working on my end :') Do you think my issues with /usr/local vs /opt/local could be related to this issue?
Let's make sure the basics work. The following will let you clone this repo, install deps, and then check that the Curtiz MeCab wrapper is working as expected:
```
# optional, for fresh setup
git clone https://github.com/fasiha/curtiz-japanese-nlp.git
cd curtiz-japanese-nlp
npm install
# actual test
echo へましたらリーダーに切られるだけ | node mecabUnidic.js
```
This should print out a Markdown table:
```
# 1 parsing
| Literal  | Pron.    | Lemma Read. | Lemma           | PoS                    | Infl. Type           | Infl.                |
| -------- | -------- | ----------- | --------------- | ---------------------- | -------------------- | -------------------- |
| へま     | ヘマ     | ヘマ        | へま            | noun-common-adjectival |                      |                      |
| し       | シ       | スル        | 為る            | verb-bound             | sahen_verb_irregular | continuative-general |
| たら     | タラ     | タ          | た              | auxiliary_verb         | auxiliary-ta         | conditional-general  |
| リーダー | リーダー | リーダー    | リーダー-leader | noun-common-general    |                      |                      |
| に       | ニ       | ニ          | に              | particle-case          |                      |                      |
| 切ら     | キラ     | キル        | 切る            | verb-bound             | godan_verb-ra_column | irrealis-general     |
| れる     | レル     | レル        | れる            | auxiliary_verb         | auxiliary-reru       | attributive-general  |
| だけ     | ダケ     | ダケ        | だけ            | particle-adverbial     |                      |                      |
```
If that works, then Curtiz is finding MeCab. Yay!
Next, can you run this quick demo script to test whether it's finding JDepP:
```
var {mecabJdepp} = require('.');
mecabJdepp('へましたらリーダーに切られるだけ').then(res => console.dir(res, {depth: null}));
```
(If you set this up as a fresh clone of this repo, then you'll need `jmdict-eng-3.1.0.json` and `JmdictFurigana.json` in the current directory. Sorry, I foolishly made those a requirement for running JDepP, but I think you've already downloaded those.)
This should output some JSON, with `morphemes` and `bunsetsus`:
```
[
  {
    morphemes: [
      {
        literal: 'へま',
        pronunciation: 'ヘマ',
        lemmaReading: 'ヘマ',
        lemma: 'へま',
        partOfSpeech: [ 'noun', 'common', 'adjectival' ],
        inflectionType: null,
        inflection: null
      },
      {
        literal: 'し',
        pronunciation: 'シ',
        lemmaReading: 'スル',
        lemma: '為る',
        partOfSpeech: [ 'verb', 'bound' ],
        inflectionType: [ 'sahen_verb_irregular' ],
        inflection: [ 'continuative', 'general' ]
      },
      {
        literal: 'たら',
        pronunciation: 'タラ',
        lemmaReading: 'タ',
        lemma: 'た',
        partOfSpeech: [ 'auxiliary_verb' ],
        inflectionType: [ 'auxiliary', 'ta' ],
        inflection: [ 'conditional', 'general' ]
      },
      {
        literal: 'リーダー',
        pronunciation: 'リーダー',
        lemmaReading: 'リーダー',
        lemma: 'リーダー-leader',
        partOfSpeech: [ 'noun', 'common', 'general' ],
        inflectionType: null,
        inflection: null
      },
      {
        literal: 'に',
        pronunciation: 'ニ',
        lemmaReading: 'ニ',
        lemma: 'に',
        partOfSpeech: [ 'particle', 'case' ],
        inflectionType: null,
        inflection: null
      },
      {
        literal: '切ら',
        pronunciation: 'キラ',
        lemmaReading: 'キル',
        lemma: '切る',
        partOfSpeech: [ 'verb', 'bound' ],
        inflectionType: [ 'godan_verb', 'ra_column' ],
        inflection: [ 'irrealis', 'general' ]
      },
      {
        literal: 'れる',
        pronunciation: 'レル',
        lemmaReading: 'レル',
        lemma: 'れる',
        partOfSpeech: [ 'auxiliary_verb' ],
        inflectionType: [ 'auxiliary', 'reru' ],
        inflection: [ 'attributive', 'general' ]
      },
      {
        literal: 'だけ',
        pronunciation: 'ダケ',
        lemmaReading: 'ダケ',
        lemma: 'だけ',
        partOfSpeech: [ 'particle', 'adverbial' ],
        inflectionType: null,
        inflection: null
      }
    ],
    bunsetsus: [
      {
        morphemes: [
          {
            literal: 'へま',
            pronunciation: 'ヘマ',
            lemmaReading: 'ヘマ',
            lemma: 'へま',
            partOfSpeech: [ 'noun', 'common', 'adjectival' ],
            inflectionType: null,
            inflection: null
          },
          {
            literal: 'し',
            pronunciation: 'シ',
            lemmaReading: 'スル',
            lemma: '為る',
            partOfSpeech: [ 'verb', 'bound' ],
            inflectionType: [ 'sahen_verb_irregular' ],
            inflection: [ 'continuative', 'general' ]
          },
          {
            literal: 'たら',
            pronunciation: 'タラ',
            lemmaReading: 'タ',
            lemma: 'た',
            partOfSpeech: [ 'auxiliary_verb' ],
            inflectionType: [ 'auxiliary', 'ta' ],
            inflection: [ 'conditional', 'general' ]
          }
        ],
        idx: 0,
        parent: 2
      },
      {
        morphemes: [
          {
            literal: 'リーダー',
            pronunciation: 'リーダー',
            lemmaReading: 'リーダー',
            lemma: 'リーダー-leader',
            partOfSpeech: [ 'noun', 'common', 'general' ],
            inflectionType: null,
            inflection: null
          },
          {
            literal: 'に',
            pronunciation: 'ニ',
            lemmaReading: 'ニ',
            lemma: 'に',
            partOfSpeech: [ 'particle', 'case' ],
            inflectionType: null,
            inflection: null
          }
        ],
        idx: 1,
        parent: 2
      },
      {
        morphemes: [
          {
            literal: '切ら',
            pronunciation: 'キラ',
            lemmaReading: 'キル',
            lemma: '切る',
            partOfSpeech: [ 'verb', 'bound' ],
            inflectionType: [ 'godan_verb', 'ra_column' ],
            inflection: [ 'irrealis', 'general' ]
          },
          {
            literal: 'れる',
            pronunciation: 'レル',
            lemmaReading: 'レル',
            lemma: 'れる',
            partOfSpeech: [ 'auxiliary_verb' ],
            inflectionType: [ 'auxiliary', 'reru' ],
            inflection: [ 'attributive', 'general' ]
          },
          {
            literal: 'だけ',
            pronunciation: 'ダケ',
            lemmaReading: 'ダケ',
            lemma: 'だけ',
            partOfSpeech: [ 'particle', 'adverbial' ],
            inflectionType: null,
            inflection: null
          }
        ],
        idx: 2,
        parent: -1
      }
    ]
  }
]
```
Do you get something like this?
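To compare without reading the whole dump, a tiny summary of the result can help. The helper below is a hypothetical sketch that assumes the result has the shape printed above (an array of objects with `morphemes` and `bunsetsus`); a healthy run on this sentence would show 8 morphemes and 3 non-empty bunsetsus.

```javascript
// Sketch: summarize a mecabJdepp-style result so problems stand out.
// Assumption: `sentences` has the shape printed above.
function summarizeParse(sentences) {
  return sentences.map((sentence) => {
    const bunsetsus = sentence.bunsetsus || [];
    return {
      morphemeCount: (sentence.morphemes || []).length,
      bunsetsuCount: bunsetsus.length,
      emptyBunsetsus: bunsetsus.filter((b) => (b.morphemes || []).length === 0).length,
    };
  });
}

// Illustrative input mimicking a broken run: bunsetsus exist but hold no morphemes.
const brokenRun = [
  {
    morphemes: [],
    bunsetsus: [
      { morphemes: [], idx: 0, parent: -1 },
      { morphemes: [], idx: 0, parent: -1 },
    ],
  },
];
console.log(summarizeParse(brokenRun));
```

It could be chained onto the demo call, e.g. `mecabJdepp(text).then(res => console.log(summarizeParse(res)))`.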
I fixed it!!! I had to modify this line in `node_modules`: https://github.com/fasiha/curtiz-japanese-nlp/blob/master/mecabUnidic.js#L271 to be `'/usr/local/lib/mecab/dic/unidic'` instead of `'/opt/homebrew/lib/mecab/dic/unidic'` :)
So it seems like the Homebrew Intel vs ARM distinction was significant! (See here: https://apple.stackexchange.com/a/410829)
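One way to avoid hand-editing the path would be to probe the known locations and use the first that exists. The following is only a sketch of that idea, not what the library currently does; `pickDicdir` and `CANDIDATE_DICDIRS` are hypothetical names, and the two paths are the ones mentioned in this thread.

```javascript
// Sketch: choose the first dictionary directory that exists.
const CANDIDATE_DICDIRS = [
  '/opt/homebrew/lib/mecab/dic/unidic', // Homebrew on Apple Silicon (ARM)
  '/usr/local/lib/mecab/dic/unidic', // Homebrew on Intel Macs
];

// `exists` is injected so the logic can be tried without a real filesystem.
function pickDicdir(candidates, exists) {
  return candidates.find((dir) => exists(dir)) ?? null;
}

// Simulate an Intel Mac, where only the /usr/local tree is present.
const onIntel = (dir) => dir.startsWith('/usr/local/');
console.log(pickDicdir(CANDIDATE_DICDIRS, onIntel)); // /usr/local/lib/mecab/dic/unidic
```

With a real filesystem, `require('fs').existsSync` can be passed as `exists`.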
I figured it out after the first command you sent was failing:
```
➜ curtiz-japanese-nlp git:(master) ✗ echo へましたらリーダーに切られるだけ | node mecabUnidic.js
# 1 parsing
/Users/louismollick/curtiz-japanese-nlp/mecabUnidic.js:407
if (header.length && header.length !== table[0].length) {
^
TypeError: Cannot read properties of undefined (reading 'length')
at printMarkdownTable (/Users/louismollick/curtiz-japanese-nlp/mecabUnidic.js:407:57)
at /Users/louismollick/curtiz-japanese-nlp/mecabUnidic.js:451:21
at Generator.next (<anonymous>)
at fulfilled (/Users/louismollick/curtiz-japanese-nlp/mecabUnidic.js:6:58)
Node.js v20.3.0
```
I was able to fix that, and get the same result as you, by making the same change to `'/usr/local/lib/mecab/dic/unidic'`.
The JDepP Node script then ran completely fine!
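The `TypeError` in that trace comes from reading `table[0].length` when `table[0]` is undefined, which suggests the table was empty, presumably because MeCab produced no rows with the wrong dictionary path. A defensive version of such a formatter could fail with a clearer message. The sketch below is a hypothetical stand-in, not the library's `printMarkdownTable`.

```javascript
// Sketch: format rows as a Markdown table, failing clearly on empty input.
// Hypothetical helper; it returns a string instead of printing.
function formatMarkdownTable(table, header = []) {
  if (table.length === 0) {
    throw new Error('MeCab returned no rows; check that the dictionary directory exists');
  }
  if (header.length && header.length !== table[0].length) {
    throw new Error('header length does not match row length');
  }
  const rows = header.length ? [header, header.map(() => '---'), ...table] : table;
  return rows.map((row) => `| ${row.join(' | ')} |`).join('\n');
}

console.log(formatMarkdownTable([['へま', 'ヘマ'], ['し', 'シ']], ['Literal', 'Pron.']));
```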
Thanks for all your help!
Hi!
First off, thank you very much for all your work on this package! I'm trying to make an app which provides word-by-word translation for Japanese sentences, and most existing solutions don't offer this "all-in-one" functionality (tokenization + dict), except for Ichiran, for which I'm currently running a Docker instance. That's not ideal, since I'd prefer a Node.js package instead.
So needless to say, I'm interested in getting this package to work for me!
Now for the "bug": I only get results in `kanjidic` (empty arrays in `furigana` / `hits`) for the example phrase `へましたらリーダーに切られるだけ`. I get this same output for both:
- `handleSentence` code

The output is below (`curtiz.json`):
```
[
  {
    "furigana": [],
    "hits": [],
    "kanjidic": {
      "切": {
        "nanori": [ "きつ", "きり", "ぎり" ],
        "readings": [ "セツ", "サイ", "き.る", "-き.る", "き.り", "-き.り", "-ぎ.り", "き.れる", "-き.れる", "き.れ", "-き.れ", "-ぎ.れ" ],
        "meanings": [ "cut", "cutoff", "be sharp" ],
        "literal": "切",
        "dependencies": [
          {
            "node": "七",
            "nodeMapped": {
              "nanori": [ "し", "しっ", "な", "ひち" ],
              "readings": [ "シチ", "なな", "なな.つ", "なの" ],
              "meanings": [ "seven" ],
              "literal": "七"
            },
            "children": []
          },
          {
            "node": "刀",
            "nodeMapped": {
              "nanori": [ "き", "ち", "と", "わき" ],
              "readings": [ "トウ", "かたな", "そり" ],
              "meanings": [ "sword", "saber", "knife" ],
              "literal": "刀"
            },
            "children": []
          }
        ]
      }
    },
    "bunsetsus": [
      { "morphemes": [], "idx": 0, "parent": -1 },
      { "morphemes": [], "idx": 0, "parent": -1 }
    ]
  }
]
```
I am using the 3 files mentioned in the README.
And for the dependencies I did:
```
brew install mecab
brew install mecab-unidic
sudo port install jdepp
```
as per https://ports.macports.org/port/jdepp/ and the instructions for Mac OS on https://www.tkl.iis.u-tokyo.ac.jp/~ynaga/jdepp/

I am running Node v20.3.0 and macOS 12.6.
Things I've tried:
- Using `jmdict-eng-3.1.0.json` instead of `jmdict-eng-3.5.0.json` (and of course deleting the previously created `jmdict-simplified` DB directory so that it gets recreated), and got the same output.

Below is an image of my working directory; maybe something looks off?
Thanks!