Kimtaro / ve

A linguistic framework that's easy to use.
MIT License

Incorrect parsing of Japanese #16

Closed by vietqhoang 10 years ago

vietqhoang commented 10 years ago

Examples

Expected

2.1.1 :003 > Ve.in(:ja).words("しませんでした").collect(&:word)
 => ["しません", "でした"] 

Result

2.1.1 :003 > Ve.in(:ja).words("しませんでした").collect(&:word)
 => ["しませ", "ん", "でした"] 

Expected

2.1.1 :004 > Ve.in(:ja).words("かかわらず").collect(&:word)
 => ["かかわらず"] 

Result

2.1.1 :004 > Ve.in(:ja).words("かかわらず").collect(&:word)
 => ["かかわら", "ず"] 

Expected

2.1.1 :003 > Ve.in(:ja).words("国はそれを認めませんでした").collect(&:word)
 => ["国", "は", "それ", "を", "認めません", "でした"] 

Result

2.1.1 :003 > Ve.in(:ja).words("国はそれを認めませんでした").collect(&:word)
 => ["国", "は", "それ", "を", "認めませ", "ん", "でした"] 

Kimtaro commented 10 years ago

Hmm, here are the results I get:

irb(main):004:0> Ve.in(:ja).words("しませんでした").collect(&:word)
=> ["しません", "でした"]
irb(main):005:0> Ve.in(:ja).words("かかわらず").collect(&:word)
=> ["かかわらず"]
irb(main):006:0> Ve.in(:ja).words("国はそれを認めませんでした").collect(&:word)
=> ["国", "は", "それ", "を", "認めません", "でした"]

Same results with both ipadic and naist-jdic as the mecab dictionary.

What do you have in your /usr/local/etc/mecabrc?

Mine looks like this:

;
; Configuration file of MeCab
;
; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
;
;dicdir =  /usr/local/lib/mecab/dic/ipadic
dicdir = /usr/local/lib/mecab/dic/naist-jdic

; userdic = /home/foo/bar/user.dic

; output-format-type = wakati
; input-buffer-size = 8192

; node-format = %m\n
; bos-format = %S\n
; eos-format = EOS\n

vietqhoang commented 10 years ago

;
; Configuration file of MeCab
;
; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
;
dicdir =  /usr/local/Cellar/mecab/0.996/lib/mecab/dic/ipadic

; userdic = /home/foo/bar/user.dic

; output-format-type = wakati
; input-buffer-size = 8192

; node-format = %m\n
; bos-format = %S\n
; eos-format = EOS\n

Looks like your environment is set with the naist-jdic, while mine is set to ipadic?
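
To compare the two configurations without reading them by eye, here is a minimal Ruby sketch that picks the active dicdir out of a mecabrc, skipping the lines commented out with ";". The helper name active_dicdir is hypothetical (it is not part of Ve or MeCab), and the sample text is trimmed from the mecabrc files quoted above.

```ruby
# Hypothetical helper (not part of Ve or MeCab): returns the active dicdir
# from the text of a mecabrc file, or nil if none is set.
def active_dicdir(mecabrc_text)
  mecabrc_text.each_line do |line|
    stripped = line.strip
    # Skip blank lines and lines commented out with ';'.
    next if stripped.empty? || stripped.start_with?(";")

    key, value = stripped.split("=", 2).map(&:strip)
    return value if key == "dicdir"
  end
  nil
end

# Sample trimmed from the mecabrc quoted earlier in this thread.
sample = <<~RC
  ;dicdir =  /usr/local/lib/mecab/dic/ipadic
  dicdir = /usr/local/lib/mecab/dic/naist-jdic
  ; userdic = /home/foo/bar/user.dic
RC

puts active_dicdir(sample)
```

On a real machine this could be called as active_dicdir(File.read("/usr/local/etc/mecabrc")), assuming the file lives at the path mentioned above.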

Kimtaro commented 10 years ago

I tried with ipadic as well, with the same results. But it was on a different platform than Mac/homebrew, which it looks like you're using.

I'll try again tomorrow using the homebrew ipadic to see if I can get your results.

vietqhoang commented 10 years ago

I went ahead and removed the brew-installed mecab-ipadic and mecab.

I downloaded the mecab and mecab-ipadic source from the project's website (https://mecab.googlecode.com/svn/trunk/mecab/doc/index.html).

I followed the installation instructions, changed the character encoding to UTF-8, and rebuilt the dictionary as UTF-8.

I still get the following results

2.1.1 :002 > Ve.in(:ja).words("しませんでした").collect(&:word)
 => ["しませ", "ん", "でした"] 

mecabrc is the following

;
; Configuration file of MeCab
;
; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
;
dicdir =  /usr/local/lib/mecab/dic/ipadic

; userdic = /home/foo/bar/user.dic

; output-format-type = wakati
; input-buffer-size = 8192

; node-format = %m\n
; bos-format = %S\n
; eos-format = EOS\n

Using mecab directly from the terminal

viet$ mecab
しませんでした
し 動詞,自立,*,*,サ変・スル,連用形,する,シ,シ
ませ  助動詞,*,*,*,特殊・マス,未然形,ます,マセ,マセ
ん 助動詞,*,*,*,不変化型,基本形,ん,ン,ン
でし  助動詞,*,*,*,特殊・デス,連用形,です,デシ,デシ
た

I've also installed naist-jdic, but it looks like the dictionary is in EUC-JP. Is there a UTF-8 version available?

With naist-jdic I get the following:

しませんでした
しません?   ????,????,*,*,*,*,*
??  ̾??,??ͭ̾??,?ȿ?,*,*,*,*
??た   ????,????,*,*,*,*,*
EOS

Looks like it is parsing correctly.
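
The question marks above are what an encoding mismatch typically looks like. As a rough illustration, assuming the naist-jdic build really is EUC-JP as suspected, this Ruby sketch shows that EUC-JP bytes are not valid UTF-8 when read as-is, and that a proper conversion restores the text. It only uses Ruby's built-in String#encode and String#force_encoding; it does not touch MeCab or the dictionary files.

```ruby
# Illustration only: how EUC-JP bytes behave when treated as UTF-8.
utf8_text = "しませんでした"

# Re-encode to EUC-JP, the encoding the dictionary is suspected to use.
euc_text = utf8_text.encode("EUC-JP")

# Relabel the same bytes as UTF-8 without converting them. This mimics a
# UTF-8 terminal reading EUC-JP output.
misread = euc_text.dup.force_encoding("UTF-8")
puts misread.valid_encoding?   # false: the bytes are not valid UTF-8

# Converting (rather than relabelling) brings the original text back.
restored = euc_text.encode("UTF-8")
puts restored == utf8_text     # true
```

This is why the dictionary itself has to be rebuilt as UTF-8: the bytes need converting, not just relabelling.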

Kimtaro commented 10 years ago

Hmmmmmmmmmmmmm, still getting ok results with homebrew mecab and ipadic :)

What do you get if you do this?

require 'pp'
pp Ve.in(:ja).words("しませんでした")

Also, what version of Ve are you loading? In my Gemfile I have this to load the latest commit:

gem "ve", '0.0.3', :git => 'git://github.com/Kimtaro/ve.git', :ref => '6419334062bad5f2e283cdb01f6038f41c0e7589'

Or if you're doing it locally

$ gem build ve.gemspec 
$ gem install ve-0.0.3.gem

vietqhoang commented 10 years ago

The version of Ve was the issue.

I was running 0.0.2, which is the default when doing gem install ve or gem 've' in the Gemfile.

Building the latest commit locally fixed the issue. Referencing the latest commit in the Gemfile worked as well.

Thanks for the troubleshooting help!
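
For anyone hitting the same thing, a quick way to rule out a stale gem is to compare version strings with Gem::Version, which ships with RubyGems. This is a minimal sketch: outdated? is a hypothetical helper name, and the version numbers are the ones mentioned in this thread.

```ruby
require "rubygems"

# Hypothetical helper: true when the installed version is older than the
# version known to contain the fix.
def outdated?(installed, required)
  Gem::Version.new(installed) < Gem::Version.new(required)
end

puts outdated?("0.0.2", "0.0.3")   # true
puts outdated?("0.0.3", "0.0.3")   # false

# If Ve was loaded as a gem in the current session, its version is
# available from Gem.loaded_specs; otherwise the lookup returns nil.
spec = Gem.loaded_specs["ve"]
puts spec ? spec.version : "ve is not loaded as a gem in this session"
```

Running the last three lines in the same irb session that produced the unexpected parse would show which Ve version was actually in use.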

Kimtaro commented 10 years ago

Ah! Sorry about that! To be honest I had completely forgotten that Ve was on Rubygems already. I should definitely push the latest version on there.

Kimtaro commented 10 years ago

Pushed 0.0.3 to rubygems. https://rubygems.org/gems/ve