buruzaemon / natto

A Tasty Ruby Binding with MeCab
BSD 2-Clause "Simplified" License

Inconsistent parsing results between tostr and tonodes #75

Closed himkt closed 4 years ago

himkt commented 4 years ago

Reproducing this problem requires mecab-ipadic-neologd.

@kikaineko found this bug (?).

Code

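require 'natto'

# Locate the system dictionary directory and point at the NEologd dictionary.
dicdir = `mecab-config --dicdir`.strip
neologd_path = "#{dicdir}/mecab-ipadic-neologd"
natto_with_neologd = Natto::MeCab.new(dicdir: neologd_path)

# Node parsing: passing a block yields each node in turn.
puts '#tonodes'
natto_with_neologd.parse('HM粉') do |node|
  next if node.feature =~ /EOS/
  puts node.feature
end

puts
# String parsing: without a block, parse returns the formatted output string.
puts '#tostr'
ret = natto_with_neologd.parse('HM粉')
ret.split("\n").each do |morph|
  next if morph =~ /EOS/
  puts morph.split("\t")[1]
end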

Output

#tonodes
名詞,固有名詞,一般,*,*,*,H・M,エイチエムシ,エイチエムシ
名詞,接尾,一般,*,*,*,粉,コ,コ

#tostr
名詞,固有名詞,一般,*,*,*,HM,エイチエム,エイチエム
名詞,接尾,一般,*,*,*,粉,コ,コ

buruzaemon commented 4 years ago

Thank you for bringing this issue to my attention, @himkt.

I understand that this problem occurs when using mecab-ipadic-neologd, so please give me some time to set up a test environment to investigate this.

I will get back to you as soon as possible. I appreciate your patience and cooperation!

buruzaemon commented 4 years ago

OK, I was able to reproduce this interesting error on Mac, quite easily.

I will need a bit more time to investigate; this appears to be happening at a low level.

Thank you for your continued patience!

buruzaemon commented 4 years ago

After compiling examples/example.c and then comparing the output with Natto, I think I found where this bug is. I need a bit more time to correct this, and run through all of my tests again.

Please give me a few more days; I should be releasing this fix soon.

Thank you again for your patience!

buruzaemon commented 4 years ago

During node parsing, MECAB_NBEST is set as the request_type on the lattice reference regardless of whether NBEST parsing was specified as an option. This forced NBEST parsing makes the lattice return the feature for H・M,エイチエムシ as its first choice.

Note that the mecab-ipadic-neologd dictionary has multiple entries for HM, including:

This forced request_type setting in the node parsing method will be removed.
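
As a rough illustration of the change described above, the sketch below contrasts a request type that always includes the NBEST flag with one that adds it only when an nbest option is given. This is not natto's actual source: the helper names and the `options` hash are hypothetical, and the `nbest > 1` condition is an assumption. The two flag values are the ones defined in mecab.h.

```ruby
# Request-type bit flags, with the values defined in mecab.h.
MECAB_ONE_BEST = 1
MECAB_NBEST    = 2

# Hypothetical model of the buggy behavior: the NBEST flag is
# always OR-ed in, whatever options the caller passed.
def forced_request_type(_options)
  MECAB_ONE_BEST | MECAB_NBEST
end

# Hypothetical model of the fix: the NBEST flag is added only
# when the caller asked for more than one result (assumed condition).
def conditional_request_type(options)
  type = MECAB_ONE_BEST
  type |= MECAB_NBEST if options.fetch(:nbest, 1).to_i > 1
  type
end

# Returns true when the NBEST bit is set in the given request type.
def nbest_requested?(request_type)
  (request_type & MECAB_NBEST) != 0
end
```

With no nbest option, `nbest_requested?(forced_request_type({}))` is true while `nbest_requested?(conditional_request_type({}))` is false, which mirrors the difference between the two outputs reported in this issue.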

buruzaemon commented 4 years ago

The fix has just been released as v1.2.0. Please go ahead and grab the latest natto Rubygem, and see if this fixes the bug found by @kikaineko.
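
One way to check the fix is to compare the features from node parsing and string parsing directly. The following is only a sketch: the helper names are hypothetical, and it assumes the default output format, where each line of the string result is the surface, a tab, then the feature, with a final EOS line. It accepts any object that behaves like Natto::MeCab (a `parse` method that yields nodes when given a block and returns a string otherwise).

```ruby
# Collect feature strings via node parsing (block form of parse).
def features_from_nodes(parser, text)
  features = []
  parser.parse(text) do |node|
    next if node.is_eos?
    features << node.feature
  end
  features
end

# Collect feature strings via string parsing, assuming the default
# "surface<TAB>feature" line format terminated by an EOS line.
def features_from_string(parser, text)
  parser.parse(text).split("\n").each_with_object([]) do |line, features|
    next if line.strip == 'EOS'
    _surface, feature = line.split("\t", 2)
    features << feature
  end
end

# True when both parsing modes report the same features in the same order.
def consistent_parse?(parser, text)
  features_from_nodes(parser, text) == features_from_string(parser, text)
end
```

With natto and mecab-ipadic-neologd installed, calling `consistent_parse?(Natto::MeCab.new(dicdir: neologd_path), 'HM粉')` with the `neologd_path` from the reproduction code should show whether the two modes now agree.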

himkt commented 4 years ago

Thank you for fixing!

buruzaemon commented 4 years ago

Thanks to @himkt and @kikaineko for finding this bug! I look forward to working with you again.