buruzaemon / natto-py

natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.
BSD 2-Clause "Simplified" License

Node-formatting ignored when using Unidic unless -O is set to empty string #99

Closed buruzaemon closed 6 years ago

buruzaemon commented 6 years ago

As reported by @massongit in pull request #98, node-formatting seems to be ignored by mecab when using Unidic. Please refer to taku910/mecab#41.

A workaround is to have natto-py accept an empty string as the value of the output format type option, -O.

Steps to reproduce:

  1. Install Unidic 2.1.2
  2. Execute code snippet A below and observe that natto-py does not respect the node-formatting specified, but instead uses the default node-format for Unidic
  3. Contrast code snippet A (natto-py) with B and C (using mecab from the command line)
# Snippet A
# Note that node-formatting is ignored and defaults to node-format-unidic
>>> with MeCab(r'-d /opt/mecab/lib/mecab/dic/unidic -F%m\t%t,%f[12]\n') as nm:
...     for n in nm.parse('日本語だよ、これが。', as_nodes=True):
...         print(n.feature)
...
日本    ニッポン        ニッポン        日本    名詞-固有名詞-地名-国
語      ゴ      ゴ      語      名詞-普通名詞-一般
だ      ダ      ダ      だ      助動詞  助動詞-ダ       終止形-一般
よ      ヨ      ヨ      よ      助詞-終助詞
、                      、      補助記号-読点
これ    コレ    コレ    此れ    代名詞
が      ガ      ガ      が      助詞-格助詞
。                      。      補助記号-句点
EOS
# Snippet B
# Note that node-formatting is ignored and defaults to node-format-unidic
$ echo '日本語だよ、これが。' | mecab -d /opt/mecab/lib/mecab/dic/unidic/ -F%m\\t%t,%f[12]\\n
日本    ニッポン        ニッポン        日本    名詞-固有名詞-地名-国
語      ゴ      ゴ      語      名詞-普通名詞-一般
だ      ダ      ダ      だ      助動詞  助動詞-ダ       終止形-一般
よ      ヨ      ヨ      よ      助詞-終助詞
、                      、      補助記号-読点
これ    コレ    コレ    此れ    代名詞
が      ガ      ガ      が      助詞-格助詞
。                      。      補助記号-句点
EOS
# Snippet C
# node-formatting is honored when -O is passed an empty string!
$ echo '日本語だよ、これが。' | mecab -d /opt/mecab/lib/mecab/dic/unidic/ -F%m\\t%t,%f[12]\\n -O ""
日本    2,固
語      2,漢
だ      6,和
よ      6,和
、      3,記号
これ    6,和
が      6,和
。      3,記号
EOS
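
For anyone who needs the Snippet C behavior from Python before natto-py accepts an empty -O value, here is a minimal sketch that shells out to the mecab command line instead. The helper names mecab_argv and run_mecab are hypothetical and not part of natto-py or MeCab; the flags are the ones used in Snippet C, and the sketch assumes mecab is on the PATH and that the dictionary path matches your installation.

```python
import subprocess


def mecab_argv(dicdir, node_format, output_format_type=""):
    """Build the mecab argument list used in Snippet C (hypothetical helper).

    -F is joined to its value and -O takes its value as a separate,
    possibly empty, argument, mirroring the command line in Snippet C.
    """
    return ["mecab", "-d", dicdir, "-F" + node_format, "-O", output_format_type]


def run_mecab(text, argv):
    """Pipe text to mecab and return its raw output (requires mecab installed)."""
    result = subprocess.run(
        argv,
        input=text,
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


# The raw string keeps the backslashes literal, as mecab expects in -F.
argv = mecab_argv("/opt/mecab/lib/mecab/dic/unidic/", r"%m\t%t,%f[12]\n")
# To actually run it (needs mecab and Unidic installed):
# print(run_mecab("日本語だよ、これが。", argv))
```

Passing the empty string as its own list element is the point of the sketch: it reaches mecab as a real, empty argument to -O, which is what the shell's "" does in Snippet C.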
buruzaemon commented 6 years ago

The output-format-type option is used in a dictionary's dicrc to specify a default output format type for node-formatting. For example, consider the following sample dicrc for Unidic:

output-format-type = unidic2

node-format-unidic = %m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
unk-format-unidic  = %m\t%m\t%m\t%m\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
bos-format-unidic  =
eos-format-unidic  = EOS\n

node-format-chamame = \t%m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
;unk-format-chamame = \t%m\t\t\t%m\tUNK\t\t\n
unk-format-chamame  = \t%m\t\t\t%m\t%F-[0,1,2,3]\t\t\n
bos-format-chamame  = B
eos-format-chamame  = 

node-format-unidic2 = %m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\t%f[12]\n
unk-format-unidic2  = %m\t%m\t%m\t%m\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
bos-format-unidic2  =
eos-format-unidic2  = EOS\n

Here, because output-format-type is set to unidic2, the *-format-unidic2 entries are the default formatting when no other is specified.

MeCab gives preference to output-format-type over node-format, etc., unless output-format-type is explicitly set to be empty. This behavior is consistent across ipadic, jumandic and unidic dictionaries.
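
The precedence described above can be sketched as a small Python model. This is only an illustration of the behavior observed in this issue, not MeCab's actual implementation; parse_dicrc and effective_node_format are hypothetical names, and returning None to mean "fall back to MeCab's built-in format" is an assumption of the sketch.

```python
SAMPLE_DICRC = r"""
output-format-type = unidic2
node-format-unidic = %m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
node-format-unidic2 = %m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\t%f[12]\n
"""


def parse_dicrc(text):
    """Parse 'key = value' lines; lines starting with ';' are comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def effective_node_format(dicrc, cli_node_format=None, cli_output_format_type=None):
    """Model which node format wins, per the behavior described in this issue.

    A -O value given on the command line (even an empty one) replaces the
    dicrc's output-format-type. A non-empty format type then selects
    node-format-<type> and the -F value is ignored.
    """
    if cli_output_format_type is not None:
        format_type = cli_output_format_type
    else:
        format_type = dicrc.get("output-format-type", "")
    if format_type:
        return dicrc["node-format-" + format_type]
    if cli_node_format is not None:
        return cli_node_format
    # Assumption: None stands for MeCab's built-in default format.
    return dicrc.get("node-format")


dicrc = parse_dicrc(SAMPLE_DICRC)
custom = r"%m\t%t,%f[12]\n"
ignored = effective_node_format(dicrc, cli_node_format=custom)
honored = effective_node_format(dicrc, cli_node_format=custom, cli_output_format_type="")
```

In this model, ignored is the node-format-unidic2 string from the dicrc (Snippets A and B), while honored is the custom -F string (Snippet C).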

massongit commented 6 years ago

MeCab's PR (https://github.com/taku910/mecab/pull/38) may solve this problem.

buruzaemon commented 6 years ago

I will close this issue. However, I have updated the output-format-type MeCab option description in the project wiki to describe how to override an existing, default output format by specifying an empty string.