buruzaemon / natto-py

natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.
BSD 2-Clause "Simplified" License

Enable to specify an empty string option #98

Closed massongit closed 6 years ago

massongit commented 6 years ago

(Related to https://github.com/taku910/mecab/issues/41) I made it possible to pass an empty string as an option value, so that the node-format option can be specified when using UniDic.

massongit commented 6 years ago

I would like to write a test for this change, but I am not sure where in tests/test_option_parse.py it should go. Could you point me to the right place?

buruzaemon commented 6 years ago

Thank you @massongit for bringing this issue to my attention. I will first confirm this and then open up an issue ticket. Please give me some time to look into this.

buruzaemon commented 6 years ago

OK, this was easy enough to confirm.

I have opened up issue #99 to track this. I will start by coming up with appropriate tests, hopefully for both Windows and UNIX-type platforms. I don't have any tests for dictionaries besides ipadic, so I will need some time to come up with something that can cover Unidic, and perhaps Jumandic as well.

buruzaemon commented 6 years ago

@massongit, thank you for your patience. Here is what I have found:

  1. MeCab gives preference to output-format-type over node-format, etc.
  2. But if you explicitly override this behavior by unsetting output-format-type (specifying an empty string), node-format will then be used.

This behavior of MeCab is consistent across ipadic, jumandic and unidic, and is not a function of the dictionary used.

I expect that your Unidic dicrc has the following lines:

output-format-type = unidic

node-format-unidic = %m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n

That means that unless you explicitly unset output-format-type by passing MeCab an empty string/name with -O "", the node format will default to node-format-unidic even if you also pass -F. If you comment out output-format-type = unidic in your dicrc, then you will see that you don't need -O "".
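To make the precedence rule concrete, here is a minimal Python sketch of the behavior described above. It is only a model of the rule, not MeCab's or natto-py's actual code: the function `effective_node_format` and the dict representation of dicrc are hypothetical names introduced for illustration.

```python
def effective_node_format(dicrc, output_format_type=None, node_format=None):
    """Model which node format applies under the rule described above.

    dicrc              -- dict of dicrc settings (hypothetical representation)
    output_format_type -- value given with -O, or None if -O was not passed
    node_format        -- value given with -F, or None if -F was not passed
    """
    # A value given with -O replaces the dicrc setting; "" explicitly unsets it.
    if output_format_type is None:
        fmt_type = dicrc.get("output-format-type", "")
    else:
        fmt_type = output_format_type

    if fmt_type:
        # A named output format type takes precedence over -F.
        return dicrc.get("node-format-" + fmt_type)

    # No output format type in effect, so -F (if given) is used.
    return node_format


# The two dicrc lines quoted above, as a dict.
unidic_dicrc = {
    "output-format-type": "unidic",
    "node-format-unidic": r"%m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n",
}
custom = r"%m\t%f[0]\n"

# -F alone: the dicrc's output-format-type still wins.
print(effective_node_format(unidic_dicrc, node_format=custom))

# -O "" together with -F: the custom node format is used.
print(effective_node_format(unidic_dicrc, output_format_type="", node_format=custom))
```

Removing the `"output-format-type"` key from the dict corresponds to commenting that line out in dicrc; in that case the sketch returns the `-F` value without needing `-O ""`.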

You are correct that natto-py must likewise be able to accept -O "" in order to mirror this behavior.
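As an illustration of why an empty-string argument needs care, the sketch below tokenizes an option string with Python's standard `shlex` module. This is not natto-py's actual option parser, only a demonstration that the empty value in `-O ""` survives shell-style tokenization but not a naive whitespace split.

```python
import shlex

option_string = r'-O "" -F "%m,%f[0]\n"'

# Shell-style tokenization keeps the empty string as its own argument.
shell_tokens = shlex.split(option_string)
print(shell_tokens)

# A naive whitespace split leaves the quote characters in place instead,
# so the value after -O is the two-character string '""', not ''.
naive_tokens = option_string.split()
print(naive_tokens)

# Pair each flag with its value; the empty string is a legitimate value.
parsed = dict(zip(shell_tokens[0::2], shell_tokens[1::2]))
print(parsed["-O"] == "")
```

Any parser that treats an empty value as "option not given" would drop `-O ""` at this point, which is the case the pull request addresses.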

Hence, I will be accepting your pull request. Thank you very much! I will come up with some unit tests to cover this new behavior.

massongit commented 6 years ago

@buruzaemon Thank you for confirming and merging!