buruzaemon / natto-py

natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.
BSD 2-Clause "Simplified" License

How to switch between two dictionaries? #97

Closed: technolingo closed this issue 6 years ago

technolingo commented 7 years ago

I'm using MeCab to process both Japanese and Korean texts, and I have two dictionary files. How do I specify a particular dictionary when instantiating MeCab in a function? Thank you!

buruzaemon commented 7 years ago

Hello, @evilplanet, and thanks for using natto-py...

If you mean to use 2 dictionary files separately, then you could create 2 separate instances of MeCab.

You can specify a system dictionary with the mecab option --dicdir or a custom user dictionary with --userdic. Pass in these options when you instantiate MeCab.

technolingo commented 6 years ago

@buruzaemon Thank you for your prompt reply. Sorry, I'm a bit new to this. Below is my code. Let's say I want to use the dictionary installed at /usr/local/lib/mecab/dic/mecab-ko-dic/sys.dic here (my default Japanese dictionary is /usr/local/lib/mecab/dic/ipadic/sys.dic). How should I do it?

from natto import MeCab

with MeCab('-Owakati') as nm:
    segmented_text = nm.parse(text)

Thank you very much!!

technolingo commented 6 years ago

Below is my /usr/local/etc/mecabrc file:

; Configuration file of MeCab
;
; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
;
dicdir =  /usr/local/lib/mecab/dic/ipadic

; userdic = /usr/local/lib/mecab/dic/mecab-ko-dic

; output-format-type = wakati
; input-buffer-size = 8192

; node-format = %m\n
; bos-format = %S\n
; eos-format = EOS\n

When I uncomment the userdic line, I can no longer instantiate MeCab.

buruzaemon commented 6 years ago

natto-py honors nearly the exact same options as when you use mecab from the command-line.

So the following two approaches are equivalent:

# mecab from command-line
mecab --dicdir=/usr/local/lib/mecab/dic/mecab-ko-dic

# with natto-py
nm = MeCab('--dicdir=/usr/local/lib/mecab/dic/mecab-ko-dic')

It is a good idea to try out your choice of options first at the mecab command-line before using them in instantiating MeCab() with natto-py.

Hope that helps!
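
As a minimal sketch of the "2 separate instances" approach suggested earlier, assuming the two dictionary directories mentioned in this thread, one MeCab instance can be created per language. The names DICDIRS, mecab_options and make_taggers are hypothetical helpers for illustration and are not part of natto-py; adjust the paths to your own installation.

```python
# Dictionary directories as quoted in this thread (assumed install locations).
DICDIRS = {
    'ja': '/usr/local/lib/mecab/dic/ipadic',
    'ko': '/usr/local/lib/mecab/dic/mecab-ko-dic',
}


def mecab_options(lang, wakati=True):
    """Build the option string for one language (hypothetical helper)."""
    opts = ['--dicdir=' + DICDIRS[lang]]
    if wakati:
        # Same output format option used elsewhere in this thread.
        opts.append('-Owakati')
    return ' '.join(opts)


def make_taggers():
    """Create one MeCab instance per language.

    natto is imported lazily so that the option helper above can be
    used and checked even where natto-py or MeCab is not installed.
    """
    from natto import MeCab
    return {lang: MeCab(mecab_options(lang)) for lang in DICDIRS}
```

With both dictionaries installed, `taggers = make_taggers()` followed by `taggers['ko'].parse(text)` or `taggers['ja'].parse(text)` would pick the dictionary per call.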

technolingo commented 6 years ago

It certainly helps! I wasn't sure how to set the dictionary path and the other option together. Then I tried chaining them in a single string, like this: MeCab('--dicdir=/usr/local/lib/mecab/dic/mecab-ko-dic -Owakati'), and it seems to be working. Thanks a lot!

buruzaemon commented 6 years ago

@evilplanet, I am glad to hear that you were able to correctly use both -O and --dicdir together.

natto-py is meant to accept options in the same manner as they are passed to the mecab command-line, so that it is as familiar as possible. That is why you have to specify all of the options at instantiation time.

Besides using a single, long options string, you can also pass the options as key-value pairs in a dict.

Alternatively, if you find that you have a lot of options which you want to manage in a custom configuration file like your mecabrc file, you could use --rcfile.

For example:

# a single, long string
MeCab('--dicdir=/usr/local/lib/mecab/dic/mecab-ko-dic -Owakati')

# a dict with key-values
MeCab({ 'dicdir': '/usr/local/lib/mecab/dic/mecab-ko-dic',
        'output_format_type': 'wakati' })

# or if you want to put all of your options in an rcfile
MeCab('--rcfile=/path/to/custom/rcfile')
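
For the --rcfile route, a custom rcfile can follow the same "key = value" layout as the mecabrc file quoted earlier in this thread. The sketch below writes such a file; write_rcfile is a hypothetical helper, not part of natto-py or MeCab, and it only covers the two settings discussed here.

```python
def write_rcfile(path, dicdir, output_format_type=None):
    """Write a minimal MeCab rcfile (hypothetical helper).

    Uses the same 'key = value' lines seen in the mecabrc file
    quoted in this thread: dicdir and output-format-type.
    """
    lines = ['dicdir = ' + dicdir]
    if output_format_type:
        lines.append('output-format-type = ' + output_format_type)
    with open(path, 'w') as f:
        f.write('\n'.join(lines) + '\n')
    return path
```

The returned path could then be passed along as `MeCab('--rcfile=' + path)`. As suggested above, it is worth trying the file with `mecab --rcfile=...` at the command-line first.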

You can review the various MeCab options in the project Wiki's Appendix B: Supported MeCab Options.
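
To make the correspondence between the dict form and the command-line form concrete, here is a small sketch that turns a dict of options into the equivalent long-option string. The function to_option_string is a hypothetical illustration and does not show how natto-py parses options internally; it assumes each dict key matches a mecab long option name with underscores in place of hyphens, as with output_format_type and --output-format-type.

```python
def to_option_string(options):
    """Render a dict of options as mecab long options (illustrative only)."""
    parts = []
    for key, value in options.items():
        # e.g. 'output_format_type' -> '--output-format-type=wakati'
        parts.append('--{}={}'.format(key.replace('_', '-'), value))
    return ' '.join(parts)


opts = {
    'dicdir': '/usr/local/lib/mecab/dic/mecab-ko-dic',
    'output_format_type': 'wakati',
}
option_string = to_option_string(opts)
```

Here `option_string` spells out -Owakati in its long form, which is a convenient string to try at the mecab command-line before using the options with natto-py.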

technolingo commented 6 years ago

@buruzaemon Great!! Thanks a lot. That's very informative!