BSD 2-Clause "Simplified" License

natto-py

What is natto-py?

A package leveraging FFI (foreign function interface), natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language. No compiler is necessary, as it is not a C extension. natto-py will run on Mac OS, Windows and *nix.

You can learn more about `natto-py at GitHub`_.

If you are still using `Python 2 after sunset`_, please stick with version natto-py==0.9.2.

|version| |pyversions| |license| |github-actions| |readthedocs|

Requirements

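natto-py requires the following:

- An existing installation of `MeCab 0.996`_
- A system dictionary, like `IPA`_, `Juman`_ or `Unidic`_
- `cffi 0.8.6`_ or greater

The following Python 3 versions are supported:

- `Python 3.7`_
- `Python 3.8`_
- `Python 3.9`_
- `Python 3.10`_

For Python 2, please use version 0.9.2.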

Installation

Install natto-py as you would any other Python package:

.. code-block:: bash

    $ pip install natto-py

This will automatically install the cffi package, which natto-py uses to bind to the mecab library.

Automatic Configuration

As long as the mecab (and mecab-config for *nix and Mac OS) executables are on your PATH, natto-py does not require any explicit configuration.

Explicit configuration via MECAB_PATH and MECAB_CHARSET

If natto-py for some reason cannot locate the mecab library, or if it cannot determine the correct charset used internally by mecab, then you will need to set the MECAB_PATH and MECAB_CHARSET environment variables.

e.g., for Mac OS:

.. code-block:: bash

    export MECAB_PATH=/usr/local/Cellar/mecab/0.996/lib/libmecab.dylib
    export MECAB_CHARSET=utf8

e.g., for bash on UNIX/Linux:

.. code-block:: bash

    export MECAB_PATH=/usr/local/lib/libmecab.so
    export MECAB_CHARSET=euc-jp

e.g., on Windows:

.. code-block:: bat

    set MECAB_PATH=C:\Program Files\MeCab\bin\libmecab.dll
    set MECAB_CHARSET=shift-jis

e.g., from within a Python program:

.. code-block:: python

    import os

    os.environ['MECAB_PATH'] = '/usr/local/lib/libmecab.so'
    os.environ['MECAB_CHARSET'] = 'utf-16'
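
When the library location differs between machines, the variables can be set conditionally before natto is imported. The following is a minimal sketch only: ``configure_mecab`` is a hypothetical helper, not part of natto-py, and the candidate paths are simply the ones used in the examples above.

```python
import os


def configure_mecab(candidates, charset="utf8", environ=None):
    """Set MECAB_PATH and MECAB_CHARSET from the first candidate that exists.

    Hypothetical helper for illustration; not part of natto-py.
    Returns the chosen path, or None if no candidate is a file.
    """
    env = os.environ if environ is None else environ
    for path in candidates:
        if os.path.isfile(path):
            env["MECAB_PATH"] = path
            env["MECAB_CHARSET"] = charset
            return path
    return None


# Candidate paths taken from the examples above; adjust for your system.
chosen = configure_mecab([
    "/usr/local/lib/libmecab.so",
    "/usr/local/Cellar/mecab/0.996/lib/libmecab.dylib",
])
print(chosen or "no candidate found; relying on automatic configuration")
```

If no candidate exists, the environment is left untouched, so automatic configuration still applies.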

Usage

Here's a very quick guide to using natto-py.

Instantiate a reference to the mecab library, and display some details:

.. code-block:: python

    from natto import MeCab

    nm = MeCab()
    print(nm)

    # displays details about the MeCab instance
    # <natto.mecab.MeCab
    #  model=<cdata 'mecab_model_t *' 0x801c16300>,
    #  tagger=<cdata 'mecab_t *' 0x801c17470>,
    #  lattice=<cdata 'mecab_lattice_t *' 0x801c196c0>,
    #  libpath="/usr/local/lib/libmecab.so",
    #  options={},
    #  dicts=[<natto.dictionary.DictionaryInfo
    #          dictionary='mecab_dictionary_info_t *' 0x801c19540>,
    #          filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic",
    #          charset=utf8,
    #          type=0],
    #  version=0.996>

Display details about the mecab system dictionary used:

.. code-block:: python

    sysdic = nm.dicts[0]
    print(sysdic)

    # displays the MeCab system dictionary info
    # <natto.dictionary.DictionaryInfo
    #  dictionary='mecab_dictionary_info_t *' 0x801c19540>,
    #  filepath="/usr/local/lib/mecab/dic/ipadic/sys.dic",
    #  charset=utf8,
    #  type=0>

Parse Japanese text and send the MeCab result as a single string to stdout:

.. code-block:: python

    print(nm.parse('ピンチの時には必ずヒーローが現れる。'))

    # MeCab result as a single string
    # ピンチ    名詞,一般,*,*,*,*,ピンチ,ピンチ,ピンチ
    # の      助詞,連体化,*,*,*,*,の,ノ,ノ
    # 時      名詞,非自立,副詞可能,*,*,*,時,トキ,トキ
    # に      助詞,格助詞,一般,*,*,*,に,ニ,ニ
    # は      助詞,係助詞,*,*,*,*,は,ハ,ワ
    # 必ず    副詞,助詞類接続,*,*,*,*,必ず,カナラズ,カナラズ
    # ヒーロー  名詞,一般,*,*,*,*,ヒーロー,ヒーロー,ヒーロー
    # が      助詞,格助詞,一般,*,*,*,が,ガ,ガ
    # 現れる  動詞,自立,*,*,一段,基本形,現れる,アラワレル,アラワレル
    # 。      記号,句点,*,*,*,*,。,。,。
    # EOS
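
If you need structured data from that single string, one option is to split it yourself. The sketch below assumes MeCab's default output layout (the surface, a tab, then comma-separated features, terminated by an ``EOS`` line). ``split_mecab_output`` is a hypothetical helper, and the sample string is abbreviated from the output above rather than produced by MeCab here.

```python
def split_mecab_output(result):
    """Split MeCab's default output into (surface, [features]) pairs."""
    rows = []
    for line in result.splitlines():
        # skip blank lines and the end-of-sentence marker
        if not line or line == "EOS":
            continue
        surface, _, feature = line.partition("\t")
        rows.append((surface, feature.split(",")))
    return rows


# Abbreviated sample in the assumed default layout (tab after the surface).
sample = (
    "ピンチ\t名詞,一般,*,*,*,*,ピンチ,ピンチ,ピンチ\n"
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n"
    "EOS"
)

for surface, features in split_mecab_output(sample):
    # with ipadic, features[0] is the part-of-speech and
    # features[-1] the pronunciation
    print(surface, features[0], features[-1])
```

For anything beyond quick inspection, node parsing is usually the more convenient route.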

Next, try parsing the text with MeCab node parsing. A generator yielding the MeCabNode instances lets you efficiently iterate over the output without first materializing each and every resulting MeCabNode instance. The MeCabNode instances yielded allow access to more detailed information about each morpheme.

Here we use a `Python with-statement`_ to automatically clean up after we finish node parsing with the MeCab tagger. This is the recommended approach for using natto-py in a production environment:

.. code-block:: python

    from natto import MeCab

    # Use a Python with-statement to ensure mecab_destroy is invoked
    with MeCab() as nm:
        for n in nm.parse('ピンチの時には必ずヒーローが現れる。', as_nodes=True):
            # ignore any end-of-sentence nodes
            if not n.is_eos():
                print('{}\t{}'.format(n.surface, n.cost))

    # ピンチ    3348
    # の        3722
    # 時        5176
    # に        5083
    # は        5305
    # 必ず    7525
    # ヒーロー   11363
    # が       10508
    # 現れる   10841
    # 。        7127

MeCab output formatting is extremely flexible and is highly recommended for any serious natural language processing task. Rather than parsing the MeCab output as a single, large string, use MeCab's --node-format option (short form -F) to customize the node's feature attribute.

It is good practice when using --node-format to also specify node formatting in the case where the morpheme cannot be found in the dictionary, by using --unk-format (short form -U).

This example formats the node feature as a comma-separated value containing the morpheme surface, part-of-speech, part-of-speech ID and pronunciation:

.. code-block:: python

    from natto import MeCab

    # MeCab options used:
    #
    # -F    ... short-form of --node-format
    # %m    ... morpheme surface
    # %f[0] ... part-of-speech
    # %h    ... part-of-speech id (ipadic)
    # %f[8] ... pronunciation
    #
    # -U    ... short-form of --unk-format
    #           output ?,?,?,? for morphemes not in dictionary
    #
    with MeCab(r'-F%m,%f[0],%h,%f[8]\n -U?,?,?,?\n') as nm:
        for n in nm.parse('ピンチの時には必ずヒーローが現れる。', as_nodes=True):
            # only normal nodes, ignore any end-of-sentence and unknown nodes
            if n.is_nor():
                print(n.feature)

    # ピンチ,名詞,38,ピンチ
    # の,助詞,24,ノ
    # 時,名詞,66,トキ
    # に,助詞,13,ニ
    # は,助詞,16,ワ
    # 必ず,副詞,35,カナラズ
    # ヒーロー,名詞,38,ヒーロー
    # が,助詞,13,ガ
    # 現れる,動詞,31,アラワレル
    # 。,記号,7,。
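
Once each node's feature is a comma-separated value, it maps naturally onto a record type. This is a minimal sketch, assuming the four-field format used above. The ``Morpheme`` tuple and ``to_morpheme`` helper are hypothetical names, and the input lines are copied from the output above rather than produced by MeCab here.

```python
from collections import namedtuple

# Field order mirrors the -F format above: %m, %f[0], %h, %f[8]
Morpheme = namedtuple("Morpheme", ["surface", "pos", "pos_id", "pronunciation"])


def to_morpheme(feature):
    """Convert one 'surface,pos,pos_id,pronunciation' line into a Morpheme."""
    surface, pos, pos_id, pronunciation = feature.split(",")
    return Morpheme(surface, pos, int(pos_id), pronunciation)


features = ["ピンチ,名詞,38,ピンチ", "の,助詞,24,ノ", "現れる,動詞,31,アラワレル"]
morphemes = [to_morpheme(f) for f in features]

# e.g., keep only the nouns
nouns = [m.surface for m in morphemes if m.pos == "名詞"]
print(nouns)
```

Note that a plain ``split(",")`` breaks if the surface or pronunciation is itself a comma. In that case a different delimiter in the node format would be a safer choice.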

`Partial parsing`_ (制約付き解析) allows you to pass hints to MeCab on how to tokenize morphemes when parsing. The most useful forms are boundary constraint parsing and feature constraint parsing.

With boundary constraint parsing, you can specify either a compiled re regular expression object or a string to tell MeCab where the boundaries of a morpheme should be. Use the boundary_constraints keyword. For hints on tokenization, please see `Regular expression operations`_ and `re.finditer`_ in particular.

This example uses the -F node-format option to customize the resulting MeCabNode feature attribute to extract the morpheme surface (%m), the part-of-speech (%f[0]) and the node stat status (%s).

Note that any morphemes captured by a boundary constraint will have a node stat status of 1 (unknown):

.. code-block:: python

    import re

    from natto import MeCab

    with MeCab(r'-F%m,\s%f[0],\s%s\n') as nm:

        text = '俺は努力したよっ? お前の10倍、いや100倍1000倍したよっ!'

        # capture 10倍, 100倍 and 1000倍 as single morphemes
        pattern = re.compile('10+倍')

        for n in nm.parse(text, boundary_constraints=pattern, as_nodes=True):
            print(n.feature)

    # 俺, 名詞, 0
    # は, 助詞, 0
    # 努力, 名詞, 0
    # し, 動詞, 0
    # たよっ, 動詞, 0
    # ?, 記号, 0
    # お前, 名詞, 0
    # の, 助詞, 0
    # 10倍, 名詞, 1
    # 、, 記号, 0
    # いや, 接続詞, 0
    # 100倍, 名詞, 1
    # 1000倍, 名詞, 1
    # し, 動詞, 0
    # たよっ, 動詞, 0
    # !, 記号, 0
    # EOS
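
To preview which substrings a pattern would mark as morpheme boundaries before handing it to MeCab, you can run it through ``re.finditer`` yourself. This sketch uses only the standard library and does not call MeCab. It reuses the text and pattern from the example above.

```python
import re

text = '俺は努力したよっ? お前の10倍、いや100倍1000倍したよっ!'
pattern = re.compile('10+倍')

# Each match is a candidate morpheme: (start offset, end offset, matched text)
spans = [(m.start(), m.end(), m.group()) for m in pattern.finditer(text)]

for start, end, token in spans:
    print(start, end, token)
```

The matched tokens are the same three morphemes that carry a stat status of 1 in the output above.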

With feature constraint parsing, you can provide instructions to MeCab on what feature to use for a matching morpheme. Use the feature_constraints keyword to pass in a tuple containing elements that themselves are tuple instances with a specific morpheme (str) and a corresponding feature (str), in order of constraint precedence:

.. code-block:: python

    from natto import MeCab

    with MeCab(r'-F%m,\s%f[0],\s%s\n') as nm:

        text = '心の中で3回唱え、 ヒーロー見参!ヒーロー見参!ヒーロー見参!'
        features = (('ヒーロー見参', '感動詞'),)

        for n in nm.parse(text, feature_constraints=features, as_nodes=True):
            print(n.feature)

    # 心, 名詞, 0
    # の, 助詞, 0
    # 中, 名詞, 0
    # で, 助詞, 0
    # 3, 名詞, 1
    # 回, 名詞, 0
    # 唱え, 動詞, 0
    # 、, 記号, 0
    # ヒーロー見参, 感動詞, 1
    # !, 記号, 0
    # ヒーロー見参, 感動詞, 1
    # !, 記号, 0
    # ヒーロー見参, 感動詞, 1
    # !, 記号, 0
    # EOS
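
Because feature_constraints expects a tuple of (morpheme, feature) tuples in precedence order, it can help to build that value from a plain list and check its shape first. A minimal sketch follows. ``make_feature_constraints`` is a hypothetical helper, not part of natto-py, and the validation it performs is an assumption about what is convenient rather than something natto-py requires.

```python
def make_feature_constraints(pairs):
    """Build a tuple of (morpheme, feature) tuples, preserving order.

    Hypothetical helper for illustration; not part of natto-py.
    """
    constraints = []
    for morpheme, feature in pairs:
        if not isinstance(morpheme, str) or not isinstance(feature, str):
            raise TypeError("morpheme and feature must both be str")
        constraints.append((morpheme, feature))
    return tuple(constraints)


# Earlier entries take precedence over later ones.
features = make_feature_constraints([
    ("ヒーロー見参", "感動詞"),
])
print(features)
```

The resulting value can then be passed as the feature_constraints keyword, as in the example above.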

Learn More

- Read more about natto-py on the `project Wiki`_
- Browse examples in the `project's notebooks directory`_
- Consult the `API documentation on Read the Docs`_

Contributing to natto-py

Changelog

Please see the CHANGELOG for the release history.

Copyright

Copyright |copy| 2022, Brooke M. Fujita. All rights reserved. Please see the LICENSE file for further details.

.. |version| image:: https://badge.fury.io/py/natto-py.svg
    :target: https://pypi.org/project/natto-py/
.. |pyversions| image:: https://img.shields.io/pypi/pyversions/natto-py.svg?style=flat
.. |github-actions| image:: https://github.com/buruzaemon/natto-py/actions/workflows/automated-test-actions.yml/badge.svg
.. |license| image:: https://img.shields.io/badge/license-BSD-blue.svg
    :target: https://raw.githubusercontent.com/buruzaemon/natto-py/master/LICENSE
.. |readthedocs| image:: https://readthedocs.org/projects/natto-py/badge/?version=master
    :target: http://natto-py.readthedocs.org/en/master/?badge=master
    :alt: Documentation Status
.. _Python: http://www.python.org/
.. _MeCab: http://taku910.github.io/mecab/
.. _Python 2 after sunset: https://www.python.org/doc/sunset-python-2/
.. _IPA: http://taku910.github.io/mecab/#download
.. _Juman: http://taku910.github.io/mecab/#download
.. _Unidic: http://taku910.github.io/mecab/#download
.. _natto-py at GitHub: https://github.com/buruzaemon/natto-py
.. _MeCab 0.996: http://taku910.github.io/mecab/#download
.. _cffi 0.8.6: https://bitbucket.org/cffi/cffi
.. _Python 3.7: https://docs.python.org/3.7/whatsnew/3.7.html
.. _Python 3.8: https://docs.python.org/3.8/whatsnew/3.8.html
.. _Python 3.9: https://docs.python.org/3.9/whatsnew/3.9.html
.. _Python 3.10: https://docs.python.org/3/whatsnew/3.10.html
.. _NLTK3's lead: https://github.com/nltk/nltk/wiki/Porting-your-code-to-NLTK-3.0
.. _Python with-statement: https://www.python.org/dev/peps/pep-0343/
.. _Partial parsing: http://taku910.github.io/mecab/partial.html
.. _Regular expression operations: https://docs.python.org/3/library/re.html
.. _re.finditer: https://docs.python.org/3/library/re.html#re.finditer
.. _project Wiki: https://github.com/buruzaemon/natto-py/wiki
.. _project's notebooks directory: https://github.com/buruzaemon/natto-py/tree/master/notebooks
.. _API documentation on Read the Docs: http://natto-py.readthedocs.org/en/master/
.. _git: http://git-scm.com/downloads
.. _check out the latest code at GitHub: https://github.com/buruzaemon/natto-py
.. _Browse the issue tracker: https://github.com/buruzaemon/natto-py/issues
.. _Sphinx: http://sphinx-doc.org/
.. _twine: https://github.com/pypa/twine
.. _unittest: http://pythontesting.net/framework/unittest/unittest-introduction/
.. _PyYAML: https://github.com/yaml/pyyaml
.. |copy| unicode:: 0xA9 .. copyright sign