buruzaemon / natto-py

natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.
BSD 2-Clause "Simplified" License

ERROR: UnicodeDecodeError: 'utf-8' codec can't decode byte *** in position 0: unexpected end of data #106

Open markdevel opened 6 years ago

markdevel commented 6 years ago

I encountered an error when running the following code. I think it happens when the text being parsed contains two or more of the boundary-constraint keywords and they are adjacent to each other, separated only by a delimiter.

# -*- coding: utf-8 -*-
from natto import MeCab
text = 'a aあ'
with MeCab() as nm:
    for n in nm.parse(text, boundary_constraints='a', as_nodes=True):
        print(n)

output

~$ ./test.py
<natto.node.MeCabNode node=<cdata 'mecab_node_t *' 0x2d0d910>, stat=1, surface="a", feature="名詞,固有名詞,組織,*,*,*,*">
<natto.node.MeCabNode node=<cdata 'mecab_node_t *' 0x2d0dad0>, stat=1, surface="a", feature="名詞,一般,*,*,*,*,*">
MECAB_NBEST request type is not set
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/natto/mecab.py", line 397, in __parse_tonodes
    surf = self.__bytes2str(raws).strip()
  File "/usr/local/lib/python3.5/dist-packages/natto/support.py", line 26, in bytes2str
    return b.decode(py3enc)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe3 in position 0: unexpected end of data

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "./test.py", line 6, in <module>
    for n in nm.parse(text, boundary_constraints='a', as_nodes=True):
  File "/usr/local/lib/python3.5/dist-packages/natto/mecab.py", line 427, in __parse_tonodes
    raise MeCabError(self.__bytes2str(self.__ffi.string(err)))
natto.api.MeCabError: MECAB_NBEST request type is not set
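
The first traceback above is what Python reports when a byte string ends partway through a multi-byte UTF-8 character. The sketch below reproduces that message in isolation, without MeCab or natto-py. It only illustrates the decode failure; where natto-py actually obtains the short byte string is not shown here, and `try_decode` is a hypothetical helper name.

```python
def try_decode(raw):
    """Decode UTF-8 bytes, or describe why decoding failed."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.reason holds the short description shown in the traceback
        return "{}: {}".format(type(exc).__name__, exc.reason)


full = "あ".encode("utf-8")    # three bytes: 0xe3 0x81 0x82

print(try_decode(full))        # all three bytes present, decodes fine
print(try_decode(full[:1]))    # only 0xe3: the character is cut short
print(try_decode(full[:2]))    # 0xe3 0x81: still cut short
```

Because `0xe3` is the lead byte of `あ`, a one-byte or two-byte slice starting at that character fails with "unexpected end of data", which matches the byte and position reported in the traceback.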

environment

~$ cat /etc/*-release;python -V; mecab -P;
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04.4 LTS"
NAME="Ubuntu"
VERSION="16.04.4 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.4 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
Python 3.5.2
bos-feature: BOS/EOS,*,*,*,*,*,*,*,*
bos-format: 
config-charset: UTF-8
cost-factor: 700
dicdir: /var/lib/mecab/dic/debian
dump-config: 1
eon-format: 
eos-format: EOS\n
eos-format-chasen: EOS\n
eos-format-chasen2: EOS\n
eos-format-simple: EOS\n
eos-format-yomi: \n
eval-size: 8
lattice-level: 0
max-grouping-size: 24
nbest: 1
node-format: %m\t%H\n
node-format-chasen: %m\t%f[7]\t%f[6]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
node-format-chasen2: %M\t%f[7]\t%f[6]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
node-format-simple: %m\t%F-[0,1,2,3]\n
node-format-yomi: %pS%f[7]
theta: 0.75
unk-eval-size: 4
unk-format: %m\t%H\n
unk-format-chasen: %m\t%m\t%m\t%F-[0,1,2,3]\t\t\n
unk-format-chasen2: %M\t%m\t%m\t%F-[0,1,2,3]\t\t\n
unk-format-yomi: %M

buruzaemon commented 6 years ago

Thank you for raising this issue, @markdevel. I will have a closer look at this.

buruzaemon commented 6 years ago

MeCab's expected behavior for the usage pattern described above has been confirmed with the mecab command-line tool, as shown in the two cases below.

Case 1: ASCII whitespace between the 2 chars specified as boundary constraint:

(natto-py-36) F:\Area52\home\buruzaemon\dev\github\natto-py>echo a aあ | mecab
a	感動詞,*,*,*,*,*,*
a	感動詞,*,*,*,*,*,*
あ	フィラー,*,*,*,*,*,あ,ア,ア
EOS

Case 2: Full-width space (空白) char between the 2 chars specified as boundary constraint:

(natto-py-36) F:\Area52\home\buruzaemon\dev\github\natto-py>echo a　aあ | mecab
a	名詞,固有名詞,組織,*,*,*,*
　	記号,空白,*,*,*,*,　,　,　
a	感動詞,*,*,*,*,*,*
あ	フィラー,*,*,*,*,*,あ,ア,ア
EOS

If natto-py is to conform to this behavior, then changes need to be made to natto/mecab.py and natto/support.py with respect to how whitespace is handled when tokens are yielded.
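
As a rough illustration of the whitespace bookkeeping involved, the sketch below walks a UTF-8 byte buffer and skips the ASCII whitespace that precedes a token before slicing out its surface. It is not natto-py's code. The `(length, rlength)` pairs are hand-written stand-ins modeled on the `length` and `rlength` fields of MeCab's node struct, where `rlength` also counts the whitespace before the token, and `slice_surfaces` is a hypothetical name.

```python
def slice_surfaces(text, tokens):
    """Yield decoded token surfaces from the UTF-8 encoding of text.

    tokens is a list of (length, rlength) byte counts. These are
    hypothetical values written by hand for this example.
    """
    raw = text.encode("utf-8")
    pos = 0
    for length, rlength in tokens:
        start = pos + (rlength - length)   # skip whitespace before the token
        yield raw[start:start + length].decode("utf-8")
        pos += rlength                     # advance past whitespace and token


# Case 1 input: 'a', then ' a' (one whitespace byte plus one token byte),
# then 'あ' (three bytes in UTF-8)
tokens = [(1, 1), (1, 2), (3, 3)]
surfaces = list(slice_surfaces("a aあ", tokens))
print(surfaces)
```

If the whitespace byte were not accounted for, each later slice would start one byte early and could end inside the three bytes of `あ`, which is the kind of truncation that makes a UTF-8 decode fail.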