SamuraiT / mecab-python3

🐍 mecab-python. You can find the original version here: http://taku910.github.io/mecab/
https://pypi.python.org/pypi/mecab-python3

Caught UnicodeDecodeError when using `parseToNode` alone #3

Closed graph226 closed 5 years ago

graph226 commented 8 years ago

When we call tagger.parseToNode(text) on its own, we sometimes get an error such as:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbb in position 1: invalid start byte

To avoid this, call tagger.parse(text) before calling parseToNode.
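The workaround above can be sketched as a small helper. This is only an illustration: `surfaces_with_workaround` is a hypothetical name, not part of mecab-python3, and `tagger` is assumed to behave like `MeCab.Tagger` (it has `parse()` and `parseToNode()`, and its nodes expose `.surface` and `.next`).

```python
def surfaces_with_workaround(tagger, text):
    """Return the surface form of every node produced for `text`.

    `tagger` is assumed to behave like MeCab.Tagger; this helper is a
    hypothetical illustration, not part of the library.
    """
    # Workaround from this thread: call parse() first and discard the result.
    tagger.parse(text)

    surfaces = []
    node = tagger.parseToNode(text)
    # Walk the linked list of nodes until .next yields nothing.
    while node:
        surfaces.append(node.surface)
        node = node.next
    return surfaces
```

With mecab-python3 installed, this would be called as `surfaces_with_workaround(MeCab.Tagger(), text)`. The first and last nodes are MeCab's beginning-of-sentence and end-of-sentence markers, so their surfaces are typically empty strings.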

graph226 commented 8 years ago

To fix this in the library, parseToNode could call the parse method first, but I don't know whether that works 😇

https://github.com/SamuraiT/mecab-python3/blob/5ee7aa538c8408d61a42aedea9d2f000c86f1ca3/MeCab.py#L282

For example:

def parseToNode(self, *args):
    # Call parse() first as a workaround; its return value is discarded.
    self.parse(*args)
    return _MeCab.Tagger_parseToNode(self, *args)

Please give me your idea about this.

shiumachi commented 6 years ago

I got the same error and fixed it with the above workaround.

kei-s commented 6 years ago

I investigated the cause of this bug.

In the _wrap_Tagger_parseToNode function, this line deletes buf2 because alloc2 is SWIG_NEWOBJ: https://github.com/SamuraiT/mecab-python3/blob/5ee7aa538c8408d61a42aedea9d2f000c86f1ca3/MeCab_wrap.cxx#L6527

In Python 2, buf2 is not deleted because alloc2 is SWIG_OLDOBJ. (MeCab_wrap.cxx is exactly the same as @taku910's original: https://github.com/taku910/mecab/blob/master/mecab/python/MeCab_wrap.cxx .)

So the cause of this bug lies in the SWIG_AsCharPtrAndSize function. I think something is wrong in this block: https://github.com/SamuraiT/mecab-python3/blob/5ee7aa538c8408d61a42aedea9d2f000c86f1ca3/MeCab_wrap.cxx#L3461-L3470 But I don't have a patch for it at this time. 😕
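One way to picture the symptom, assuming the node ends up reading from the buffer after it has been deleted: whatever bytes happen to be there get decoded as UTF-8. The sketch below does not touch MeCab at all. It uses a hand-picked byte string (`stale_bytes`, a hypothetical stand-in for leftover memory) to show how such bytes produce this kind of error.

```python
# Stand-in for leftover memory: an ASCII byte followed by 0xbb, a UTF-8
# continuation byte that can never begin a character.
stale_bytes = b"a\xbb"

try:
    stale_bytes.decode("utf-8")
    decode_error = None
except UnicodeDecodeError as exc:
    decode_error = exc

if decode_error is not None:
    # The message names the offending byte and its position.
    print(decode_error)
```

Because the contents of freed memory are arbitrary, the failure would be expected to appear only sometimes, which is consistent with the original report.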

orangain commented 6 years ago

I got the same problem and found that using the latest version of MeCab solves the problem.

My environment:

This problem seems to be the same as the one reported in https://github.com/taku910/mecab/issues/5, and it has been solved by https://github.com/taku910/mecab/pull/24 merged in Feb 2016.

Although this problem occurs only in Python 3, it does not seem to be a problem in mecab-python3 itself; rather, it seems to be a memory-management problem in MeCab.

Unfortunately, major package managers such as Homebrew and APT currently offer an older version of MeCab, based on the February 2013 source that can be obtained from Google Drive.

To avoid this problem without using the workaround mentioned above, you need to build and install MeCab from the latest source on GitHub manually, and then reinstall mecab-python3.

zackw commented 6 years ago

@graph226 I believe this ought to be fixed by using the latest version of the package and the latest version of MeCab, but I cannot be sure because you did not provide a complete test case that I can run for myself. Could you please try your code again? Make sure to use mecab-python3 0.8.3, MeCab 0.996, and a current version of SWIG (I have 3.0.12).

It's been a long time since you reported this bug and perhaps you have moved on, so if I don't hear from you in a month I will close the bug (but feel free to reopen it if you don't get to this until after that, and it's still a problem).
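A test case of the kind requested above might be structured like the following sketch. `find_failing_inputs` and `make_tagger` are hypothetical names introduced for illustration. The tagger is assumed to behave like `MeCab.Tagger`, and reading `node.surface` is assumed to be the point where the text is decoded.

```python
def find_failing_inputs(make_tagger, texts):
    """Return the texts for which walking parseToNode() output fails to decode.

    `make_tagger` is a zero-argument callable returning a tagger-like object
    (for example MeCab.Tagger). parse() is deliberately never called here.
    """
    failing = []
    for text in texts:
        # Use a fresh tagger per input so inputs cannot affect each other.
        tagger = make_tagger()
        try:
            node = tagger.parseToNode(text)
            while node:
                node.surface  # assumed to trigger decoding of the surface
                node = node.next
        except UnicodeDecodeError:
            failing.append(text)
    return failing
```

With mecab-python3 installed, `find_failing_inputs(MeCab.Tagger, texts)` would return the inputs that trigger the error. An empty list would suggest the installed combination of MeCab and mecab-python3 is not affected for those inputs.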

polm commented 5 years ago

Please see the spaCy issue linked above, which provides a Dockerfile and code to reproduce the issue. I think @orangain's explanation is exactly right.

zackw commented 5 years ago

@polm Thanks for the pointer. I think you're right. I am going to consider this bug a concrete reason why we need to ship binary wheels from PyPI with bundled libmecab, so it will be addressed by PR #18, which I will be reviewing and landing Real Soon Now. I'll leave the bug open till then.

zackw commented 5 years ago

Please try the release candidate available from https://test.pypi.org/project/mecab-python3/0.996.2rc2/ ; this bug should be corrected there. Thank you, everyone, for your patience. We plan to make a new official release in the next couple of weeks.

zackw commented 5 years ago

0.996.2 has been officially released and this issue should be corrected. Please file a new bug report if you are still having problems with parseToNode.