natto.api.MeCabError: Could not format node with surface XXX: given index is out of range

mirfan899 commented 3 years ago

Hi, I'm trying to use the spaCy library, which requires natto-py to run the tokenizer and other pipelines for Korean. When I try to tokenize a Korean sentence, it throws the following error.

import spacy
nlp = spacy.blank("ko")

nlp("번홀 보기로 위험하게 출발한 그는 13번홀 버디로 요체를 잡았으나 에서 더블보기를 범해 치명타를 입었다")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "/home/irfan/environments/Korean/lib/python3.6/site-packages/spacy/language.py", line 977, in __call__
    doc = self.make_doc(text)
  File "/home/irfan/environments/Korean/lib/python3.6/site-packages/spacy/language.py", line 1059, in make_doc
    return self.tokenizer(text)
  File "/home/irfan/environments/Korean/lib/python3.6/site-packages/spacy/lang/ko/__init__.py", line 41, in __call__
    dtokens = list(self.detailed_tokens(text))
  File "/home/irfan/environments/Korean/lib/python3.6/site-packages/spacy/lang/ko/__init__.py", line 55, in detailed_tokens
    for node in self.mecab_tokenizer.parse(text, as_nodes=True):
  File "/home/irfan/environments/Korean/lib/python3.6/site-packages/natto/mecab.py", line 415, in __parse_tonodes
    raise MeCabError(msg)
natto.api.MeCabError: Could not format node with surface 번홀: given index is out of range

I have tried multiple spaCy versions (2.3.5 and 3.0.1) with Python 3.6.9 on Ubuntu 18.04.

mirfan899 commented 3 years ago

I have installed mecab for Ubuntu 18.04 using

sudo apt-get install mecab libmecab-dev
sudo apt-get install mecab-ipadic-utf8

buruzaemon commented 3 years ago

Hello, @mirfan899, and thank you for bringing this to my attention.

I notice that there are whitespace chars in the Korean text string in your example. When you input this string into MeCab at the command-line, do you get the output you expect?

mirfan899 commented 3 years ago

Okay, here is the output. I installed https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-1.6.1-20140814.tar.gz and it looks like it's working fine. The latest mecab-ko-dic does not work with natto due to the installation issue mentioned here: https://bitbucket.org/eunjeon/mecab-ko-dic/issues/25/mecab-ko-dic-configure

[Attached image: mecab_result]

buruzaemon commented 3 years ago

Thank you for the additional information.

Now, can you tell me where all the half-width space (0x20) characters went?

mirfan899 commented 3 years ago

I have to check that. Not sure how this dic works.

buruzaemon commented 3 years ago

Hello again, @mirfan899

OK, I don't think you need to concern yourself with my question about whitespace chars, since I just took a look at the source code for spacy/lang/ko/__init__.py.

Could you please do me a favor and use this shortened version of your original string, and let me know what happens (please attach any output).

import spacy
nlp = spacy.blank("ko")

nlp("번홀")

mirfan899 commented 3 years ago

Okay, so the issue was actually the mecab-ko-dic version. After installing version 1.6.1, everything works. But when I install mecab from the Ubuntu packages, it breaks.

import spacy
nlp = spacy.blank("ko")
nlp("번홀")
nlp
<spacy.lang.ko.Korean object at 0x7f022fa86eb8>
doc = nlp("번홀")
doc.text
'번홀'

buruzaemon commented 3 years ago

Yes, I suspected that there was some sort of issue in the default output of the underlying dictionary.

In __init__.py at line 35 there is the following specification on the output format:

    self.mecab_tokenizer = MeCab("-F%f[0],%f[7]")

This apparently instructs mecab to output the 1st and 8th elements of whatever the default output is for the dictionary being used.

However, it might be the case that for the token in question, the 1st / 8th elements might not even exist, which causes mecab to throw the "given index is out of range" error (this floats up through natto-py, and this is what you are seeing).
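To illustrate the mechanism, here is a minimal pure-Python sketch of what a `%f[0],%f[7]` lookup amounts to. It is only a simulation, not mecab's actual implementation; the helper names `split_feature` and `format_node` and both feature strings are made up for this example.

```python
import csv


def split_feature(feature):
    """Split a comma-separated feature string into its fields."""
    return next(csv.reader([feature]))


def format_node(feature, indices=(0, 7)):
    """Imitate -F%f[0],%f[7]: pick the listed fields, fail if one is missing."""
    fields = split_feature(feature)
    picked = []
    for i in indices:
        if i >= len(fields):
            raise IndexError(
                "given index is out of range: %d (only %d fields)" % (i, len(fields))
            )
        picked.append(fields[i])
    return ",".join(picked)


# Hypothetical feature strings, invented for illustration only.
eight_fields = "NNG,*,T,word,*,*,*,*"
seven_fields = "SY,*,*,*,*,*,*"

print(format_node(eight_fields))  # fields 0 and 7 both exist

try:
    format_node(seven_fields)  # there is no field 7 here
except IndexError as err:
    print("failed:", err)
```

A node with eight or more elements formats fine, while a node with only seven fails in the same way as the error reported above.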

This is not a bug in natto-py, nor is it a problem with mecab itself. Rather, in my opinion, the aforementioned -F%f[0],%f[7] output format specification likely does not account for the case where the underlying dictionary's default output does not contain the expected number of elements, or the case where a word is unknown to the dictionary.
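One way to check this on an affected machine is to parse with the dictionary's default output (no -F option) and count the feature elements per node. The following is a rough sketch, assuming natto-py and mecab are installed; `field_count`, `short_nodes`, and `mecab_pairs` are hypothetical helper names, and the threshold of 8 simply reflects that `%f[7]` needs an 8th element.

```python
import csv

REQUIRED_FIELDS = 8  # %f[7] refers to the 8th feature element


def field_count(feature):
    """Number of comma-separated elements in a feature string."""
    return len(next(csv.reader([feature]))) if feature else 0


def short_nodes(pairs, required=REQUIRED_FIELDS):
    """Return (surface, count) for every node with too few feature elements."""
    return [(s, field_count(f)) for s, f in pairs if field_count(f) < required]


def mecab_pairs(text):
    """Yield (surface, feature) pairs using the dictionary's default output.

    Requires natto-py and a working mecab installation.
    """
    from natto import MeCab  # imported lazily so the helpers above work without it

    with MeCab() as nm:
        for node in nm.parse(text, as_nodes=True):
            if not node.is_eos():
                yield node.surface, node.feature
```

Calling `short_nodes(mecab_pairs("번홀"))` should then list any surface whose feature string is too short for `%f[7]` under the currently active dictionary; an empty list would suggest the dictionary provides enough elements for that input.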

I am going to close this issue here since you seem to have worked things out on your own, but you might wish to raise this issue with the maintainers of the ko NLP model.

buruzaemon / natto-py

natto.api.MeCabError: Could not format node with surface XXX: given index is out of range #116