Closed · mirfan899 closed this issue 3 years ago
I have installed MeCab on Ubuntu 18.04 using:
sudo apt-get install mecab libmecab-dev
sudo apt-get install mecab-ipadic-utf8
Hello, @mirfan899, and thank you for bringing this to my attention.
I notice that there are whitespace chars in the Korean text string in your example. When you input this string into MeCab at the command line, do you get the output you expect?
Okay, here is the output. What I did: I installed https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-1.6.1-20140814.tar.gz and it looks like it's working fine. The latest mecab-ko-dic is not working with natto due to the installation issue mentioned here: https://bitbucket.org/eunjeon/mecab-ko-dic/issues/25/mecab-ko-dic-configure
Thank you for the additional information.
Now, can you tell me where all the half-width (0x20) chars went?
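(As an aside for anyone debugging a string like this: a quick, dependency-free way to see exactly which characters it contains, including half-width 0x20 spaces, is to dump each code point with the standard-library unicodedata module. The sample string below is hypothetical, just to show the technique.)

```python
import unicodedata

# Hypothetical sample containing a half-width (0x20) space between syllables
s = "번 홀"

# Print the code point and Unicode name of every character in the string
for ch in s:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# The middle character shows up as U+0020 SPACE
```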
I have to check that. Not sure how this dic works.
Hello again, @mirfan899
OK, I don't think you need to concern yourself with my question re: whitespace chars, since I just took a look at the source code for spacy/lang/ko/__init__.py.
Could you please do me a favor and use this shortened version of your original string, and let me know what happens (please attach any output).
import spacy
nlp = spacy.blank("ko")
nlp("번홀")
Okay, so the issue was actually the mecab-ko-dic version. After installing version 1.6.1, everything works. But when I install mecab from the Ubuntu packages, it breaks.
>>> import spacy
>>> nlp = spacy.blank("ko")
>>> nlp("번홀")
>>> nlp
<spacy.lang.ko.Korean object at 0x7f022fa86eb8>
>>> doc = nlp("번홀")
>>> doc.text
'번홀'
Yes, I suspected that there was some sort of issue in the default output of the underlying dictionary.
In __init__.py, at line 35, there is the following output-format specification:
self.mecab_tokenizer = MeCab("-F%f[0],%f[7]")
This apparently instructs MeCab to output the 1st and 8th feature fields of whatever the default output is for the dictionary being used.
However, for the token in question, the 1st or 8th field might not even exist, which causes MeCab to throw the "given index is out of range" error (this floats up through natto-py, and it is what you are seeing).
This is not a bug in natto-py, nor is it a problem with MeCab itself. Rather, it is likely that the aforementioned -F%f[0],%f[7] output format specification does not (in my opinion) account for the likely case where the underlying dictionary's default output does not contain the expected number of fields, or where a word is unknown to the dictionary.
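To illustrate the failure mode, here is a plain-Python sketch (not MeCab's actual implementation; the field values are made up): asking for field index 7 from a feature list that is shorter than 8 entries raises exactly this kind of out-of-range error.

```python
def format_token(features, indices=(0, 7)):
    """Mimic a "%f[0],%f[7]" node format: raises IndexError when a
    requested field does not exist, analogous to MeCab's
    "given index is out of range" error."""
    return ",".join(features[i] for i in indices)

known = ["NNG", "a", "b", "c", "d", "e", "f", "beonhol"]  # 8 fields
unknown = ["UNKNOWN", "*", "*", "*"]                       # only 4 fields

print(format_token(known))  # prints "NNG,beonhol"
try:
    format_token(unknown)
except IndexError:
    print("given index is out of range (analogous error)")
```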
I am going to close this issue since you seem to have worked things out on your own, but you may wish to raise it with the maintainers of the ko NLP model.
Hi, I'm trying to use the spaCy library, and it requires natto-py to run tokenization and other pipelines. When I try to tokenize a sentence in Korean, it throws the following error.
I have tried multiple spaCy versions (2.3.5, 3.0.1), Python 3.6.9, Ubuntu 18.04.