buruzaemon / natto-py

natto-py combines the Python programming language with MeCab, the part-of-speech and morphological analyzer for the Japanese language.
BSD 2-Clause "Simplified" License

[FIX] Avoid to return broken result #111

Closed himkt closed 4 years ago

himkt commented 4 years ago

natto.MeCab().parse(' 天使のケーキ') returns a broken result because `string.strip()` removes whitespace without considering its width (half or full). In this PR, I edit mecab.py so that it explicitly removes only half-width whitespace.
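To illustrate the behavior described above, here is a minimal sketch in plain Python; it is not the actual change made in mecab.py. The helper name `strip_half_width` and the constant `FULL_WIDTH_SPACE` are hypothetical and used only for illustration. It relies on the fact that `str.strip()` with no arguments removes every Unicode whitespace character, including the full-width (ideographic) space U+3000, while passing an explicit character set restricts stripping to those characters.

```python
# Hypothetical illustration only; not the code from the natto-py patch.
FULL_WIDTH_SPACE = "\u3000"  # IDEOGRAPHIC SPACE (full-width whitespace)


def strip_half_width(text: str) -> str:
    """Strip only ASCII (half-width) whitespace from both ends of text."""
    return text.strip(" \t\n\r\f\v")


text = FULL_WIDTH_SPACE + "天使のケーキ"

# Default strip() treats U+3000 as whitespace and removes it.
print(repr(text.strip()))

# Restricting the character set keeps the leading full-width space.
print(repr(strip_half_width(text)))
```

With the default `strip()`, the leading full-width space is lost before the text reaches MeCab, which would explain the empty surface in the "Before" output; keeping it lets the whitespace token retain its surface form.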

Before

In [1]: import natto

In [2]: nm = natto.MeCab(as_nodes=True)

In [3]: text = ' 天使のケーキ'

In [4]: print(nm.parse(text))
記号,空白,*,*,*,*, , ,
天使    名詞,一般,*,*,*,*,天使,テンシ,テンシ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
ケーキ  名詞,一般,*,*,*,*,ケーキ,ケーキ,ケーキ
EOS

After

In [1]: import natto

In [2]: nm = natto.MeCab()

In [3]: text = ' 天使のケーキ'

In [4]: print(nm.parse(text))
       記号,空白,*,*,*,*, , ,  # <- full-width whitespace here!!
天使    名詞,一般,*,*,*,*,天使,テンシ,テンシ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
ケーキ  名詞,一般,*,*,*,*,ケーキ,ケーキ,ケーキ
EOS

himkt commented 4 years ago

I updated the test cases to check whether natto successfully handles full-width whitespace. I also edited test_utf8.txt to insert a full-width whitespace. Although the same change is required for test_sjis.txt, I couldn't make it because I don't have a Windows environment.

buruzaemon commented 4 years ago

@himkt thank you very much for raising this issue and providing a patch. And thank you for also providing an updated test case. I have access to Windows environment, so I will update the test_sjis.txt file myself.

himkt commented 4 years ago

@buruzaemon Thank you for the quick response! I would really appreciate it if you could release a new version of natto-py including this patch on PyPI. :bow:

> I have access to Windows environment, so I will update the test_sjis.txt file myself.

Amazing! Although I think this patch also works on Windows, feel free to mention here if something goes wrong.

buruzaemon commented 4 years ago

I will need some time to make some changes for the tests on Windows. I am going to try to release in a couple of days if I can. I apologize for the delay...

buruzaemon commented 4 years ago

Thanks for waiting! I just released 0.9.1, based off of your patch. Let me know if you find any other issues or if you have other concerns!