accel-brain / accel-brain-code

The purpose of this repository is to provide prototypes as case studies for the proof-of-concept (PoC) and research-and-development (R&D) work described on my website. The main research topics are Auto-Encoders in relation to representation learning, statistical machine learning for energy-based models, generative adversarial networks (GANs), Deep Reinforcement Learning such as Deep Q-Networks, semi-supervised learning, and neural network language models for natural language processing.
https://accel-brain.co.jp
GNU General Public License v2.0

No module named 'MeCab' #10

Closed. TimurNurlygayanov closed this issue 4 years ago.

TimurNurlygayanov commented 4 years ago

Hi, thank you for the great library!

It looks like I've found an issue. Here is my code example:


from pysummarization.nlp_base import NlpBase
from pysummarization.nlpbase.auto_abstractor import AutoAbstractor
from pysummarization.tokenizabledoc.mecab_tokenizer import MeCabTokenizer
from pysummarization.abstractabledoc.top_n_rank_abstractor import TopNRankAbstractor
from pysummarization.similarityfilter.tfidf_cosine import TfIdfCosine

document = """
my long text here
"""

# The object of the NLP.
nlp_base = NlpBase()
# Set the tokenizer. This is a Japanese tokenizer based on MeCab.
nlp_base.tokenizable_doc = MeCabTokenizer()

# The object of `Similarity Filter`.
# The similarity observed by this object is so-called cosine similarity of Tf-Idf vectors.
similarity_filter = TfIdfCosine()

# Set the object of NLP.
similarity_filter.nlp_base = nlp_base

# If the similarity exceeds this value, the sentence will be cut off.
similarity_filter.similarity_limit = 0.25

# The object of automatic summarization.
auto_abstractor = AutoAbstractor()
# Set the tokenizer. This is a Japanese tokenizer based on MeCab.
auto_abstractor.tokenizable_doc = MeCabTokenizer()
# The object for abstracting and filtering the document.
abstractable_doc = TopNRankAbstractor()
# Delegate the objects and execute summarization.
result_dict = auto_abstractor.summarize(document, abstractable_doc, similarity_filter)

# Output result.
for i, sentence in enumerate(result_dict["summarize_result"]):
    print(sentence, result_dict["scoring_data"][i])

And the result of executing it with Python 3.7.4:

python3 summarizer.py 
Traceback (most recent call last):
  File "summarizer.py", line 3, in <module>
    from pysummarization.tokenizabledoc.mecab_tokenizer import MeCabTokenizer
  File "/usr/local/lib/python3.7/site-packages/pysummarization/tokenizabledoc/mecab_tokenizer.py", line 3, in <module>
    import MeCab
ModuleNotFoundError: No module named 'MeCab'
TimurNurlygayanov commented 4 years ago

Note: I've used the code from the README example.

Claytone commented 4 years ago

Have you tried pip install --upgrade pip and pip install MeCab?
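
A side note: the MeCab Python binding is published on PyPI as mecab-python3 rather than MeCab, and it also needs a dictionary at runtime. If the Japanese tokenizer really is wanted, something like the following is more likely to make import MeCab succeed (unidic-lite is one commonly used dictionary package, suggested here as an assumption rather than anything this project requires):

pip install --upgrade pip
pip install mecab-python3 unidic-lite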

chimera0 commented 4 years ago

Isn't it better to use SimpleTokenizer instead of MeCabTokenizer?

MeCab is a library for morphological analysis of natural Japanese sentences. If you want to analyze Japanese, you should use this tokenizer after installing MeCab, but your target seems to be English text.

MeCabTokenizer is a TokenizableDoc that tokenizes Japanese words, while SimpleTokenizer is a TokenizableDoc that tokenizes mainly English words. (I've only tested it with English and Japanese.)

Let's change

from pysummarization.tokenizabledoc.mecab_tokenizer import MeCabTokenizer

to

from pysummarization.tokenizabledoc.simple_tokenizer import SimpleTokenizer

and

nlp_base.tokenizable_doc = MeCabTokenizer()

to

nlp_base.tokenizable_doc = SimpleTokenizer()
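
For reference, here is the full script from this issue with those two substitutions applied; a sketch only, assuming the rest of the README example stays unchanged:

from pysummarization.nlp_base import NlpBase
from pysummarization.nlpbase.auto_abstractor import AutoAbstractor
from pysummarization.tokenizabledoc.simple_tokenizer import SimpleTokenizer
from pysummarization.abstractabledoc.top_n_rank_abstractor import TopNRankAbstractor
from pysummarization.similarityfilter.tfidf_cosine import TfIdfCosine

document = """
my long text here
"""

# The object of the NLP.
nlp_base = NlpBase()
# Set the tokenizer. SimpleTokenizer handles English text and does not import MeCab.
nlp_base.tokenizable_doc = SimpleTokenizer()

# The object of `Similarity Filter`.
# The similarity observed by this object is the cosine similarity of Tf-Idf vectors.
similarity_filter = TfIdfCosine()

# Set the object of NLP.
similarity_filter.nlp_base = nlp_base

# If the similarity exceeds this value, the sentence will be cut off.
similarity_filter.similarity_limit = 0.25

# The object of automatic summarization.
auto_abstractor = AutoAbstractor()
# Set the same tokenizer here as well.
auto_abstractor.tokenizable_doc = SimpleTokenizer()
# The object for abstracting and filtering the document.
abstractable_doc = TopNRankAbstractor()
# Delegate the objects and execute summarization.
result_dict = auto_abstractor.summarize(document, abstractable_doc, similarity_filter)

# Output result.
for i, sentence in enumerate(result_dict["summarize_result"]):
    print(sentence, result_dict["scoring_data"][i])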
chimera0 commented 4 years ago

It doesn't seem to be a critical issue. I'll close it for now.