chartbeat-labs / textacy

NLP, before and after spaCy
https://textacy.readthedocs.io

ZeroDivisionError and ValueError in keyterms extraction with sCAKE #270

Closed lshostenko closed 5 years ago

lshostenko commented 5 years ago

Got some unexpected behavior while experimenting with the new functionality in textacy==0.8.0. I would expect an empty result instead of an exception.

import spacy
from textacy.ke import scake

nlp = spacy.load('en_core_web_lg')

text_1 = '• American University of Dubai (Media City). • Business Bay of Dubai (Bay Avenue). • Sheikh Zayed of Dubai (Al Durrah Tower). • Abu Dhabi Airport. INTERESTED IN A FRANCHISE? FOLLOW US ON. DOWNLOAD MENU. shawarmanji in dubai.\n• American University of Dubai (Media City). • Business Bay of Dubai (Bay Avenue). • Sheikh Zayed of Dubai (Al Durrah Tower). • Abu Dhabi Airport. FOLLOW US ON. DOWNLOAD MENU.'
doc_1 = nlp(text_1)

scake(doc_1, include_pos=('NOUN', 'ADJ'))  # results in ZeroDivisionError
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-7-6936e9cd8bd4> in <module>
----> 1 scake(doc, include_pos=('NOUN', 'ADJ'))

~/py/ocean/ocean_ai/env/lib/python3.6/site-packages/textacy/ke/scake.py in scake(doc, normalize, include_pos, topn)
     89     )
     90 
---> 91     word_scores = _compute_word_scores(doc, graph, cooc_mat, normalize)
     92 
     93     # generate a list of candidate terms

~/py/ocean/ocean_ai/env/lib/python3.6/site-packages/textacy/ke/scake.py in _compute_word_scores(doc, graph, cooc_mat, normalize)
    129     sem_connectivities = {
    130         w: len(set(max_truss_levels[nbr] for nbr in graph.neighbors(w))) / max_truss_level
--> 131         for w in word_strs
    132     }
    133     # "positional weight" component

~/py/ocean/ocean_ai/env/lib/python3.6/site-packages/textacy/ke/scake.py in <dictcomp>(.0)
    129     sem_connectivities = {
    130         w: len(set(max_truss_levels[nbr] for nbr in graph.neighbors(w))) / max_truss_level
--> 131         for w in word_strs
    132     }
    133     # "positional weight" component

ZeroDivisionError: division by zero
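From the traceback, the division that fails is the `/ max_truss_level` in the semantic-connectivity comprehension, so the error presumably occurs whenever the k-truss decomposition assigns every word a max truss level of 0. A minimal stand-in sketch (not textacy's actual code; `max_truss_levels` here is a hypothetical per-word mapping) of the failing pattern and an obvious guard:

```python
# Hypothetical truss levels for a degenerate co-occurrence graph:
# every word ends up at level 0, so the normalizing denominator is 0.
max_truss_levels = {"foo": 0, "bar": 0}
max_truss_level = max(max_truss_levels.values())  # 0 for this graph

def semantic_connectivity(neighbors):
    # mirrors the dict comprehension from the traceback
    return len({max_truss_levels[n] for n in neighbors}) / max_truss_level

try:
    semantic_connectivity(["bar"])
except ZeroDivisionError as e:
    print(e)  # division by zero

def safe_semantic_connectivity(neighbors):
    # one possible guard: fall back to 0.0 when there is no truss structure
    if max_truss_level == 0:
        return 0.0
    return len({max_truss_levels[n] for n in neighbors}) / max_truss_level

print(safe_semantic_connectivity(["bar"]))  # 0.0
```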
text_2 = 'Tricycle provides simulated carpet sampling, online design tools and services for the flooring industry.'
doc_2 = nlp(text_2)

scake(doc_2)  # results in ValueError
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-11-6936e9cd8bd4> in <module>
----> 1 scake(doc, include_pos=('NOUN', 'ADJ'))

~/py/ocean/ocean_ai/env/lib/python3.6/site-packages/textacy/ke/scake.py in scake(doc, normalize, include_pos, topn)
     66     cooc_mat = collections.Counter()
     67     n_sents = itertoolz.count(doc.sents)  # in case doc only has 1 sentence
---> 68     for sent1, sent2 in itertoolz.sliding_window(min(2, n_sents), doc.sents):
     69         window_words = (
     70             word

ValueError: not enough values to unpack (expected 2, got 1)
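Here the cause is visible in the traceback itself: for a one-sentence doc, `min(2, n_sents)` is 1, so `sliding_window` yields 1-tuples that cannot be unpacked into `sent1, sent2`. A self-contained sketch of that mechanism, using a simplified stand-in for `cytoolz.itertoolz.sliding_window`:

```python
def sliding_window(n, seq):
    # simplified stand-in for cytoolz.itertoolz.sliding_window
    seq = list(seq)
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

sents = ["only sentence"]
# min(2, n_sents) == 1, so each window is a 1-tuple
windows = sliding_window(min(2, len(sents)), sents)
print(windows)  # [('only sentence',)]

try:
    for sent1, sent2 in windows:
        pass
except ValueError as e:
    print(e)  # not enough values to unpack (expected 2, got 1)
```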

I used Python 3.6.7 on Ubuntu 18.04.

bdewilde commented 5 years ago

Thanks for letting me know! I've pushed a fix to the dev branch, which will get included in the next release, coming soon.
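Until the fixed release lands, callers could wrap the extractor defensively. A hypothetical helper (`safe_keyterms` and `failing_extractor` are illustrative names, not part of textacy) that returns an empty result on these two edge-case errors, matching the behavior the reporter expected:

```python
def safe_keyterms(extract_fn, doc, **kwargs):
    """Call a keyterm extractor; return [] on the known edge-case errors."""
    try:
        return extract_fn(doc, **kwargs)
    except (ZeroDivisionError, ValueError):
        return []

# usage sketch with a stand-in extractor that raises like scake does
# on a degenerate doc:
def failing_extractor(doc, **kwargs):
    raise ZeroDivisionError("division by zero")

print(safe_keyterms(failing_extractor, "some doc"))  # []
```

In practice one would pass `scake` itself as `extract_fn`, e.g. `safe_keyterms(scake, doc_1, include_pos=('NOUN', 'ADJ'))`.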