anhaidgroup / py_stringmatching

A comprehensive and scalable set of string tokenizers and similarity measures in Python
https://sites.google.com/site/anhaidgroup/projects/py_stringmatching
BSD 3-Clause "New" or "Revised" License

SoftTFIDF get_raw_score failing with float division by zero #62

Open rafamonge opened 4 years ago

rafamonge commented 4 years ago

I'm getting an exception when calling get_raw_score on the SoftTfIdf similarity measure. It only happens with a specific corpus, which I'm unfortunately unable to share, so the code snippet isn't fully reproducible.

import py_stringmatching as sm
print(sm.__version__)
soft_tfidf = sm.SoftTfIdf(corpus, threshold=0.9)  # corpus: private list of token lists, not shown
soft_tfidf.get_raw_score(['AWN', 'AL'], ['ONEP'])
0.4.1
---------------------------------------------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-100-fcbb2f491b64> in <module>
      2 print(sm.__version__)
      3 soft_tfidf =sm.SoftTfIdf(corpus, threshold=0.9)
----> 4 soft_tfidf.get_raw_score(['AWN', 'AL'], ['ONEP'])

C:\ProgramData\Anaconda3\lib\site-packages\py_stringmatching\similarity_measure\soft_tfidf.py in get_raw_score(self, bag1, bag2)
    134             v_y = idf * tf_y.get(element, 0)
    135             v_y_2 += v_y * v_y
--> 136         return result if v_x_2 == 0 else result / (sqrt(v_x_2) * sqrt(v_y_2))
    137 
    138     def get_corpus_list(self):

ZeroDivisionError: float division by zero

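I added a print right before line 136. The root cause is that v_y_2 is equal to zero: the guard on that line only checks v_x_2 == 0, so when v_y_2 is zero the denominator sqrt(v_x_2) * sqrt(v_y_2) is zero and the division fails.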
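
For illustration, here is a minimal sketch of what a guard covering both norms could look like. It is not the library's code: normalize_score is a hypothetical standalone helper, and returning the unnormalized result when either squared norm is zero is an assumption that simply mirrors what line 136 already does for v_x_2.

```python
from math import sqrt


def normalize_score(result, v_x_2, v_y_2):
    """Hypothetical helper mirroring the final step of get_raw_score.

    result : accumulated (unnormalized) similarity
    v_x_2  : squared norm of the first bag's weight vector
    v_y_2  : squared norm of the second bag's weight vector
    """
    # Assumption: treat a zero norm on either side the same way the
    # existing code treats v_x_2 == 0, i.e. skip the normalization.
    if v_x_2 == 0 or v_y_2 == 0:
        return result
    return result / (sqrt(v_x_2) * sqrt(v_y_2))


# The failing case from the traceback: v_y_2 is zero, v_x_2 is not.
print(normalize_score(0.0, 4.0, 0.0))
# A regular case where both norms are nonzero.
print(normalize_score(6.0, 4.0, 9.0))
```

With this check, the case where v_y_2 is zero no longer reaches the division, and inputs with two nonzero norms are normalized exactly as before.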