anhaidgroup / py_stringmatching

A comprehensive and scalable set of string tokenizers and similarity measures in Python
https://sites.google.com/site/anhaidgroup/projects/py_stringmatching
BSD 3-Clause "New" or "Revised" License

Problem with Py_stringmatching GeneralizedJaccard #75

Closed arz1111 closed 2 years ago

arz1111 commented 2 years ago

I'm using GeneralizedJaccard from the py_stringmatching package to measure the similarity between two strings. According to this document:

... If the similarity of a token pair exceeds the threshold, then the token pair is considered a match ...

For example, for the word pair 'method' and 'methods' we have:

print(sm.Levenshtein().get_sim_score('method','methods'))
>>0.8571428571428572

The similarity between the example word pair is 0.857, which is greater than 0.80, so this pair must be considered a match. I would therefore expect the final GeneralizedJaccard output for two near-duplicate sentences to be equal to 1, but it's 0.97:

import py_stringmatching as sm

str1='All tokenizers have a tokenize method'
str2='All tokenizers have a tokenize methods'

alphabet_tok_set = sm.AlphabeticTokenizer(return_set=True)

gj = sm.GeneralizedJaccard(sim_func=sm.Levenshtein().get_sim_score, threshold=0.8)
print(gj.get_raw_score(alphabet_tok_set.tokenize(str1), alphabet_tok_set.tokenize(str2)))

>>0.9761904761904763

So what is the problem?!

arz1111 commented 2 years ago

The answer is that after a pair is considered a match, the similarity score of that pair is used in the Jaccard formula instead of 1. The threshold only decides which token pairs count as matches; the matched pair's actual similarity score, not 1.0, is what enters the numerator.
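The reported 0.976 can be reproduced by hand from this reading of the formula. Below is a minimal sketch (not py_stringmatching code): `lev_dist`, `lev_sim`, and `generalized_jaccard` are illustrative helper names, and the greedy highest-score-first pairing is a simplification of the library's internal matching step, though it yields the same result on this example.

```python
# Hand re-computation of the score above. NOTE: these helpers are
# illustrative, not py_stringmatching APIs; the greedy matching is a
# simplified stand-in for the library's pairing step.

def lev_dist(s, t):
    # Classic dynamic-programming Levenshtein edit distance.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (cs != ct)))    # substitution
        prev = cur
    return prev[-1]

def lev_sim(s, t):
    # Normalized similarity: 1 - distance / max(len), as in the issue.
    if not s and not t:
        return 1.0
    return 1.0 - lev_dist(s, t) / max(len(s), len(t))

def generalized_jaccard(A, B, sim, threshold):
    # Pair tokens greedily, highest similarity first; only pairs whose
    # similarity reaches the threshold may match, each token used once.
    pairs = sorted(((sim(a, b), a, b) for a in A for b in B), reverse=True)
    used_a, used_b = set(), set()
    match_sum, n_matches = 0.0, 0
    for score, a, b in pairs:
        if score < threshold:
            break
        if a in used_a or b in used_b:
            continue
        used_a.add(a)
        used_b.add(b)
        match_sum += score          # the raw score, NOT 1.0, is summed
        n_matches += 1
    # Numerator: sum of match scores; denominator: |A| + |B| - |matches|.
    return match_sum / (len(A) + len(B) - n_matches)

A = 'All tokenizers have a tokenize method'.split()
B = 'All tokenizers have a tokenize methods'.split()
print(generalized_jaccard(A, B, lev_sim, 0.8))  # ≈ 0.97619, i.e. (5 + 6/7) / 6
```

Five tokens match exactly (score 1.0 each) and 'method'/'methods' matches with score 6/7 ≈ 0.857, so the numerator is 5.857 rather than 6, giving 5.857 / 6 ≈ 0.976 instead of 1.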