edkinsgael / airhead-research

Automatically exported from code.google.com/p/airhead-research

Jaccard Index is both slow and incorrect #102

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
What steps will reproduce the problem?
1. Take two sparse vectors, A and B, with 100k dimensions each and the 
following nonzero values:
A: 1 -> 10, 100000 -> 5
B: 2 -> 10, 100000 -> 5
2. Run Similarity.jaccardIndex(A, B)
3. Wait very patiently
4. See that the vectors are reported as equivalent, with a score of 1.0.

What is the expected output? What do you see instead?
The score should be 1/2, since feature 1 appears only in A and feature 2 
appears only in B. The Jaccard Index is an evaluation of feature sets, not 
an evaluation of feature occurrences, so the frequency of an observation 
shouldn't matter, just the existence of some observation.

Original issue reported on code.google.com by FozzietheBeat@gmail.com on 20 Sep 2011 at 5:33

GoogleCodeExporter commented 8 years ago
Actually, the result shouldn't be 1/2; it should be 1/3, since there are three 
unique features (1, 2, and 100000) and one overlapping (100000).
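
For reference, here is a minimal sketch of the set-based computation described above. It is not the library's implementation: the class name `JaccardSketch` and the choice to pass each vector as an `int[]` of its nonzero indices are assumptions made for illustration. It only touches the nonzero indices, so the cost depends on the number of nonzero entries rather than on the 100k dimensions, and it ignores the stored values entirely.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of the library under discussion.
public class JaccardSketch {

    /**
     * Computes |A ∩ B| / |A ∪ B| where A and B are the sets of nonzero
     * indices of two sparse vectors. Values (frequencies) are ignored.
     */
    public static double jaccardIndex(int[] nonZeroA, int[] nonZeroB) {
        Set<Integer> featuresA = new HashSet<>();
        for (int index : nonZeroA) {
            featuresA.add(index);
        }
        Set<Integer> featuresB = new HashSet<>();
        for (int index : nonZeroB) {
            featuresB.add(index);
        }

        // Count the features present in both vectors.
        int intersection = 0;
        for (int index : featuresB) {
            if (featuresA.contains(index)) {
                intersection++;
            }
        }

        // |A ∪ B| = |A| + |B| - |A ∩ B|
        int union = featuresA.size() + featuresB.size() - intersection;

        // Assumption: two empty feature sets score 0.0 to avoid 0/0.
        return union == 0 ? 0.0 : (double) intersection / union;
    }

    public static void main(String[] args) {
        int[] a = {1, 100000};   // nonzero indices of A
        int[] b = {2, 100000};   // nonzero indices of B
        // One shared feature out of three unique features, so about 0.333.
        System.out.println(jaccardIndex(a, b));
    }
}
```

With the vectors from this report, the intersection is {100000} and the union is {1, 2, 100000}, which gives 1/3.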

Original comment by FozzietheBeat@gmail.com on 20 Sep 2011 at 6:23