benhoyt / pybktree

Python BK-tree data structure to allow fast querying of "close" matches
MIT License
170 stars 22 forks source link

Strange behaviour with distance when adding a new entry #4

Closed CMCDragonkai closed 6 years ago

CMCDragonkai commented 6 years ago

I'm attempting to use pybktree for string similarity.

import pybktree
from fuzzywuzzy import fuzz

# this returns a number from 0 to 100
# where 0 is no distance at all
# 100 is maximum distance
def fuzzy_distance(s1, s2):
    similarity = fuzz.token_sort_ratio(s1, s2)
    distance = -1 * (similarity - 100)
    return distance

tree = pybktree.BKTree(fuzzy_distance, [
    'Sungrow Power Supply Co Ltd',
    'sma solar technology ag',
])

tree.find('Sungrow Power Supply Co Ltd', 0)
# [(0, 'Sungrow Power Supply Co Ltd')]

But compare with this:

import pybktree
from fuzzywuzzy import fuzz

# this returns a number from 0 to 100
# where 0 is no distance at all
# 100 is maximum distance
def fuzzy_distance(s1, s2):
    similarity = fuzz.token_sort_ratio(s1, s2)
    distance = -1 * (similarity - 100)
    return distance

tree = pybktree.BKTree(fuzzy_distance, [
    'sma solar technology ag',
])

tree.add('Sungrow Power Supply Co Ltd')

tree.find('Sungrow Power Supply Co Ltd', 0)
# []
tree.find('Sungrow Power Supply Co Ltd', 5)
# [(0, 'Sungrow Power Supply Co Ltd')]

The reason I'm checking for something with 0 distance is to figure out whether a string exists in the tree or not. I need to prevent duplicate submissions.

I'm also finding I need the ability to prune the tree, by removing strings that were already added.

CMCDragonkai commented 6 years ago

DW, I just found that the fuzzy wuzzy metric is not symmetrical on the order.

fuzzy_distance('sma solar technology ag', 'Sungrow Power Supply Co Ltd') # 76
fuzzy_distance( 'Sungrow Power Supply Co Ltd', 'sma solar technology ag') # 72