benhoyt / pybktree

Python BK-tree data structure to allow fast querying of "close" matches
MIT License
169 stars 22 forks source link

Fix adding duplicate items #7

Open Bladieblah opened 2 years ago

Bladieblah commented 2 years ago

There's no check whether an item already exists in the tree, leading to unnecessarily deep trees, this should fix that issue.

benhoyt commented 2 years ago

Thanks. This looks good at first glance, but I don't think it's correct. You can have multiple items with zero distance between them. For example, in the case of image hashing, consider the case when more than one image has the same image hash:


import collections
import pybktree

Image = collections.namedtuple('Image', 'data hash')

def image_distance(a, b):
    return pybktree.hamming_distance(a.hash, b.hash)

image1 = Image('one', 123)
image2 = Image('two', 42)
image3 = Image('three', 123)

t = pybktree.BKTree(image_distance)
t.add(image1)
t.add(image2)
t.add(image3)
print(list(t))
# [Image(data='one', hash=123), Image(data='two', hash=42), Image(data='three', hash=123)]

Iterating over the tree should produce all three images, even though image1 and image3 have the same hash.

What's your use case where this has come up?

Bladieblah commented 2 years ago

Good point, I didn't realise hashes might be used. I was using it to cluster transaction data (textual data), which contain many duplicate samples. I ended with very deep trees, in one case a tree with 27k layers where it should have been 3 (which can still be many nodes of course). I based this change on the wikipedia article explaining the insertion procedure. The article also only considers textual data though, which probably explains the oversight.

In this case it could be solved by replacing the check on distance with a check if the two items are identical, right?

Bladieblah commented 2 years ago

I added the check, but also kept the check on distance since that one is probably less expensive, especially in the case of images.

benhoyt commented 2 years ago

Hmm, this does change the behavior (as shown by the failing doctests), because now you can't add more than one of the same item to the tree. This might be more correct, but I'm hesitant to change the behavior without releasing a new major version (which I'm probably not going to do at this point). My suggestion, if you need this, would be to check that you're not adding an equal/identical item with a separate dict.

c01o commented 1 year ago

it's nice if we have a flag to modify the behavior.