Open Bladieblah opened 2 years ago
Thanks. This looks good at first glance, but I don't think it's correct. You can have multiple items with zero distance between them. For example, in the case of image hashing, consider the case when more than one image has the same image hash:
import collections
import pybktree
Image = collections.namedtuple('Image', 'data hash')
def image_distance(a, b):
return pybktree.hamming_distance(a.hash, b.hash)
image1 = Image('one', 123)
image2 = Image('two', 42)
image3 = Image('three', 123)
t = pybktree.BKTree(image_distance)
t.add(image1)
t.add(image2)
t.add(image3)
print(list(t))
# [Image(data='one', hash=123), Image(data='two', hash=42), Image(data='three', hash=123)]
Iterating over the tree should produce all three images, even though image1
and image3
have the same hash.
What's your use case where this has come up?
Good point, I didn't realise hashes might be used. I was using it to cluster transaction data (textual data), which contain many duplicate samples. I ended with very deep trees, in one case a tree with 27k layers where it should have been 3 (which can still be many nodes of course). I based this change on the wikipedia article explaining the insertion procedure. The article also only considers textual data though, which probably explains the oversight.
In this case it could be solved by replacing the check on distance with a check if the two items are identical, right?
I added the check, but also kept the check on distance since that one is probably less expensive, especially in the case of images.
Hmm, this does change the behavior (as shown by the failing doctests), because now you can't add more than one of the same item to the tree. This might be more correct, but I'm hesitant to change the behavior without releasing a new major version (which I'm probably not going to do at this point). My suggestion, if you need this, would be to check that you're not adding an equal/identical item with a separate dict.
it's nice if we have a flag to modify the behavior.
There's no check whether an item already exists in the tree, leading to unnecessarily deep trees, this should fix that issue.