benhoyt / pybktree

Python BK-tree data structure to allow fast querying of "close" matches
MIT License

BKTree | Identifying similar images from a database #3

Closed by rdecaneva 6 years ago

rdecaneva commented 6 years ago

Hello: I am fairly new to Python, but I found your article very informative and easy to follow. I believe I have the script working for my needs. However, as you mention at the bottom of the article, the larger the database, the slower the comparisons.

I have a script with watchdog that waits for changes on a directory of images. When an image is uploaded, the file is processed, a dhash is generated, and then passed to a SQL database.

I've been experimenting with BK-trees. If I understand them correctly, the entire tree is stored in memory at script runtime. My question is: how do I identify which image is the actual duplicate from the tree? How can I store a primary key or some other unique value in the tree so I can later identify which images are similar to each other?

Thank you!

benhoyt commented 6 years ago

Good question. You do this by using a tuple or namedtuple of (hash, id) for your items, with a distance function that compares only the hash part. See the example I just added to the README, in the paragraph that begins "If you need to track the ID, key, or filename of the original item, use a tuple or namedtuple."
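For anyone landing here later, the (hash, id) idea can be sketched without any dependencies. This is only an illustrative sketch: `TinyBKTree` is a hypothetical stand-in written for this example (it is not pybktree's code), and the field names `bits` and `id`, the filenames, and the 4-bit hashes are all made up. The point it shows is that the distance function looks only at the hash, so the ID rides along with each item and comes back in the results.

```python
import collections

# Hypothetical record type: `bits` is the image hash (e.g. a dHash stored as
# an int) and `id` is whatever identifies the image (primary key, filename).
Item = collections.namedtuple('Item', 'bits id')


def hamming_distance(x, y):
    """Number of differing bits between two non-negative ints."""
    return bin(x ^ y).count('1')


def item_distance(a, b):
    """Compare items by hash only; the id is carried along untouched."""
    return hamming_distance(a.bits, b.bits)


class TinyBKTree:
    """Minimal BK-tree for illustration only (not pybktree's code)."""

    def __init__(self, distance_func, items=()):
        self.distance_func = distance_func
        self.root = None  # each node is (item, {distance: child_node})
        for item in items:
            self.add(item)

    def add(self, item):
        if self.root is None:
            self.root = (item, {})
            return
        node = self.root
        while True:
            parent, children = node
            d = self.distance_func(item, parent)
            if d in children:
                node = children[d]  # descend along the edge labelled d
            else:
                children[d] = (item, {})
                return

    def find(self, item, n):
        """Return (distance, item) pairs within distance n, nearest first."""
        if self.root is None:
            return []
        found = []
        stack = [self.root]
        while stack:
            candidate, children = stack.pop()
            d = self.distance_func(item, candidate)
            if d <= n:
                found.append((d, candidate))
            # Triangle inequality: only edges labelled d-n..d+n can match.
            for child_d, child in children.items():
                if d - n <= child_d <= d + n:
                    stack.append(child)
        found.sort(key=lambda pair: pair[0])
        return found


# Made-up 4-bit hashes and filenames, purely for demonstration.
tree = TinyBKTree(item_distance, [
    Item(0b1111, 'cat.jpg'),
    Item(0b1110, 'cat_resized.jpg'),
    Item(0b0000, 'dog.jpg'),
])

# The query's id is irrelevant because item_distance ignores it.
for distance, match in tree.find(Item(0b1101, None), 2):
    print(distance, match.id)
```

With pybktree itself, the same pattern should apply by passing `item_distance` and the items to `pybktree.BKTree(...)` and calling `tree.find(item, n)`, which returns (distance, item) pairs. In the watchdog setup described above, the `id` field could hold the SQL primary key, so each match can be looked up in the database directly.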