fake-name / pg-spgist_hamming

Fast hamming-distance range searches via the native SP-GiST indexing facility in PostgreSQL
BSD 3-Clause "New" or "Revised" License

Example performance numbers? #1

Open zslayton opened 6 years ago

zslayton commented 6 years ago

Hi, I found this project by way of your Stack Exchange post. I was considering trying something like this myself and would be very interested to know what sort of precision a BK tree index was able to offer at different Hamming distances. Do you have any example performance numbers (e.g. number of records scanned vs. number of records returned) that you could share? Thanks!

fake-name commented 6 years ago

I don't have numbers currently, but it's considerably slower than the in-memory BK-tree I wrote.

It mostly only wins on the convenience front, but MAN is it way more convenient, and it's fast enough.

I'm not actually sure how to go about getting detailed stats out of Postgres for this sort of thing.
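
For what it's worth, `EXPLAIN (ANALYZE, BUFFERS)` is one way to pull rough per-query scan statistics (actual rows visited, buffers touched) out of PostgreSQL. A minimal sketch, assuming a table `images(id, phash BIGINT)` indexed by this extension; the `hamming_distance(phash, %s) <= 4` predicate is a hypothetical placeholder for whatever operator or function the extension actually installs:

```python
# Hypothetical sketch: compare how much an index scan touches vs. what it returns.
# The table/column names and the hamming_distance() predicate are assumptions;
# substitute whatever operator or function this extension actually provides.
import psycopg2

conn = psycopg2.connect("dbname=phash_test")
with conn.cursor() as cur:
    cur.execute(
        "EXPLAIN (ANALYZE, BUFFERS) "
        "SELECT id FROM images WHERE hamming_distance(phash, %s) <= 4",
        (0x3A1C5E9D2B704685,),
    )
    for (line,) in cur.fetchall():
        # Lines like 'actual ... rows=...' and 'Buffers: shared hit=...' show how
        # much of the index and heap the query actually visited.
        print(line)
```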


I'm not sure what you mean by "precision". This isn't an approximate search: for each query there is only one correct return set, and in the testing I've done (see the Test_db_BKTree_xxx.py files), that's exactly what it returns.

yjhjstz commented 5 years ago

Have you tested precision and recall on the SIFT1M dataset?

fake-name commented 5 years ago

@yjhjstz - I'm not sure what those are, and this isn't a generic fuzzy image searching facility. It's a database-backed implementation of one specific data structure that makes building a fuzzy image search system easy, assuming your image-hashing function outputs 64-bit hash values where similarity is a function of bitwise hamming distance between two hashes.
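
For concreteness, the similarity measure in question is just the number of bit positions in which two 64-bit hashes differ; a minimal sketch:

```python
# Hamming distance between two 64-bit perceptual hashes:
# XOR them, then count the set bits in the result.
def hamming64(a: int, b: int) -> int:
    return bin((a ^ b) & 0xFFFFFFFFFFFFFFFF).count("1")

# Hashes of visually similar images differ in only a few bit positions.
print(hamming64(0x3A1C5E9D2B704685, 0x3A1C5E9D2B704687))  # -> 1
```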

Given that, it doesn't have a "precision", as any query has one (and only one) valid output set, and all the tests I've done indicate that you get exactly that result.

Note that I have built a fuzzy image searching system on top of this, but it's:

  1. Not in this repo (it's here)
  2. It uses a DCT-based perceptual hashing approach (mildly tweaked here; [described here](http://hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html)); a rough sketch of the approach follows below this list
  3. Not really finished.
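
Roughly, the DCT-based hashing described in that article boils down to: shrink the image, take a 2-D DCT, keep the low-frequency block, and threshold it into bits. A sketch of that idea (not the linked project's actual code), assuming Pillow, NumPy, and SciPy are available:

```python
# Sketch of a DCT-based perceptual hash (per the "Looks Like It" article);
# not the exact code used by the linked project.
import numpy as np
from PIL import Image
from scipy.fft import dct

def phash64(path, hash_size=8, scale=4):
    # Shrink to a small grayscale image so the DCT is cheap and fine detail is discarded.
    size = hash_size * scale
    img = Image.open(path).convert("L").resize((size, size), Image.LANCZOS)
    pixels = np.asarray(img, dtype=np.float64)
    # 2-D DCT, keeping only the low-frequency top-left block.
    freqs = dct(dct(pixels, axis=0), axis=1)
    low = freqs[:hash_size, :hash_size]
    # One bit per coefficient: above or below the block's median.
    bits = (low > np.median(low)).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)
```

Two images that look alike produce hashes with a small hamming distance between them, which is exactly what the index here is built to search on.
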
fake-name commented 5 years ago

Looking at the SIFT1M webpage (here?), I'm not sure what the feature vectors they're using look like, but assuming their distance metric obeys the triangle inequality, the approach used here would probably work, though the index would need to be considerably more complicated if the vector size is larger than 64 bits.
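
For anyone unfamiliar with the data structure, the triangle inequality is what lets a BK-tree prune subtrees during a range search: a child hanging off an edge of distance `e` can only contain matches if `|e - d(query, node)| <= max_distance`. A minimal in-memory sketch (an illustration of the idea, not the index code in this repo):

```python
# Minimal BK-tree sketch over hamming distance; illustration only.
def hamming(a, b):
    return bin(a ^ b).count("1")

class BKTree:
    def __init__(self, distance=hamming):
        self.distance = distance
        self.root = None                     # node = (value, {edge_distance: child_node})

    def insert(self, value):
        if self.root is None:
            self.root = (value, {})
            return
        node_value, children = self.root
        while True:
            d = self.distance(value, node_value)
            if d == 0:
                return                       # value already present
            if d in children:
                node_value, children = children[d]
            else:
                children[d] = (value, {})
                return

    def search(self, query, max_distance):
        if self.root is None:
            return []
        results, stack = [], [self.root]
        while stack:
            value, children = stack.pop()
            d = self.distance(query, value)
            if d <= max_distance:
                results.append((d, value))
            # Triangle inequality: only edges within max_distance of d can hold matches.
            for edge, child in children.items():
                if abs(edge - d) <= max_distance:
                    stack.append(child)
        return results

tree = BKTree()
for h in (0b1010, 0b1000, 0b1111, 0b0000):
    tree.insert(h)
print(tree.search(0b1011, max_distance=1))   # matches 0b1010 and 0b1111, each at distance 1
```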

This index exploits the fact that the pHash values it's designed to index are the same size as the index's native pointers, and therefore stores the hash values directly in the index itself. PostgreSQL can support out-of-line index data, but that means more indirection and an additional lookup for each index comparison operation, so you'd at least double the number of RAM accesses. It'd probably also have poor cache-locality effects.

yjhjstz commented 5 years ago

Thanks. Maybe you can try the MNIST dataset; it has features and labels, so precision can be computed according to the labels.

fake-name commented 5 years ago

@yjhjstz - That sounds neat. Let me know if you have performance numbers I can add to the repo.

yjhjstz commented 5 years ago

Not only performance; I also care about top-1 precision.

fake-name commented 5 years ago

Again, this is just a database-backed BK-tree-based index implementation. It has no precision.

yjhjstz commented 5 years ago

I know, thanks again.