fake-name / pg-spgist_hamming

Fast hamming-distance range searches via native GiST Indexing facility in PostgreSQL
BSD 3-Clause "New" or "Revised" License
167 stars 18 forks source link

pg-spgist_hamming

Tested on PostgreSQL 9.6 and 11

This repository contains two SP-GiST index extensions.

Principally, they detail my progress while implementing a [BK-Tree][1] as a native C PostgreSQL indexing extension. This is extremely useful for certain types of searches, primarily related to fuzzy-image searching.

The subdirectories are as follows:

Anyways, in benchmarking, the PostgreSQL implementation of a BK-tree is approximately 33% - 50% as fast as the C++ implementation, presumably due to the additional overhead of the PG tuple, SP-GiST and GiST mechanisms. While this is annoying, it's not a significant-enough performance loss to motivate me to continue using the C++ version, due to the significant implementation complexity of maintaining an out-of-database additional index. Adding more RAM to the PostgreSQL host may also help here. My test system has 32 GB of ram, and the C++ BK-Tree implementation alone requires ~18 GB to contain the entire dataset.

[1] : http://blog.notdot.net/2007/4/Damn-Cool-Algorithms-Part-1-BK-Trees
[2] : https://github.com/fake-name/IntraArchiveDeduplicator/blob/92da07a75928b803a23d0e2940c40013da8ea115/deduplicator/bktree.hpp
[3] : https://github.com/fake-name/IntraArchiveDeduplicator/tree/master/Tests


Quickstart:

This module has a simple makefile that uses pg_config to do it's magic. Check you have the pg_config shell command, and that it's output looks reasonable.

If you do, installing is a two steps:

cd bktree
make
sudo make install
sudo make installcheck   # to run tests that check everything installed correctly.

Note that installcheck currently fails on postgresql not 9.5, due to minor changes in the output of EXPLAIN ANALYZE. The extension works correctly, but the tests work by diffing the output of queries executed via psql, so minor changes in the output formatting can produce false breakages.

Once you have it installed:

# Enable extension in current database (Note: This is per-database, so if you want to use it on 
# multiple DBs, you'll have to enable it in each.
CREATE EXTENSION bktree;

# Use the enabled extension to create an index. 
# phash_column MUST be a int64 ("bigint") type.
CREATE INDEX bk_index_name ON table_name USING spgist (phash_column bktree_ops);

# Query across the table within a specified edit distance.
SELECT <columns here> FROM table_name WHERE phash_column <@ (target_phash_int64, search_distance_int);

You'll need to replace things like bk_index_name, table_name, target_phash_int64, search_distance_int, and phash_column with appropriate values for your database.

phash_column must be a column of type bigint. Currently, only 64-bit phash values are supported, and they're stored in signed format.