benhoyt / pybktree

Python BK-tree data structure to allow fast querying of "close" matches
MIT License

Misleading time complexity #5

Closed · mxmlnkn closed this 4 years ago

mxmlnkn commented 4 years ago

I think the time complexity is misleading.

To check N images against each other, a complexity of N log(N) is advertised, i.e., an O(log N) lookup per image.

However, this is basically never the case. It would only be true if we descended into exactly one child per node. That can actually happen when we call BKTree.find( item, 0 ), i.e., when we are only interested in items with distance 0. But as soon as we descend into multiple children on average, we are no longer O(log N). These websites also wrongly list O(log N) for the lookup complexity, while this website explicitly states that O(log N) is not the case. The O(log N) claim can easily be disproven with benchmarks and also theoretically.

For the time complexity of the ranged query "find all <= d", consider a tree with N elements and a distance measure with values in [0, D); e.g., for 32-bit integers D is roughly 32 because the Hamming distance can range over [0, 32]. Assume further that the tree is pretty well balanced and saturated, i.e., each node has D=32 children. For such a ranged query, we compute the distance to the current node and then have to descend into all children whose edge distance lies in [dist_to_current - d, dist_to_current + d]. So, as a rough estimate, we need to check min(2d+1, D) children on each level.
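
To make the descent rule concrete, here is a minimal sketch of such a ranged query, assuming the usual BK-tree node layout of (element, {edge_distance: child}) pairs; the function name `range_query` is mine and this is not pybktree's actual code:

```python3
def range_query( node, item, d, distance ):
    """Return all elements within distance d of item in a BK-tree.

    node is assumed to be a pair (element, children), where children maps
    the edge distance to a child node (the usual BK-tree layout).
    """
    element, children = node
    dist = distance( element, item )
    results = [ element ] if dist <= d else []
    # The triangle inequality only rules out subtrees whose edge distance
    # lies outside [dist - d, dist + d]; every other child must be visited,
    # i.e., up to min(2d + 1, D) children per node.
    for edge, child in children.items():
        if dist - d <= edge <= dist + d:
            results.extend( range_query( child, item, d, distance ) )
    return results
```

For d = 0 the interval collapses to a single edge distance, which is the only case where exactly one child per level is followed, recovering the O(log N) behavior.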

If we have to check m children on each level, then in total we have to check m^depth nodes! The number of elements in such a saturated tree is N = D^depth, so the depth is depth = ln(N)/ln(D). Combining these, we get m^[ln(N)/ln(D)] = exp(ln(m)*ln(N)/ln(D)) = N^[ln(m)/ln(D)] lookups. So this is clearly a power law with some fractional exponent, not a logarithmic scaling!

For a lookup to find all <= d, we have m = min(2d+1, D), resulting in a lookup complexity of N^[ln(min(2d+1,D))/ln(D)]. For D=32 (32-bit integers), we would get these scalings:

or for D=64:

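These exponents follow directly from the formula N^[ln(min(2d+1,D))/ln(D)]; a few lines of Python recompute them (the helper name `predicted_exponent` is mine, not part of pybktree):

```python3
import math

def predicted_exponent( d, D ):
    """Exponent b in the predicted O(N**b) lookup cost, using m = min(2*d + 1, D)."""
    m = min( 2 * d + 1, D )
    return math.log( m ) / math.log( D )

for D in ( 32, 64 ):
    print( f"D = {D}:", ", ".join(
        f"d<={d}: N**{predicted_exponent( d, D ):.2f}" for d in ( 1, 2, 4, 8, 16, 32 ) ) )

# D = 32: d<=1: N**0.32, d<=2: N**0.46, d<=4: N**0.63, d<=8: N**0.82, d<=16: N**1.00, d<=32: N**1.00
# D = 64: d<=1: N**0.26, d<=2: N**0.39, d<=4: N**0.53, d<=8: N**0.68, d<=16: N**0.84, d<=32: N**1.00
```
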
For d >= D/2, we would get O(N) scaling because, under the given assumptions, we would have to look at every element in the tree on average. The assumption encoded in m = min(2d+1, D) is that the distance to each visited node is exactly D/2, which on average should hold when looking up random values.

How to check this? Well, let's benchmark:

benchmark-pybktree.py

```python3
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import os
import time

import matplotlib.pyplot as plt
import numpy as np
import pybktree
import scipy.stats
import tqdm


def benchmark_pybktree( element_counts, repeat_count = 10 ):
    """
    Returns a list of triples:
     - elements
     - tree creation time in seconds
     - lookup time for one element in seconds
    """
    timings = []
    for element_count in tqdm.tqdm( element_counts ):
        timing = [ element_count ]

        runtimes = []
        for i in range( repeat_count ):
            elements = np.random.randint( np.iinfo( np.uint64 ).max, size = element_count, dtype = np.uint64 )
            t0 = time.time()
            tree = pybktree.BKTree( pybktree.hamming_distance, elements )
            t1 = time.time()
            runtimes.append( t1 - t0 )
        timing += [ np.mean( runtimes ), np.std( runtimes ) ]

        for distance in [ 0, 1, 2, 4, 8, 16 ]:
            runtimes = []
            for i in range( repeat_count ):
                elements = np.random.randint( np.iinfo( np.uint64 ).max, size = element_count, dtype = np.uint64 )
                tree = pybktree.BKTree( pybktree.hamming_distance, elements )
                t0 = time.time()
                results = tree.find( item = np.uint64( 0 ), n = distance )
                t1 = time.time()
                runtimes.append( t1 - t0 )
            timing += [ distance, np.mean( runtimes ), np.std( runtimes ) ]

        timings.append( timing )

    return timings


def fit_power_law( x, y ):
    """
    In log-log space power laws become linear laws. Consider the generic
    f(x) = a*x^b, in log-log space it would be log(f(x)) = log(a) + b*log(x)
    plotted over log(x).
    Returns the tuple (a,b), i.e., (factor, exponent).
    """
    slope, intercept, rvalue, pvalue, stderr = scipy.stats.linregress( np.log( x ), np.log( y ) )
    return np.exp( intercept ), slope


def plot_results( timings, export_name = None ):
    timings = np.array( timings )

    fig = plt.figure()
    ax = fig.add_subplot( 111,
        xlabel = "Number of Elements",
        ylabel = "Execution Time / s",
        xscale = 'log',
        yscale = 'log',
    )

    element_counts      = timings[:,0]
    creation_times_mean = timings[:,1]
    creation_times_std  = timings[:,2]
    ax.errorbar( element_counts, creation_times_mean, creation_times_std, marker = 'o', label = "Tree creation" )
    a,b = fit_power_law( element_counts, creation_times_mean )
    ax.plot( element_counts, a * element_counts**b, color = '0.5', linestyle = '--' )
    print( f"Fitted creation time to {a} N**{b}" )

    for i in range( ( timings.shape[1] - 3 ) // 3 ):
        lookup_distances  = timings[:, 3 + 3*i+0]
        lookup_times_mean = timings[:, 3 + 3*i+1]
        lookup_times_std  = timings[:, 3 + 3*i+2]
        assert np.all( lookup_distances == lookup_distances[0] )
        ax.errorbar( element_counts, lookup_times_mean, lookup_times_std, marker = 'o',
                     label = "Element lookup distance <= {}".format( lookup_distances[0] ) )
        a,b = fit_power_law( element_counts, lookup_times_mean )
        ax.plot( element_counts, a * element_counts**b, color = '0.5', linestyle = '--' )
        print( f"Fitted lookup for distance <= {lookup_distances[0]} time to {a} N**{b}" )

    ax.plot( element_counts, np.log( element_counts ) / np.log( element_counts )[1] * creation_times_mean[1],
             label = 'O(log N)', zorder = 20, color = 'k' )

    ax.legend( loc = 'best' )
    fig.tight_layout()
    if export_name:
        fig.savefig( export_name + '.pdf' )
        fig.savefig( export_name + '.png' )
    else:
        plt.show()


if __name__ == '__main__':
    dataFileName = 'pybktree-scaling.dat'
    if not os.path.exists( dataFileName ):
        timings = benchmark_pybktree( np.unique( ( 10 ** np.linspace( 0, 6, 40 ) ).astype( int ) ) )
        np.savetxt( dataFileName, timings )
    plot_results( np.genfromtxt( dataFileName ), export_name = 'pybktree-scaling' )
```

Here are the results:

(Plot: pybktree-scaling, benchmark timings vs. number of elements in log-log scale)

The plot is in log-log scale. In log-log scale, all power laws of the form f(x) = a*x^b become linear functions of the form log(a) + b*log(x). The dashed lines are fits to those linear functions, and here are the results for the fitted scaling laws:

For some reason, the exponents seem to be systematically underestimated. But it can clearly be seen that all lookups except d=0 follow power laws instead of logarithmic scaling. As can be seen in the code, the benchmark ran with 64-bit values. So, assuming you have a 64-bit hash and want to find everything with a distance <= 16, you already enter O(N) territory and the BK-tree becomes useless.

Conclusion: This is less suited than I thought for lookups over 1M+ files, and even less suited if your threshold distance is quite high, which it is in my case. I'm basically edging close to O(N) because of this.

benhoyt commented 4 years ago

This is incredible analysis -- really appreciate you doing that and posting it here. It's been a while now, but I had jumped to the O(log N) conclusion based either on something I'd read, or just simplistically thinking "oh, it's a tree, searching must be log N". But of course as you show that's incorrect.

It worked well for me because my d was small, IIRC I was using 2 or 3 for the dataset I was looking at.

I'll update the README and my related article to correct this, and link to your analysis above.

Thanks again!

benhoyt commented 4 years ago

Made some updates. Thanks again!

mxmlnkn commented 4 years ago

Cool, thank you for the fast reply and edit :). Now I only need to find some other metric tree which works sufficiently well for my huge dataset. Maybe I'll have to look for something written in C/C++, or a Python module wrapping such a C/C++ library.

benhoyt commented 4 years ago

Can you do a linear search (in C)? One million is not such a huge dataset. How fast do you need the query to be?

mxmlnkn commented 4 years ago

I didn't try a linear search in C yet; I was mostly working in Python, and until now it worked sufficiently fast, even calculating the dHashes for all the images.

Basically, I'd want to dedupe 100M images. I guess if it runs in under one month it would be fine by me, but if it runs in 5 days or so, even better. I have a 384-bit dHash (width=16, height=12, both vertical and horizontal gradients => 16*12*2 bits), and tests with 1000 images showed that the Hamming distances between differing images are distributed pretty well as a Gaussian with mu=190 and sigma=12. Hamming distances between resized versions of images and their originals followed a somewhat Gaussian distribution with mu=6 and sigma=6. So I guess it would be fine for similarity search to look for distance <= 25, but looking at the plot, searching for d <= 100 might be better to catch some of the outliers.

(Plot: dhash-384-bit-test, histogram of 384-bit dHash Hamming distances)
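
For what it's worth, a brute-force linear scan over packed 384-bit hashes is straightforward to vectorize with NumPy; a rough sketch (array layout, names, and sizes are my assumptions, not tested on the real dataset):

```python3
import numpy as np

def linear_hamming_search( hashes, query, max_distance ):
    """Brute-force range query over 384-bit hashes stored as an (N, 48) uint8 array.

    Returns the indices of all rows whose Hamming distance to query
    (a length-48 uint8 vector) is <= max_distance.
    """
    xors = np.bitwise_xor( hashes, query )                  # differing bits, bytewise
    distances = np.unpackbits( xors, axis = 1 ).sum( axis = 1 )  # popcount per row
    return np.flatnonzero( distances <= max_distance )

# Hypothetical usage with random data standing in for the real dHashes:
hashes = np.random.randint( 0, 256, size = ( 1_000_000, 48 ), dtype = np.uint8 )
query = hashes[0]
matches = linear_hamming_search( hashes, query, max_distance = 25 )
```

For 100M hashes the scan would have to run in chunks to keep the unpacked bit array small, but each query is still just one pass over the data.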