FALCONN-LIB / FALCONN

FAst Lookups of Cosine and Other Nearest Neighbors (based on fast locality-sensitive hashing)
http://falconn-lib.org/
MIT License
1.14k stars 194 forks source link

How to return more points? #104

Open Dronablo opened 6 years ago

Dronablo commented 6 years ago

Hi!

Let's start with the code:

import falconn
import numpy as np

params_cp = falconn.LSHConstructionParameters()
params_cp.dimension = 300
params_cp.lsh_family = falconn.LSHFamily.CrossPolytope
params_cp.distance_function = falconn.DistanceFunction.EuclideanSquared
params_cp.l = 50
params_cp.num_rotations = 2
params_cp.seed = 5721840

params_cp.num_setup_threads = 1
params_cp.storage_hash_table = falconn.StorageHashTable.BitPackedFlatHashTable
# 2**17 <= N_SAMPLES
falconn.compute_number_of_hash_functions(17, params_cp)

table = falconn.LSHIndex(params_cp)

qdata = np.zeros((1000, 300), dtype=np.float32)

for en, i in enumerate(range(0, 1000, 100)):
    qdata[i:i+100][:,en*30:(en+1)*30] = 1

# qdata[0]
# qdata[200]

table.setup(qdata)

centroid_index = table.construct_query_object()
centroid_index.set_num_probes(50)

print(len(centroid_index.find_k_nearest_neighbors(qdata[0], k=200)))

print(len(centroid_index.find_near_neighbors(qdata[0], threshold=10e10)))

So, we have 1000 points in 300-dimensional space, grouped in 10 extremely dense areas (in the example above - grouped into dots). When I make a query to get k nearest neigbours, I would never get more points, than in nearest dense area. So, when I ask for k<100 it's ok, but if I want k=200, or k=1000 falconn returns only 100 points.

Is there any way to configure params in such a way, so above script would print

200
1000

?

Thank you!

ludwigschmidt commented 6 years ago

Have you tried increasing the number of probes?

Dronablo commented 6 years ago

Sure!

Here's little bit simpler code, goal is to get more than 3 results:

import falconn
import numpy as np

params_cp = falconn.LSHConstructionParameters()
params_cp.dimension = 6
params_cp.lsh_family = falconn.LSHFamily.CrossPolytope
params_cp.distance_function = falconn.DistanceFunction.EuclideanSquared
params_cp.l = 50
params_cp.num_rotations = 3
params_cp.seed = 5721840

params_cp.num_setup_threads = 1
params_cp.storage_hash_table = falconn.StorageHashTable.BitPackedFlatHashTable
# 2**17 <= N_SAMPLES
falconn.compute_number_of_hash_functions(17, params_cp)

table = falconn.LSHIndex(params_cp)

qdata = np.array([[1,1,1,1,0,0],
                  [1,2,1,2,0,0],
                  [2,1,2,1,0,0],
                  [0,0,0,1,1,1],
                  [0,0,0,1,2,2],
                 ], dtype=np.float32)

table.setup(qdata)

centroid_index = table.construct_query_object()
centroid_index.set_num_probes(5000)

print(len(centroid_index.find_k_nearest_neighbors(qdata[0], k=200)))
print(len(centroid_index.find_near_neighbors(qdata[0], threshold=10e10)))

Also tried to play with storage_hash_table, num_rotations, l, k - nothing changed.