Open tverho opened 3 years ago
Have you tried varying the values for M and ef?
Edit: just saw that it was actually the case.
How does classic kNN perform on this dataset? HNSW's tests include a compatibility requirement, i.e. the order of the results should match.
I modified the comparison test case found in the tests folder to use my data set. The k=1 tests fail, but the k=10 and k=20 tests pass, meaning that on average more than 90% of the actual nearest neighbors are found. However, when I add a condition to check that at least one actual nearest neighbor is found for every query, it fails. The same condition passes in the original test, which uses randomized data.
I guess the challenge in my data set is that the points are not uniformly distributed: there are large empty areas.
Here's the test case:
using NearestNeighbors
using HNSW
using StaticArrays
using Statistics
using DelimitedFiles
using Test

@testset "Compare To NearestNeighbors.jl with nonuniform data" begin
    dim = 3
    num_queries = 1000
    # Load the points and convert each row to a static vector
    data = readdlm("points.csv")
    data = [SVector{dim}(data[i,:]) for i in 1:size(data,1)]
    cell = Float32[200, 200, 400]
    num_elements = length(data)
    # Exact reference results come from a KD-tree
    tree = KDTree(data)
    # Random query points drawn uniformly within the cell
    queries = [SVector{dim}(rand(Float32, dim).*cell) for n ∈ 1:num_queries]

    @testset "M=$M, K=1" for M ∈ [5, 10]
        k = 1
        efConstruction = 20
        ef = 20
        realidxs, realdists = knn(tree, queries, k)

        hnsw = HierarchicalNSW(data; efConstruction=efConstruction, M=M, ef=ef)
        add_to_graph!(hnsw)
        idxs, dists = knn_search(hnsw, queries, k)

        # Mean fraction of true neighbors recovered per query
        ratio = mean(map(idxs, realidxs) do i,j
            length(i ∩ j) / k
        end)
        @test ratio > 0.99
    end

    @testset "Large K, low M=$M" for M ∈ [5,10]
        efConstruction = 100
        ef = 50
        hnsw = HierarchicalNSW(data; efConstruction=efConstruction, M=M, ef=ef)
        add_to_graph!(hnsw)
        @testset "K=$K" for K ∈ [10,20]
            realidxs, realdists = knn(tree, queries, K)
            idxs, dists = knn_search(hnsw, queries, K)
            ratios = map(idxs, realidxs) do i,j
                length(i ∩ j) / K
            end
            ratio = mean(ratios)
            @test ratio > 0.9
            # Added condition: every query must recover at least one true neighbor
            @test all(ratios .> 0.0)
        end
    end

    @testset "Low Recall Test" begin
        k = 1
        efConstruction = 20
        M = 5
        hnsw = HierarchicalNSW(data; efConstruction=efConstruction, M=M)
        # The callback receives each inserted index; their sum checks that all points were added
        check_counter = 0
        add_to_graph!(hnsw) do i
            check_counter += i
        end
        @test check_counter == (1 + num_elements) * num_elements ÷ 2

        set_ef!(hnsw, 2)
        realidxs, realdists = knn(tree, queries, k)
        idxs, dists = knn_search(hnsw, queries, k)

        recall = mean(map(idxs, realidxs) do i,j
            length(i ∩ j) / k
        end)
        @test recall > 0.6
    end
end
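The gap between the mean-recall check and the per-query check in the test case above can be reproduced without any index at all. The following is a minimal sketch with made-up index lists (the helper name per_query_recall and the data are hypothetical, not part of HNSW.jl or NearestNeighbors.jl): a single query that misses every true neighbor barely moves the mean, yet fails the "at least one hit" condition.

```julia
using Statistics

# Fraction of the true neighbors that appear in the approximate result, per query.
# `approx` and `exact` are vectors of index vectors, one entry per query.
function per_query_recall(approx, exact)
    map(approx, exact) do a, e
        length(a ∩ e) / length(e)
    end
end

# Hypothetical results for 20 queries with k = 10:
# 19 queries are answered perfectly, the last one misses every true neighbor.
exact  = [collect(1:10) for _ in 1:20]
approx = [q == 20 ? collect(101:110) : collect(1:10) for q in 1:20]

recalls     = per_query_recall(approx, exact)
mean_recall = mean(recalls)          # average over all queries
all_hit     = all(recalls .> 0.0)    # does every query find at least one true neighbor?
```

Here the mean recall stays above the 0.9 threshold while all_hit is false, which is the same pattern as the k=10 and k=20 results described above.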
I replaced PeriodicEuclidean(cell) with Euclidean(), and nothing was output.
Actually, PeriodicEuclidean is not a distance.
I'm trying to replace my regular grid search with HNSW, but HNSW seems to fail rather spectacularly at finding the nearest neighbor in some cases. I understand that it's an approximate method, but it's a bit underwhelming when I ask for 50 nearest neighbors and the closest of them is 3x farther away than the actual nearest neighbor. What I'd really want is to get, say, 10 neighbors and be fairly certain that at least ~5 of the actual nearest neighbors are included.
Am I doing something wrong or is HNSW the wrong method for my need?
Below is a minimal example with my data (data file attached; the points are nodes of a surface mesh). I've tried playing with the parameters of HierarchicalNSW, but they don't seem to have much effect.
Output:
points.csv
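The two symptoms described in this post (the closest returned point being several times farther than the true nearest neighbor, and too few true neighbors among the returned ones) can be measured against an exhaustive search. This is only a sketch: the function names brute_force_knn, nearest_distance_ratio and true_hits are hypothetical helpers, plain Euclidean distance is assumed, and the four points are made up for illustration. It should work the same way with SVector points.

```julia
using LinearAlgebra

# Exact k nearest neighbors by exhaustive search (one distance per point).
# Returns indices and distances sorted by increasing distance.
function brute_force_knn(points, query, k)
    dists = [norm(p - query) for p in points]
    order = sortperm(dists)[1:k]
    return order, dists[order]
end

# How much farther the closest returned point is than the true nearest neighbor.
# A value of 1.0 means the true nearest neighbor is among the returned indices.
function nearest_distance_ratio(points, query, approx_idxs)
    _, exact_d = brute_force_knn(points, query, 1)
    closest_returned = minimum(norm(points[i] - query) for i in approx_idxs)
    return closest_returned / exact_d[1]
end

# Number of the true k nearest neighbors present in the approximate result.
true_hits(points, query, approx_idxs, k) =
    length(approx_idxs ∩ first(brute_force_knn(points, query, k)))

# Made-up example: distances from the query are 1, 2, 3 and 5.
points = [[0.0, 0.0, 0.0], [0.0, 0.0, 3.0], [0.0, 0.0, 4.0], [0.0, 0.0, 6.0]]
query  = [0.0, 0.0, 1.0]
approx = [3, 4]   # hypothetical output of an approximate search

ratio = nearest_distance_ratio(points, query, approx)
hits  = true_hits(points, query, approx, 2)
```

Running the same two measurements over all queries, with the indices returned by knn_search in place of the made-up approx, would show how often and how badly the approximate result deviates from the exact one.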