KristofferC / NearestNeighbors.jl

High performance nearest neighbor data structures (KDTree and BallTree) and algorithms for Julia.

Compilation time issues with very high dimensions #156

Closed jbrea closed 2 months ago

jbrea commented 1 year ago

Not that it would make a lot of sense to use KNN in high dimensions... but for demonstration purposes I was running KNN in high dimensions and ran into excessively long compilation times. To illustrate this I compared it to a very simple custom implementation of KNN.

using Distances, NearestNeighbors
function knn_nearest_neighbors(data, points, k)
    kdtree = KDTree(data)
    knn(kdtree, points, k, true)[1]
end
function simple_knn(data, points, k)
    distances = pairwise(Euclidean(), data, points)
    [partialsortperm(col, 1:k) for col in eachcol(distances)]
end
function generate_data(d, f = 10)
    (data = randn(d, f*d),
     points = randn(d, f*d))
end
function time(d; f = 10, k = 5)
    data = generate_data(d, f)
    t1 = @elapsed res1 = knn_nearest_neighbors(data..., k)  # cold call: includes compilation
    t2 = @elapsed knn_nearest_neighbors(data..., k)         # warm call: already compiled
    t3 = @elapsed res2 = simple_knn(data..., k)
    @assert res1 == res2
    (t1, t2, t3)
end
ds = [3, 10, 50, 100, 500, 1000]
ds2 = ds[1:end-1] .+ 1

julia> result1 = [time(d) for d in ds]
6-element Vector{Tuple{Float64, Float64, Float64}}:
 (0.242508554, 3.73e-5, 0.050278772)
 (0.267086547, 0.00012804, 0.000169708)
 (0.370471466, 0.009566, 0.015493307)
 (0.641421419, 0.028077737, 0.021804943)
 (11.345050814, 6.748174605, 0.62233861)
 (77.709297929, 61.20434365, 2.881035022)

julia> result2 = [time(d, f = 30) for d in ds2]
5-element Vector{Tuple{Float64, Float64, Float64}}:
 (0.22667584, 0.000104276, 0.000274484)
 (0.2518278, 0.000852078, 0.002162456)
 (0.536475271, 0.096408936, 0.057618148)
 (0.996733694, 0.329291918, 0.221131161)
 (61.910340924, 56.781440873, 4.862777863)

[figure: runtime of knn vs simple_knn over dimension; the blue point at nd = 3 includes the compilation time of simple_knn]

simple_knn already outperforms knn at nd = 100, and for nd > 1000 the time it takes to run knn increases drastically.
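For reference, NearestNeighbors.jl also ships a BruteTree, which skips tree construction and does a linear scan; a minimal sketch of using it in place of KDTree (small sizes here are illustrative, and whether BruteTree also avoids the StaticArrays-related compile-time cost at very high dimension is an assumption worth verifying, since the input is still wrapped in static vectors):

```julia
using NearestNeighbors

# BruteTree performs an exhaustive linear scan, so "construction" is trivial
# and there is no tree traversal code to specialize per dimension.
data   = randn(50, 500)   # 500 points in 50 dimensions
points = randn(50, 20)    # 20 query points
tree   = BruteTree(data)
idxs, dists = knn(tree, points, 5, true)  # 5 sorted neighbors per query
```
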

Would it make sense to adapt the claim "perform high performance nearest neighbor searches in arbitrarily high dimensions."?

KristofferC commented 1 year ago

NearestNeighbors uses StaticArrays, which is built on top of tuples, and tuples have compilation-time issues when they get very big.

> adapt the claim "perform high performance nearest neighbor searches in arbitrarily high dimensions."?

I removed this note in the README. We could add some blurb warning about this scenario but I am not sure it is worth doing something about.
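The tuple-specialization cost described above can be seen in isolation; a rough sketch (the chosen lengths are illustrative, and @elapsed on a first call is only a crude proxy for compilation time):

```julia
using StaticArrays

# The first use of an SVector of a given length forces the compiler to
# specialize on an NTuple of that length; this first-call latency grows
# with the length. Absolute numbers are machine-dependent.
for n in (3, 100, 1000)
    t = @elapsed sum(SVector{n}(ntuple(_ -> 1.0, n)))
    println("n = $n: first call took $(round(t, digits = 4)) s")
end
```
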

nstiurca commented 1 year ago

I'm running into this issue too. I've got ~1600-dimensional features. There's not really a good reason to use StaticArrays past a certain size, IMO.

Using 1000 dimensions, compilation times aren't too bad. The cutover for me is somewhere between that and 1600. My guess would be that it's around 1024, which for 32-bit floats means an SVector would be larger than a 4 KiB page, and that's where things get wonky. Honestly, there's probably no performance benefit to static arrays beyond something like 64 dimensions.
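One way to probe the crossover being guessed at here is to time the same distance kernel on static and dynamic vectors at a few dimensions; a minimal sketch (the probed dimensions and iteration count are arbitrary choices, and results are machine-dependent):

```julia
using StaticArrays

# Rough probe: does an SVector still beat a plain Vector for a squared
# Euclidean distance kernel at dimension d?
function probe(d; iters = 10_000)
    a, b   = randn(d), randn(d)
    sa, sb = SVector{d}(a), SVector{d}(b)
    sum(abs2, a .- b)    # warm up the dynamic path
    sum(abs2, sa .- sb)  # warm up (and compile) the static path for this d
    t_dyn = @elapsed for _ in 1:iters; sum(abs2, a .- b); end
    t_sta = @elapsed for _ in 1:iters; sum(abs2, sa .- sb); end
    (d = d, dynamic = t_dyn, static = t_sta)
end

probe(8)    # static usually wins at small d
probe(128)  # the gap typically narrows or reverses as d grows
```
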

KristofferC commented 1 year ago

> I'm running into this issue too. I've got ~1600-dimensional features. There's not really a good reason to use StaticArrays past a certain size, IMO.

True, but there is probably no reason to use this package for dimensions of that size either, since a tree won't really accelerate your search.