Closed. c42f closed this issue 8 years ago.
That's interesting. I was worried that only benchmarking on random numbers would come back to bite me. I was going to suggest moving to https://github.com/KristofferC/NearestNeighbors.jl, where I actually removed the whole full_rec_dim deal since it led to more confusing code for not much benefit.
I now see that I have an intermittent test failure in NearestNeighbors.jl now, see: https://ci.appveyor.com/project/KristofferC/nearestneighbors-jl/build/job/uxdqeus8utx7nsiq
I am pretty sure that is with the Chebyshev norm though, so if you use the Euclidean norm it should be fine. I'll try to find that bug now...
Ok, the test failures actually only happen when using the Cityblock() metric...
Ah right, I wasn't aware of NearestNeighbors.jl; I'm happy to use it instead. I assume KDTrees.jl will be deprecated at some stage soon?
I found a good way to debug what's happening is to measure the branching factor of the search algorithm. Here's output from the same test as above, with a lightly instrumented version of KDTrees:
full_rec_dim = 0
Time for query with aligned clouds
0.151551 seconds (2.27 M allocations: 56.712 MB, 2.29% gc time)
Branch factor = 1.0504652858643764
Time for query with misaligned clouds
0.147663 seconds (2.31 M allocations: 57.310 MB, 1.99% gc time)
Branch factor = 1.0611536573003453
full_rec_dim = 6
Time for query with aligned clouds
0.101380 seconds (2.27 M allocations: 56.801 MB, 2.30% gc time)
Branch factor = 1.0523643105433726
Time for query with misaligned clouds
4.556464 seconds (79.37 M allocations: 1.204 GB, 1.08% gc time)
Branch factor = 1.866922484169002
Here I'm measuring the branching factor as (near_subtree_count + far_subtree_count) / near_subtree_count, where I count the number of times we recurse into each subtree.
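The measurement above can be sketched in a few lines; the counter names here are illustrative, not part of KDTrees.jl:

```julia
# Sketch of the branching-factor measurement described above. near_count and
# far_count would be incremented each time the search recurses into the near
# or far subtree of a node, respectively.
branch_factor(near_count, far_count) = (near_count + far_count) / near_count

# A well-behaved query mostly stays in the near subtree, giving a value close
# to 1; the misaligned case above pushes it towards 2 (every node visits both
# children).
```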
Yes, my plan is to deprecate KDTrees.jl and point to NearestNeighbors.jl shortly.
In NearestNeighbors.jl I have some (undocumented) statistics methods, see: https://github.com/KristofferC/NearestNeighbors.jl/blob/master/src/debugging.jl. Basically, if you set the environment variable NN_DEBUG before using NearestNeighbors, it will activate some statistics macros that I put into various functions. Without the NN_DEBUG variable these macros are no-ops.
I currently do some basic counting on how many nodes are visited but it could be expanded to give much richer statistics.
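A minimal sketch of how such debug-gated statistics macros can work (the flag, macro, and counter names here are illustrative, not the actual NearestNeighbors.jl code):

```julia
# Hypothetical sketch: a statistics macro that compiles away when debugging
# is off. NN_DEBUG, @inc_visited, and visited_nodes are made-up names.
const NN_DEBUG = haskey(ENV, "NN_DEBUG")

visited_nodes = 0  # global counter, only ever touched in debug builds

macro inc_visited()
    if NN_DEBUG
        # In a debug build the macro expands to a counter increment...
        return esc(:(global visited_nodes += 1))
    else
        # ...otherwise it expands to nothing and has zero runtime cost.
        return :(nothing)
    end
end

# Stand-in for a search routine that visits n tree nodes.
function visit_all(n)
    for _ in 1:n
        @inc_visited
    end
end
```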
Finally found that bug... https://github.com/KristofferC/NearestNeighbors.jl/commit/76a5c4bdd2cd0086267f74ab3d28852fb9860e82 That took waaay too long.
I tried your code on NearestNeighbors with
using NearestNeighbors

# Find the nearest neighbour in ref_tree for each point in query_points
function matchall(query_points, ref_tree)
    match_inds = NearestNeighbors.knn(ref_tree, query_points, 1)
end
N = 100000
# Reference point cloud: random points in the plane
ref_points = zeros(3,N)
ref_points[1:2,:] = rand(2,N)
ref_tree = NearestNeighbors.KDTree(ref_points)
# Random point cloud aligned spatially with ref_points
aligned_points = zeros(3,N)
aligned_points[1:2,:] = rand(2,N)
# Point cloud misaligned by adding some translation vector
misaligned_points = aligned_points .+ [0.1, 0.1, 0.1]
matchall(aligned_points, ref_tree) # warm up jit
println("Time for query with aligned clouds")
@time matchall(aligned_points, ref_tree)
println("Time for query with misaligned clouds")
@time matchall(misaligned_points, ref_tree);
and get:
Time for query with aligned clouds
0.073362 seconds (300.01 k allocations: 19.837 MB, 5.76% gc time)
Time for query with misaligned clouds
0.069745 seconds (300.01 k allocations: 19.837 MB, 7.35% gc time)
compared to KDTrees.jl:
full_rec_dim = 0
Time for query with aligned clouds
0.188462 seconds (698.98 k allocations: 32.791 MB, 19.04% gc time)
Time for query with misaligned clouds
0.137693 seconds (698.98 k allocations: 32.791 MB, 1.89% gc time)
full_rec_dim = 6
Time for query with aligned clouds
0.073093 seconds (698.98 k allocations: 32.791 MB, 3.36% gc time)
Time for query with misaligned clouds
3.126432 seconds (698.98 k allocations: 32.791 MB, 0.04% gc time)
"if you set the environment variable NN_DEBUG before using NearestNeighbors then it will activate some statistics macros"
Nice, stats like that are exactly what I needed for chasing the KDTrees performance weirdness I was seeing. I'll move over to NearestNeighbors.jl very soon, thanks.
Maybe the thing to come out of this issue is that performance tuning is subtle, and it would be worth having some test cases simulating various nasty things which occur in real data sets. So far I've seen the following:
Those performance numbers with NearestNeighbors.jl look really promising. When do you anticipate tagging the first public version? Oh, I see... you only released the first version a few days ago, no wonder I didn't know about it.
What's the relationship with the other package of the same name, https://github.com/johnmyleswhite/NearestNeighbors.jl ? It seems the implementations are very different under the hood.
First public release got tagged, but you might want to be on master anyway. I've now changed the way the debug flag is activated, from an environment variable to a global variable in the main package file. The environment variable approach didn't work well with package precompilation because it was conditionally evaling stuff. I put up some small documentation about it too.
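The global-variable approach can be sketched roughly like this (the constant and macro names are illustrative; the actual names in NearestNeighbors.jl may differ). Because the flag is an ordinary constant in the package source, the macro expansions are fixed at precompile time, with no load-time conditional eval:

```julia
# Hypothetical sketch: a debug flag as a plain constant in the package file.
# Flip it and reload the package to switch modes.
const DEBUG = false

# Expands to the statistics code when DEBUG is true, and to nothing
# (zero runtime cost) when it is false.
macro record_stat(ex)
    DEBUG ? esc(ex) : nothing
end

nodes_visited = 0
@record_stat global nodes_visited += 1  # a no-op while DEBUG == false
```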
It would be really good to have some proper benchmarks for a variety of (standardized?) data sets and compare it to a well used, respected NN package.
I think the difference between my package and the other one is that mine is a bit more feature-complete, and a bit more care has gone into designing the data structures to get good performance and reduce memory usage.
The other package:
julia> @time KDTree(data)
3.142755 seconds (27.42 M allocations: 601.563 MB, 25.09% gc time)
julia> @time for i = 1:10^4 nearest(tree, rand(5), 5) end
1.862179 seconds (238.59 k allocations: 8.371 MB)
Mine:
julia> @time KDTree(data);
0.601531 seconds (100.02 k allocations: 59.510 MB, 9.61% gc time)
julia> @time knn(tree, rand(5, 10^4), 5);
0.138314 seconds (30.02 k allocations: 2.976 MB)
Yes, I saw quite a bit of care went into the layout of the data structures in KDTrees, it's hard to imagine getting much more memory efficient there. The numbers speak for themselves!
A C++ implementation that I like is nanoflann.hpp; it might be worth looking at as a comparison point. I don't really know whether it's among the most efficient knn libraries around, but I've used it and the implementation looks really solid to me. It does most of the obvious efficiency tricks I'd consider in C++ (type specialization based on the metric and optionally the vector length; a pooled tree node allocator).
In KDTrees I realized I actually wasted some memory by saving the full hyperrectangle for each node instead of just the high and low values for the splitting dimension. I changed that in NearestNeighbors.jl, so I only store the full hyperrectangle for the original point cloud, and each tree node just has a KDNode type: https://github.com/KristofferC/NearestNeighbors.jl/blob/master/src/kd_tree.jl#L1 which saves the high, low, and split values together with the split dimension. I hoped it would help a bit with performance, since it avoids pulling a bunch of unnecessary values into the cache line, but in practice it didn't seem to make much difference.
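A rough sketch of that compact node layout (the field names here are illustrative; see the linked kd_tree.jl for the real KDNode definition):

```julia
# Hypothetical sketch of a compact k-d tree node. Instead of a full
# hyperrectangle (2 coordinates per dimension) per node, store only the
# bounds and split value for the node's splitting dimension.
struct KDNode{T}
    lo::T             # low bound of the node's region in the split dimension
    hi::T             # high bound of the node's region in the split dimension
    split_val::T      # coordinate at which the node's points are split
    split_dim::Int16  # index of the split dimension (small int to save space)
end

# For a 3-D tree this is 3 floats plus one small int per node, versus the
# 6 floats a full hyperrectangle would need.
node = KDNode(0.0, 1.0, 0.42, Int16(2))
```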
My main problem right now with regard to performance is that I don't actually know what the bottleneck is and what I should focus on optimizing. Is it evaluating the distance function? Is the CPU stalling while waiting for memory? Is there significant overhead in unpacking the values from the immutables in each recursive call, or does the compiler realize that those values are constant, etc.?
Some time ago I tried some profiling by building Julia with Intel's compiler and then using VTune. It's quite powerful because you can see a bunch of profiling data taken straight from the CPU. I should probably try that again.
Thanks for the link to nanoflann, I hadn't seen it before. Running something like
cloud = rand(3, 10^6)
tree = KDTree(cloud)
(@elapsed knn(tree, rand(3, 10^6), 1)) / 10^6
# 1.939144678e-6
it seems I do about one search per 2 microseconds for a 10^6-point 3D cloud, which seems to be in the same ballpark as their figures at https://github.com/jlblancoc/nanoflann#32-benchmark-original-flann-vs-nanoflann
When Base threading support gets mature enough, it would be cool to add some threading to the tree building and the searching. The tree builder shouldn't be too hard to thread, I think (just do a few iterations on one thread and then throw each subtree to a different thread).
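That build strategy (split serially for the first few levels, then hand each subtree to its own task) could look roughly like this, using the task-based threading API from later Julia versions; build_subtree is a placeholder for the real recursive node builder:

```julia
# Hypothetical sketch of a threaded k-d tree build. Threads.@spawn requires
# Julia 1.3+; at the time of this discussion Base threading was still
# experimental, which is why this is only a sketch.
using Base.Threads: @spawn

build_subtree(points) = length(points)  # placeholder "work": count the points

function threaded_build(points; serial_levels = 2)
    if serial_levels == 0 || length(points) < 2
        return build_subtree(points)  # small or deep enough: stay serial
    end
    mid = length(points) ÷ 2
    # Do the split on this thread, then hand one half to another task.
    left  = @spawn threaded_build(points[1:mid]; serial_levels = serial_levels - 1)
    right = threaded_build(points[mid+1:end]; serial_levels = serial_levels - 1)
    return fetch(left) + right
end
```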
If you have any other ideas or suggestions please let me know :)
Maybe you could get something useful from Linux perf events if you don't want to mess with ICC and VTune? Last time I looked briefly, perf seemed like it might be very impressive, but also with quite a learning curve. Maybe the tools have improved since then, though.
Maybe I should move some of this to a "performance benchmarking" issue in NearestNeighbors.jl and close this one?
By the way, I think you forgot to push your tag to NearestNeighbors.jl? I see it registered in METADATA.jl (thanks!) but the tag isn't public. I've moved my ICP code to NearestNeighbors and it's definitely better behaved for outliers now.
Thanks for telling me about the missing tag. I hopefully fixed it now.
I had some trouble getting VTune to work well again, but I didn't spend that much time on it. If I can't get it to work I will look at perf, but I remember VTune having a quite nice and intuitive interface.
Like you say, let's move further discussion to NearestNeighbors.jl
Closing in favor of KristofferC/NearestNeighbors.jl#7 . Thanks for all the information!
I'm using KDTrees for implementing an iterative closest point algorithm for point cloud matching. knn with k=1 works very well for this, but I've found that there can be a large performance hit when the two point clouds don't sample the same part of the geometry. Part of the problem seems to be that the low-dimensional approximation turned on with full_rec_dim isn't a good tradeoff in this case. Here's a test case:

Example output: