Closed. c42f closed this issue 8 years ago.
That's interesting. I was worried that only benchmarking on random numbers would come back to bite me. I was going to suggest moving to https://github.com/KristofferC/NearestNeighbors.jl, where I actually removed the whole full_rec_dim deal since it led to more confusing code for not much benefit.
I now see that I have an intermittent test failure in NearestNeighbors.jl now, see: https://ci.appveyor.com/project/KristofferC/nearestneighbors-jl/build/job/uxdqeus8utx7nsiq
I am pretty sure that is with the Chebyshev norm though, so if you use the Euclidean norm it should be fine. I'll try to find that bug now...
Ok, the test failures actually only happen when using the Cityblock() metric...
Ah right, I wasn't aware of NearestNeighbors.jl; I'm happy to use it instead. I assume KDTrees.jl will be deprecated at some stage soon?
I found a good way to debug what's happening is to measure the branching factor of the search algorithm. Here's output from the same test as above, with a lightly instrumented version of KDTrees:
full_rec_dim = 0
Time for query with aligned clouds
0.151551 seconds (2.27 M allocations: 56.712 MB, 2.29% gc time)
Branch factor = 1.0504652858643764
Time for query with misaligned clouds
0.147663 seconds (2.31 M allocations: 57.310 MB, 1.99% gc time)
Branch factor = 1.0611536573003453
full_rec_dim = 6
Time for query with aligned clouds
0.101380 seconds (2.27 M allocations: 56.801 MB, 2.30% gc time)
Branch factor = 1.0523643105433726
Time for query with misaligned clouds
4.556464 seconds (79.37 M allocations: 1.204 GB, 1.08% gc time)
Branch factor = 1.866922484169002
Here I'm measuring the branching factor as (near_subtree_count + far_subtree_count) / near_subtree_count, where I count the number of times we recurse into each subtree.
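The measurement above can be sketched in a few lines; the counter names here are illustrative, not part of KDTrees.jl:

```julia
# Sketch of the branching-factor measurement described above. near_count and
# far_count would be incremented each time the search recurses into the near
# or far subtree of a node, respectively.
branch_factor(near_count, far_count) = (near_count + far_count) / near_count

# A well-behaved query mostly stays in the near subtree, giving a value close
# to 1; the misaligned case above pushes it towards 2 (every node visits both
# children).
```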
Yes, my plan is to deprecate KDTrees.jl and point to NearestNeighbors.jl shortly.
In NearestNeighbors.jl I have some (undocumented) statistics methods, see: https://github.com/KristofferC/NearestNeighbors.jl/blob/master/src/debugging.jl. Basically, if you set the environment variable NN_DEBUG before using NearestNeighbors, it will activate some statistics macros that I put into various functions. Without the NN_DEBUG variable these macros are no-ops.
I currently do some basic counting on how many nodes are visited but it could be expanded to give much richer statistics.
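A minimal sketch of how such debug-gated statistics macros can work (the flag, macro, and counter names here are illustrative, not the actual NearestNeighbors.jl code):

```julia
# Hypothetical sketch: a statistics macro that compiles away when debugging
# is off. NN_DEBUG, @inc_visited, and visited_nodes are made-up names.
const NN_DEBUG = haskey(ENV, "NN_DEBUG")

visited_nodes = 0  # global counter, only ever touched in debug builds

macro inc_visited()
    if NN_DEBUG
        # In a debug build the macro expands to a counter increment...
        return esc(:(global visited_nodes += 1))
    else
        # ...otherwise it expands to nothing and has zero runtime cost.
        return :(nothing)
    end
end

# Stand-in for a search routine that visits n tree nodes.
function visit_all(n)
    for _ in 1:n
        @inc_visited
    end
end
```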
Finally found that bug... https://github.com/KristofferC/NearestNeighbors.jl/commit/76a5c4bdd2cd0086267f74ab3d28852fb9860e82 That took waaay too long.
I tried your code on NearestNeighbors with
using NearestNeighbors

# Find the nearest neighbour in ref_tree for each point in query_points
function matchall(query_points, ref_tree)
    match_inds = NearestNeighbors.knn(ref_tree, query_points, 1)
end
N = 100000
# Reference point cloud: random points in the plane
ref_points = zeros(3,N)
ref_points[1:2,:] = rand(2,N)
ref_tree = NearestNeighbors.KDTree(ref_points)
# Random point cloud aligned spatially with ref_points
aligned_points = zeros(3,N)
aligned_points[1:2,:] = rand(2,N)
# Point cloud misaligned by adding some translation vector
misaligned_points = aligned_points .+ [0.1, 0.1, 0.1]
matchall(aligned_points, ref_tree) # warm up jit
println("Time for query with aligned clouds")
@time matchall(aligned_points, ref_tree)
println("Time for query with misaligned clouds")
@time matchall(misaligned_points, ref_tree);
and get:
Time for query with aligned clouds
0.073362 seconds (300.01 k allocations: 19.837 MB, 5.76% gc time)
Time for query with misaligned clouds
0.069745 seconds (300.01 k allocations: 19.837 MB, 7.35% gc time)
compared to KDTrees.jl:
full_rec_dim = 0
Time for query with aligned clouds
0.188462 seconds (698.98 k allocations: 32.791 MB, 19.04% gc time)
Time for query with misaligned clouds
0.137693 seconds (698.98 k allocations: 32.791 MB, 1.89% gc time)
full_rec_dim = 6
Time for query with aligned clouds
0.073093 seconds (698.98 k allocations: 32.791 MB, 3.36% gc time)
Time for query with misaligned clouds
3.126432 seconds (698.98 k allocations: 32.791 MB, 0.04% gc time)
"if you set the environment variable NN_DEBUG before using NearestNeighbors then it will activate some statistics macros"
Nice, stats like that are exactly what I needed for chasing the KDTrees performance weirdness I was seeing. I'll move over to NearestNeighbors.jl very soon, thanks.
Maybe the thing to come out of this issue is that performance tuning is subtle, and it would be worth having some test cases simulating various nasty things which occur in real data sets. So far I've seen the following:
Those performance numbers with NearestNeighbors.jl look really promising. When do you anticipate tagging the first public version? Oh, I see... you only released the first version a few days ago, no wonder I didn't know about it.
What's the relationship with the other package of the same name, https://github.com/johnmyleswhite/NearestNeighbors.jl ? It seems the implementations are very different under the hood.
First public release got tagged, but you might want to be on master anyway. I've now changed the way the debug flag is activated, from an environment variable to a global variable in the main package file. The environment variable approach didn't work well with package precompilation because it was conditionally evaling stuff. I put up some small documentation about it too.
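The global-variable approach can be sketched roughly like this (the constant and macro names are illustrative; the actual names in NearestNeighbors.jl may differ). Because the flag is an ordinary constant in the package source, the macro expansions are fixed at precompile time, with no load-time conditional eval:

```julia
# Hypothetical sketch: a debug flag as a plain constant in the package file.
# Flip it and reload the package to switch modes.
const DEBUG = false

# Expands to the statistics code when DEBUG is true, and to nothing
# (zero runtime cost) when it is false.
macro record_stat(ex)
    DEBUG ? esc(ex) : nothing
end

nodes_visited = 0
@record_stat global nodes_visited += 1  # a no-op while DEBUG == false
```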
It would be really good to have some proper benchmarks for a variety of (standardized?) data sets and compare it to a well used, respected NN package.
I think the difference between my package and the other one is that mine is a bit more feature-complete, and a bit more care has gone into designing the data structures to get good performance and reduce memory usage.
The other package:
julia> @time KDTree(data)
3.142755 seconds (27.42 M allocations: 601.563 MB, 25.09% gc time)
julia> @time for i = 1:10^4 nearest(tree, rand(5), 5) end
1.862179 seconds (238.59 k allocations: 8.371 MB)
Mine:
julia> @time KDTree(data);
0.601531 seconds (100.02 k allocations: 59.510 MB, 9.61% gc time)
julia> @time knn(tree, rand(5, 10^4), 5);
0.138314 seconds (30.02 k allocations: 2.976 MB)
Yes, I saw quite a bit of care went into the layout of the data structures in KDTrees, it's hard to imagine getting much more memory efficient there. The numbers speak for themselves!
A C++ implementation that I like is nanoflann.hpp; it might be worth looking at as a comparison point. I don't really know whether it's among the most efficient knn libraries around, but I've used it and the implementation looks really solid to me. It does most of the obvious efficiency tricks I'd consider in C++ (type specialization based on the metric and optionally the vector length; a pooled tree node allocator).
In KDTrees I realized I actually wasted some memory by saving the full hyperrectangle for each node instead of just the high and low values for the splitting dimension. I changed that in NearestNeighbors.jl, so I only store the full hyperrectangle for the original point cloud, and each tree node just has a KDNode type: https://github.com/KristofferC/NearestNeighbors.jl/blob/master/src/kd_tree.jl#L1 which saves the high, low, and split values together with the split dimension. I hoped it would help a bit with performance, since it avoids pulling a bunch of unnecessary values into the cache line, but in practice it didn't seem to make much difference.
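A rough sketch of that compact node layout (the field names here are illustrative; see the linked kd_tree.jl for the real KDNode definition):

```julia
# Hypothetical sketch of a compact k-d tree node. Instead of a full
# hyperrectangle (2 coordinates per dimension) per node, store only the
# bounds and split value for the node's splitting dimension.
struct KDNode{T}
    lo::T             # low bound of the node's region in the split dimension
    hi::T             # high bound of the node's region in the split dimension
    split_val::T      # coordinate at which the node's points are split
    split_dim::Int16  # index of the split dimension (small int to save space)
end

# For a 3-D tree this is 3 floats plus one small int per node, versus the
# 6 floats a full hyperrectangle would need.
node = KDNode(0.0, 1.0, 0.42, Int16(2))
```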
My main problem right now with regard to performance is that I don't actually know what the bottleneck is and what I should focus on optimizing. Is it evaluating the distance function? Is the CPU stalling while waiting for memory? Is there significant overhead in unpacking the values from the immutables in each recursive call, or does the compiler realize that those values are constant, etc.?
Some time ago I tried some profiling by building Julia with Intel's compiler and then using VTune. It's quite powerful because you can see a bunch of profiling data taken straight from the CPU. I should probably try that again.
Thanks for the link to nanoflann, I hadn't seen it before. Running something like
cloud = rand(3, 10^6)
tree = KDTree(cloud)
(@elapsed knn(tree, rand(3, 10^6), 1)) / 10^6
# 1.939144678e-6
it seems I do about one search per 2 microseconds for a 10^6-point 3D cloud, which seems to be in the same ballpark as their figures at https://github.com/jlblancoc/nanoflann#32-benchmark-original-flann-vs-nanoflann
When Base threading support gets mature enough, it would be cool to add some threading to the tree building and the searching. The tree builder shouldn't be too hard to thread, I think (just do a few iterations on one thread and then throw each subtree to a different thread).
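That build strategy (split serially for the first few levels, then hand each subtree to its own task) could look roughly like this, using the task-based threading API from later Julia versions; build_subtree is a placeholder for the real recursive node builder:

```julia
# Hypothetical sketch of a threaded k-d tree build. Threads.@spawn requires
# Julia 1.3+; at the time of this discussion Base threading was still
# experimental, which is why this is only a sketch.
using Base.Threads: @spawn

build_subtree(points) = length(points)  # placeholder "work": count the points

function threaded_build(points; serial_levels = 2)
    if serial_levels == 0 || length(points) < 2
        return build_subtree(points)  # small or deep enough: stay serial
    end
    mid = length(points) ÷ 2
    # Do the split on this thread, then hand one half to another task.
    left  = @spawn threaded_build(points[1:mid]; serial_levels = serial_levels - 1)
    right = threaded_build(points[mid+1:end]; serial_levels = serial_levels - 1)
    return fetch(left) + right
end
```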
If you have any other ideas or suggestions please let me know :)
Maybe you could get something useful from Linux perf events if you don't want to mess with ICC and VTune? Last time I looked briefly, perf seemed like it might be very impressive, but also with quite a learning curve. Maybe the tools have improved since then, though.
Maybe I should move some of this to a "performance benchmarking" issue in NearestNeighbors.jl and close this one?
By the way, I think you forgot to push your tag to NearestNeighbors.jl? I see it registered in METADATA.jl (thanks!) but the tag isn't public. I've moved my ICP code to NearestNeighbors and it's definitely better behaved for outliers now.
Thanks for telling me about the missing tag. I hopefully fixed it now.
I had some trouble getting VTune to work well again, but I didn't spend that much time on it. If I can't get it to work I will look at perf, but I remember VTune having a quite nice and intuitive interface.
Like you say, let's move further discussion to NearestNeighbors.jl
Closing in favor of KristofferC/NearestNeighbors.jl#7 . Thanks for all the information!
I'm using KDTrees for implementing an iterative closest point algorithm for point cloud matching. knn with k=1 works very well for this, but I've found that there can be a large performance hit when the two point clouds don't sample the same part of the geometry. Part of the problem seems to be that the low-dimensional approximation turned on with full_rec_dim isn't a good tradeoff in this case. Here's a test case:

Example output: