erikbern / ann-benchmarks

Benchmarks of approximate nearest neighbor libraries in Python
http://ann-benchmarks.com
MIT License

knn-graph construction benchmark #207

Open frankier opened 3 years ago

frankier commented 3 years ago

Firstly, thank you very much for creating this benchmark. It is very helpful for navigating this exciting but somewhat overwhelming field.

I am interested in particular in the performance of constructing a knn-graph from data points. I wonder whether this would be an interesting additional benchmark. I believe the results might differ somewhat from the existing results, since:

  1. It would combine query performance and index construction in one measure.
  2. It would put techniques which create such a graph as part of their construction (like NN-descent) at an advantage.

What do you think?

maumueller commented 3 years ago

What measure do you suggest? You could edit https://github.com/erikbern/ann-benchmarks/blob/master/ann_benchmarks/plotting/plot_variants.py to include different measures that relate build time and qps.

Note that the interactive website (e.g., http://ann-benchmarks.com/glove-100-angular_10_angular.html) might contain plots that are of interest here. One of them plots recall against index size divided by QPS, relating index size and query performance to each other. It would be easy to do the same with build time, but it's a bit difficult to interpret.

erikbern commented 3 years ago

You could sort of infer it by adding the index build time and the extrapolated query time for the full dataset. In a way I like this metric because it encapsulates everything (whereas currently we ignore the index building time in the benchmarks, as long as the index completes within 1h).
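Concretely, a minimal sketch of this combined metric, assuming you already have the build time and measured QPS from an existing run (the function name and example numbers are illustrative only, not part of ann-benchmarks):

```python
# Hypothetical helper: total cost of building the index and then
# querying every dataset point once at the measured throughput.
def total_knn_graph_cost(build_time_s: float, qps: float, n_points: int) -> float:
    return build_time_s + n_points / qps

# e.g. a 20-minute build at 5000 queries/second on a 1M-point dataset:
print(total_knn_graph_cost(1200.0, 5000.0, 1_000_000))  # -> 1400.0 seconds
```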

frankier commented 3 years ago

It's true that the (extrapolated) kNN query time is very general and probably the most useful benchmark for most people. However, kNN graph construction is also quite general: for example, it's the recommended way to do unsupervised learning in scikit-learn when there are many high-dimensional data points, including clustering and manifold learning. "Many high-dimensional points" means basically anything the built-in BallTree method doesn't scale to -- which is quite a lot of things. Here's some background:

I don't think "index construction time + NN query time x dataset size" simply gives the knn graph construction time: for some systems, constructing the index amounts to constructing a knn graph, so this formula would penalise those systems. See for example pynndescent: https://github.com/lmcinnes/pynndescent/blob/master/pynndescent/pynndescent_.py#L1963 where the fit_transform(...) method shortcuts querying and just copies the knn graph out of the index. The formula does provide an upper bound, however.

It is a bit annoying, since it becomes more difficult to show O(f(n))-type trends for varying dataset sizes. I would propose to ignore this issue and just use the existing datasets, which already show some spread of performance.

dkobak commented 2 years ago

Absolutely agree with @frankier that this would be a great additional feature. I am working on manifold learning / clustering, and what I am interested in is exactly the kNN graph construction time.

It can probably be approximated as "index build time + the extrapolated query time for the full dataset", as @erikbern suggested above. However, for some algorithms, such as PyNNDescent (@lmcinnes), one should not add the extrapolated query time but simply take the index build time alone.

If we had a list of algorithms for which this applies, then it would be very easy to add the required plots (as @maumueller wrote).
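A hedged sketch of what such a plot could look like, assuming per-algorithm result rows carrying recall, build time, and QPS, plus a hand-maintained set of algorithms whose index build already yields the knn graph; none of the names below are part of the ann-benchmarks code:

```python
import matplotlib.pyplot as plt

# Assumption from this thread: for these algorithms the knn graph falls
# out of the index build, so the query phase is not needed.
BUILDS_KNN_GRAPH = {"pynndescent", "kgraph"}

def knn_graph_time(row, n_points):
    if row["algorithm"] in BUILDS_KNN_GRAPH:
        return row["build_time"]
    return row["build_time"] + n_points / row["qps"]  # extrapolated queries

def plot_recall_vs_graph_time(rows, n_points):
    for row in rows:
        plt.scatter(row["recall"], knn_graph_time(row, n_points),
                    label=row["algorithm"])
    plt.xlabel("recall")
    plt.ylabel("knn-graph construction time (s)")
    plt.yscale("log")
    plt.legend()
    plt.show()
```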

maumueller commented 2 years ago

Thanks for bringing this up again @dkobak. I think I'm understanding @frankier's problem description a bit better now. Isn't it exactly what @erikbern recommends: the index build time plus the (extrapolated) query time if the dataset were used as the query set? This would be straightforward to implement from the existing data and doesn't even require re-running experiments.

None of the graph-based algorithms build the actual knn graph AFAIK, so the same rule should apply to them as well.

frankier commented 2 years ago

As said, PyNNDescent does build the knn graph, but it might be that most of the others don't. It would be nice to know whether to use PyNNDescent or another option in this case though.

It could also be that some implementations have cheaper queries when batching or similar is applied, so that building the knn graph via queries ends up cheaper.

maumueller commented 2 years ago

I would be very surprised if that statement about PyNNDescent is true, @frankier. @lmcinnes: can you confirm it? In my world, you don't get a well-performing graph-based search if you don't prune neighbors.

dkobak commented 2 years ago

I can confirm that PyNNDescent builds an actual kNN graph during build time. If you build it with k neighbors and later only want to have k (or fewer) neighbors, then you don't need to query() at all. That's how PyNNDescent is used in openTSNE (which I am more familiar with): https://github.com/pavlin-policar/openTSNE/blob/master/openTSNE/nearest_neighbors.py#L520 and also in UMAP.

It seems that the way PyNNDescent is run in ann-benchmarks is to use query() all the time. But you wouldn't actually do that in practice if you want to build a kNN graph (at least openTSNE/UMAP do not).

That said, I am not sure how exactly PyNNDescent is run here to obtain different recall values.

Would be great if @lmcinnes could clarify.

lmcinnes commented 2 years ago

PyNNDescent was built with knn-graph construction in mind as a use-case. That means that the index construction is actually split into two parts, with the second part optional. The first phase builds an (approximate) knn-graph. You can stop at that point and simply use that if the knn-graph was the intended result. The second phase (induced by a call to prepare, or a query on an unprepared index) does a bunch of graph pruning and optimization (as well as JIT compiling the actual query function). So yes, all queries run on a pruned version of the graph. On the other hand the full graph is available earlier if desired.

In the current ann-benchmarks setup I have pynndescent call prepare during index construction (since the benchmarks are query oriented, and it is sensible to ensure this phase is included in the index construction time if query performance is your goal). If, however, you are simply interested in knn-graph construction, then you can get that from pynndescent in less time than even the index construction time (since the prepare phase isn't required, but is a non-trivial part of the index construction time).
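To illustrate the two usage patterns described above, a minimal sketch assuming pynndescent is installed (parameter values are arbitrary):

```python
import numpy as np
from pynndescent import NNDescent

data = np.random.random((10_000, 100)).astype(np.float32)

# Phase 1 only: the (approximate) knn graph is available as soon as the
# index is built; prepare() is never called.
index = NNDescent(data, n_neighbors=10, metric="cosine")
knn_indices, knn_distances = index.neighbor_graph

# Query-oriented use, as in ann-benchmarks: prepare() prunes and optimizes
# the graph (and JIT-compiles the query path) before any queries run.
index.prepare()
query_indices, query_distances = index.query(data, k=10)
```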

dkobak commented 2 years ago

Thanks Leland, that's very helpful. Do you know if any other algorithms currently included in ann-benchmarks construct kNN graph when building the index?

lmcinnes commented 2 years ago

I know that kgraph does. My understanding is that the various NGT algorithms do (and cache it). The bottom layer of an HNSW index is an approximation to one, and with suitably chosen parameters (to avoid over-pruning -- assuming those parameters are exposed by the implementation) I believe you might be able to induce a reasonable knn-graph from it. I would have to check the details of the implementation, but at least in principle vamana (based on NSG) should be able to produce a knn graph at index-build time, again by skipping or tuning away the pruning.

I think if you wanted to create a different benchmark for knn-graph construction (and measure accuracy thereof) you could probably put together and tune a number of algorithms that could do it significantly faster than index build + query time. It is definitely a non-trivial amount of work to get that set up and the algorithms tuned of course.

dkobak commented 2 years ago

Thanks again. All good points.

vikigenius commented 1 year ago

Any updates on this? I am deciding between nndescent and hnsw for kNN graph construction, and it would be helpful if we had benchmarks for the actual graph construction. If anybody else has personal experience or benchmarks, that would also be helpful.

flying-sheep commented 1 year ago

So the Recall/Build time graph on https://ann-benchmarks.com/glove-100-angular_10_angular.html says that index building on faiss-ivf and bruteforce-blas is many orders of magnitude faster than the others? Or is there more to it?

maumueller commented 1 year ago

In general, faiss/scann have faster build times, but the plots are a bit misleading: an implementation can score well even if its queries run extremely slowly, because query time is not taken into account in this comparison.

I'm not sure how to plot these three parameters (recall/throughput/build time) in a good way.
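One possibility, sketched here purely as an illustration (the input format and names are made up, not a concrete proposal): encode throughput as marker color on a recall vs. build-time scatter.

```python
import matplotlib.pyplot as plt

def plot_three_params(recalls, build_times, qps_values):
    # Two axes carry recall and build time; color carries throughput.
    sc = plt.scatter(recalls, build_times, c=qps_values, cmap="viridis")
    plt.colorbar(sc, label="queries per second")
    plt.xlabel("recall")
    plt.ylabel("index build time (s)")
    plt.yscale("log")
    plt.show()
```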