DBAIWangGroup / nns_benchmark

Benchmark of Nearest Neighbor Search on High Dimensional Data

NNS Benchmark: Evaluating Approximate Nearest Neighbor Search Algorithms in High Dimensional Euclidean Space

Nearest neighbor search (NNS) is a fundamental and essential operation in applications from many domains, such as databases, machine learning, multimedia, and computer vision. Hundreds of algorithms have been proposed to attack the problem from different perspectives, yet there is no open and comprehensive comparison. By "comprehensive", we mean using state-of-the-art algorithms proposed from different research areas, and evaluating them on a diverse range of datasets.

To aid researchers and practitioners who work on this problem, or whose work depends on it, we set up this benchmark for NNS based on the Euclidean distance on high-dimensional data. The benefit is twofold:

We would also like to leverage the whole community to collaboratively build and maintain this benchmark, for example by submitting new algorithms or useful datasets, or by offering suggestions and improvements.

Scope

We limit the scope of this benchmark by imposing the following constraints.

Currently, we evaluate 15 representative NNS algorithms on 20 datasets; details are reported in our NNS experimental evaluation paper [1].

Algorithms Evaluated

Thanks to the original authors, all the algorithms considered in this benchmark have publicly available source code. Algorithms are implemented in C++ unless otherwise specified. We carefully went through all the implementations and made the modifications necessary for a fair comparison. For instance, we re-implemented the search process of some algorithms in C++, and we disabled multi-threading, SIMD instructions, fast-math, and hardware prefetching.
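To make this uniform execution setting concrete, below is a minimal sketch, not taken from this repository, of the kind of plain, single-threaded, scalar linear scan that these constraints amount to; the function names and layout are our own illustrative choices.

```cpp
// Illustrative single-threaded, scalar exact k-NN linear scan:
// no SIMD intrinsics, no OpenMP, no explicit prefetching.
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Squared Euclidean distance, written as a plain scalar loop.
float l2_sqr(const float* a, const float* b, std::size_t d) {
    float s = 0.0f;
    for (std::size_t i = 0; i < d; ++i) {
        float diff = a[i] - b[i];
        s += diff * diff;
    }
    return s;
}

// Exact k nearest neighbors of `query` among n points stored row-major in `data`.
std::vector<std::pair<float, std::size_t>> knn_scan(
        const float* data, std::size_t n, std::size_t d,
        const float* query, std::size_t k) {
    std::priority_queue<std::pair<float, std::size_t>> heap;  // max-heap on distance
    for (std::size_t i = 0; i < n; ++i) {
        float dist = l2_sqr(data + i * d, query, d);
        if (heap.size() < k) {
            heap.emplace(dist, i);
        } else if (dist < heap.top().first) {
            heap.pop();
            heap.emplace(dist, i);
        }
    }
    std::vector<std::pair<float, std::size_t>> result;
    while (!heap.empty()) {
        result.push_back(heap.top());  // farthest of the k first
        heap.pop();
    }
    return result;
}
```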

Below are brief introductions to the algorithms evaluated in the benchmark, as well as our revisions. The algorithms are grouped into three categories, with the space partitioning-based methods further divided into two subcategories.

1. Locality Sensitive Hashing (LSH)-Based Methods

2. Space Partitioning-based Methods

2.1. Encoding-Based Space Partitioning Methods

2.2. Tree-Based Space Partitioning Methods

3. Neighborhood-Based Methods

Datasets Used

Currently, we use 20 datasets (18 real and 2 synthetic), covering image, audio, video, and text data.

Table 1 summarizes the characteristics of the datasets, including the number of data points (n), dimensionality (d), relative contrast (RC [18]), local intrinsic dimensionality (LID [19]), and data type, where RC and LID are used to describe the hardness of the datasets; a sketch of how these two measures can be estimated is given after the table.

| Name   | n (×10^3) | d     | RC   | LID  | Type      |
|--------|-----------|-------|------|------|-----------|
| Nus*   | 269       | 500   | 1.67 | 24.5 | Image     |
| Gist*  | 983       | 960   | 1.94 | 18.9 | Image     |
| Rand*  | 1,000     | 100   | 3.05 | 58.7 | Synthetic |
| Glove* | 1,192     | 100   | 1.82 | 20.0 | Text      |
| Cifa   | 50        | 512   | 1.97 | 9.0  | Image     |
| Audio  | 53        | 192   | 2.97 | 5.6  | Audio     |
| Mnist  | 69        | 784   | 2.38 | 6.5  | Image     |
| Sun    | 79        | 512   | 1.94 | 9.9  | Image     |
| Enron  | 95        | 1,369 | 6.39 | 11.7 | Text      |
| Trevi  | 100       | 4,096 | 2.95 | 9.2  | Image     |
| Notre  | 333       | 128   | 3.22 | 9.0  | Image     |
| Yout   | 346       | 1,770 | 2.29 | 12.6 | Video     |
| Msong  | 922       | 420   | 3.81 | 9.5  | Audio     |
| Sift   | 994       | 128   | 3.50 | 9.3  | Image     |
| Deep   | 1,000     | 128   | 1.96 | 12.1 | Image     |
| Ben    | 1,098     | 128   | 1.96 | 8.3  | Image     |
| Imag   | 2,340     | 150   | 2.54 | 11.6 | Image     |
| Gauss  | 2,000     | 512   | 3.36 | 19.6 | Synthetic |
| UQ-V   | 3,038     | 256   | 8.39 | 7.2  | Video     |
| BANN   | 10,000    | 128   | 2.60 | 10.3 | Image     |

Table 1: Dataset Summary

We mark the first four datasets in Table 1 with asterisks to indicate that they are hard datasets compared with the others.
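For concreteness, below is a minimal sketch of common per-query estimators for these two hardness measures, assuming the usual definitions: RC as the ratio of the mean distance from a query to all data points over the distance to its nearest neighbor [18], and LID via a maximum-likelihood (Hill-type) estimator over the k nearest-neighbor distances [19]. This is not code from this repository, and the dataset-level values in Table 1 would additionally be aggregated (e.g., averaged) over a set of queries.

```cpp
// Illustrative per-query estimators of the hardness measures in Table 1.
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Relative contrast for one query: mean distance to all data points divided by
// the distance to the nearest neighbor [18]. Assumes the query is not itself a
// data point, so the minimum distance is non-zero.
double relative_contrast(const std::vector<double>& dists) {
    double mean = std::accumulate(dists.begin(), dists.end(), 0.0) / dists.size();
    double d_min = *std::min_element(dists.begin(), dists.end());
    return mean / d_min;
}

// Maximum-likelihood LID estimate for one query from its k nearest-neighbor
// distances r_1 <= ... <= r_k [19]:
//   LID = -( (1/k) * sum_i ln(r_i / r_k) )^{-1}
// Assumes all distances are strictly positive.
double lid_mle(std::vector<double> knn_dists) {
    std::sort(knn_dists.begin(), knn_dists.end());
    double r_k = knn_dists.back();
    double sum = 0.0;
    for (double r : knn_dists) sum += std::log(r / r_k);
    return -1.0 / (sum / static_cast<double>(knn_dists.size()));
}
```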

Below, we give more descriptions of these datasets.

Performance Evaluation and Analyses

Please refer to our NNS experimental evaluation paper [1].

License

Our own code is released under the Apache License Version 2.0. Copyright is owned by DBWang Group (University of New South Wales, Australia) and Database group at QCIS, UTS (Centre for Quantum Computation & Intelligent Systems, The University of Technology Sydney, Australia).

Below is the license information for the included implementations:

  1. KGraph: BSD license. Users are still encouraged to download the binary distributions from its home site so building issues can be avoided.

  2. NMSLib and Annoy: Apache License Version 2.0.

  3. AGH: License information below

Terms of Use
--
Copyright (c) 2009-2011 by
--
DVMM Laboratory
Department of Electrical Engineering
Columbia University
Rm 1312 S.W. Mudd, 500 West 120th Street
New York, NY 10027
USA
--
If it is your intention to use this code for non-commercial purposes, such as in academic research, this code is free.
--
If you use this code in your research, please acknowledge the authors, and cite our related publication:
--

Wei Liu, Jun Wang, Sanjiv Kumar, and Shih-Fu Chang, "Hashing with Graphs," International Conference on Machine Learning (ICML), Bellevue, Washington, USA, 2011
  4. SRS: GPL License.

  5. FLANN: BSD License.

  6. RCT: The authors granted us permission (via email) to release the source code.

  7. Algorithms whose license information is not mentioned here: NSH, OPQ, QALSH, SGH, and SH.

Recall vs Speedup

Here, we present the recall versus speedup of all the algorithms on four datasets, where k is set to 20.

Figures: recall vs. speedup on Sift, Gist, Glove, and MillionSong.
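For reference, below is a minimal sketch of how these two quantities are commonly computed; this is not the benchmark's evaluation code, and the convention for speedup (relative to a brute-force linear scan) is an assumption stated here rather than a quote from the paper. Recall is the fraction of the k returned points that are among the true k nearest neighbors.

```cpp
// Illustrative per-query recall@k and speedup computation.
#include <cstddef>
#include <unordered_set>
#include <vector>

// Fraction of returned ids that appear among the true k nearest neighbors.
double recall_at_k(const std::vector<std::size_t>& returned,
                   const std::vector<std::size_t>& ground_truth) {
    std::unordered_set<std::size_t> truth(ground_truth.begin(), ground_truth.end());
    std::size_t hits = 0;
    for (std::size_t id : returned) hits += truth.count(id);
    return static_cast<double>(hits) / ground_truth.size();
}

// Speedup relative to an exact linear scan (assumed convention):
// query time of the brute-force scan divided by the algorithm's query time.
double speedup(double linear_scan_seconds, double algorithm_seconds) {
    return linear_scan_seconds / algorithm_seconds;
}
```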

REFERENCES

[1] W. Li, Y. Zhang, Y. Sun, W. Wang, W. Zhang, and X. Lin. Nearest Neighbor Search on High Dimensional Data — Experiments, Analyses, and Improvement (v1.0). CoRR, abs/1610.02455, 2016. Online version

[2] Q. Huang, J. Feng, Y. Zhang, Q. Fang, and W. Ng. Query-aware locality-sensitive hashing for approximate nearest neighbor search. PVLDB, 9(1):1–12, 2015.

[3] Y. Sun, W. Wang, J. Qin, Y. Zhang, and X. Lin. SRS: solving c-approximate nearest neighbor queries in high dimensional Euclidean space with a tiny index. PVLDB, 8(1):1–12, 2014.

[4] Q. Jiang and W. Li. Scalable graph hashing with feature transformation. In IJCAI, pages 2248–2254, 2015.

[5] W. Liu, J. Wang, S. Kumar, and S. Chang. Hashing with graphs. In ICML, pages 1–8, 2011.

[6] Y. Park, M. J. Cafarella, and B. Mozafari. Neighbor-sensitive hashing. PVLDB, 9(3):144–155, 2015.

[7] J. Gao, H. V. Jagadish, B. C. Ooi, and S. Wang. Selective hashing: Closing the gap between radius search and k-nn search. In SIGKDD, 2015.

[8] T. Ge, K. He, Q. Ke, and J. Sun. Optimized product quantization. IEEE Trans. Pattern Anal. Mach. Intell., 36(4):744–755, 2014.

[9] B. Naidan, L. Boytsov, and E. Nyberg. Permutation search methods are efficient, yet faster search is possible. PVLDB, 8(12):1618–1629, 2015.

[10] M. Muja and D. G. Lowe. Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans. Pattern Anal. Mach. Intell., 36(11):2227–2240, 2014.

[11] E. Bernhardsson. Annoy at GitHub.

[12] L. Boytsov and B. Naidan. Learning to prune in metric and non-metric spaces. In NIPS, 2013.

[13] Y. Malkov, A. Ponomarenko, A. Logvinov, and V. Krylov. Approximate nearest neighbor algorithm based on navigable small world graphs. Inf. Syst., 45:61–68, 2014.

[14] M. E. Houle and M. Nett. Rank-based similarity search: Reducing the dimensional dependence. IEEE TPAMI, 37(1):136–150, 2015.

[15] W. Dong. KGraph website.

[16] W. Dong, M. Charikar, and K. Li. Efficient k-nearest neighbor graph construction for generic similarity measures. In WWW, 2011.

[17] A. Babenko and V. S. Lempitsky. The inverted multi-index. In CVPR, pages 3069–3076, 2012.

[18] J. He, S. Kumar, and S. Chang. On the difficulty of nearest neighbor search. In ICML, 2012.

[19] L. Amsaleg, O. Chelly, T. Furon, S. Girard, M. E. Houle, K. Kawarabayashi, and M. Nett. Estimating local intrinsic dimensionality. In SIGKDD, 2015.

[20] Yu. A. Malkov and D. A. Yashunin. Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs. arXiv, 2016.

[21] C. Fu, C. Wang, and D. Cai. Fast approximate nearest neighbor search with navigating spreading-out graphs. PVLDB, pages 461–474, 2019.