ZJULearning / nsg

Navigating Spreading-out Graph For Approximate Nearest Neighbor Search

some questions about NSG #11

Closed · half-empty closed 5 years ago

half-empty commented 5 years ago
  1. Find the approximate medoid of the dataset by searching the kNN graph with the greedy best-first search algorithm (Algorithm 1). Why not compare all other nodes to choose the closest one?
  2. The dataset occupies a lot of memory, and a PQ metric is mentioned in the code. Will PQ support be added in the future?
  3. Is NSG-Naive much faster than NSG at the indexing stage? Have you compared memory footprint, data-preprocessing time, precision, and response time in different scenarios?
  4. Have you tried the kNN graph built by LargeVis? And how does it compare with Faiss?
  5. Incremental indexing is mentioned in the paper; how is it done? For now, is the only option to split the dataset and update one partition at a time to avoid full rebuilds?

Best wishes.

fc731097343 commented 5 years ago
  1. Because comparing all the nodes is very time-consuming on large datasets.
  2. Enhancing NSG with PQ or scalar quantization will be considered soon.
  3. Yes, NSG-Naive is much faster at indexing, and its memory occupancy is the same as NSG's. See our paper for details on precision.
  4. LargeVis's kNN-graph construction algorithm is much slower than nn-descent (kGraph). In our study, kGraph is the fastest on CPU for small datasets, but Faiss is the fastest when a GPU is used.
  5. Incremental indexing can be done by searching for the nearest neighbors of the new node and then pruning. When updating neighbors, we need to consider both the new node and its nearest neighbors, because the new node can be a potential new neighbor of old nodes. The efficiency of this process should be evaluated carefully; then we will add it to this library as a new feature.
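
For concreteness, here is a minimal sketch of that insertion idea, assuming an adjacency-list graph and squared Euclidean distance. The candidate search is brute force purely for illustration (in practice it would be a graph search), and all names here (`insert_point`, `mrng_prune`, etc.) are hypothetical, not part of this repository:

```cpp
// Minimal sketch of incremental insertion; illustrative, not the library's API.
#include <algorithm>
#include <cstddef>
#include <vector>

using Point = std::vector<float>;
using Graph = std::vector<std::vector<int>>;  // adjacency lists: node i -> neighbors

static float l2(const Point& a, const Point& b) {
  float s = 0.f;
  for (std::size_t i = 0; i < a.size(); ++i) {
    float d = a[i] - b[i];
    s += d * d;
  }
  return s;
}

// MRNG-style pruning: walking candidates in order of distance to q,
// keep one only if no already-kept neighbor is closer to it than q is.
static std::vector<int> mrng_prune(const std::vector<Point>& pts, const Point& q,
                                   const std::vector<int>& cand,
                                   std::size_t max_deg) {
  std::vector<int> kept;
  for (int c : cand) {
    bool ok = true;
    for (int k : kept)
      if (l2(pts[c], pts[k]) < l2(pts[c], q)) { ok = false; break; }
    if (ok) kept.push_back(c);
    if (kept.size() == max_deg) break;
  }
  return kept;
}

// Insert q: gather candidates (brute force here; a graph search in practice),
// prune them, then add reverse edges so old nodes can reach the new one.
void insert_point(std::vector<Point>& pts, Graph& g, const Point& q,
                  std::size_t max_deg) {
  std::vector<int> cand(pts.size());
  for (std::size_t i = 0; i < pts.size(); ++i) cand[i] = static_cast<int>(i);
  std::sort(cand.begin(), cand.end(),
            [&](int a, int b) { return l2(pts[a], q) < l2(pts[b], q); });
  std::vector<int> kept = mrng_prune(pts, q, cand, max_deg);
  int id = static_cast<int>(pts.size());
  pts.push_back(q);
  g.push_back(kept);
  for (int k : kept) g[k].push_back(id);  // reverse edges; re-prune if degree overflows
}
```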
half-empty commented 5 years ago

Thanks.

  1. The time complexity of "compare the centroid with all other nodes to choose the closest one" is O(n). That is much larger than the greedy best-first search algorithm (Algorithm 1), but much smaller than "search-collect-select". It would be acceptable if it brought better performance (see the sketch after this list).
  2. As for points 2 and 5, I'm looking forward to them. :)
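
For reference, a minimal sketch of that O(n) scan, assuming squared Euclidean distance and a non-empty dataset; the function name `exact_medoid` is illustrative:

```cpp
// Exact medoid in O(n * dim): average all points, then one linear scan
// for the node closest to that centroid. Illustrative sketch only;
// assumes pts is non-empty.
#include <cstddef>
#include <limits>
#include <vector>

int exact_medoid(const std::vector<std::vector<float>>& pts) {
  const std::size_t n = pts.size(), dim = pts[0].size();
  std::vector<float> centroid(dim, 0.f);
  for (const auto& p : pts)
    for (std::size_t d = 0; d < dim; ++d) centroid[d] += p[d];
  for (std::size_t d = 0; d < dim; ++d) centroid[d] /= static_cast<float>(n);

  int best = 0;
  float best_dist = std::numeric_limits<float>::max();
  for (std::size_t i = 0; i < n; ++i) {  // single O(n) pass over the dataset
    float s = 0.f;
    for (std::size_t d = 0; d < dim; ++d) {
      const float diff = pts[i][d] - centroid[d];
      s += diff * diff;
    }
    if (s < best_dist) { best_dist = s; best = static_cast<int>(i); }
  }
  return best;  // index of the node nearest the centroid
}
```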
half-empty commented 5 years ago

After reading the source code, I have two more questions.

  1. The greedy best-first search is not implemented exactly as described in the paper; it is more like a DFS-plus-pruning approach. But isn't the search range too wide? In the source code, the query is compared against the already-selected neighbors of the initial center, and the farthest of those distances is used as a threshold. That is effectively a search bounded by a hypersphere centered on the query with that farthest distance as its radius, but this farthest distance can easily be very large, so the search in fact traverses a huge number of points. Is this an optimization for accuracy? After all, the transitivity of similarity is actually quite weak in high-dimensional spaces.

  2. A commit two months ago commented out the tree_grow part, the DFS spanning tree that guarantees connectivity, and only added completing each node's neighbor list through the bidirectionality of edges. Why is DFS no longer used? (See the sketch after this comment.)

If I have misunderstood the code, please let me know. Thanks.
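
For reference, here is a minimal sketch of a DFS connectivity pass in the spirit of the tree_grow step described in the NSG paper. Attaching stray components directly to the navigating node is a simplification made here for brevity; the paper instead searches for a nearby node already in the tree:

```cpp
// Sketch of a tree_grow-style pass: DFS from the navigating node; whenever
// some node is unreachable, attach its component and continue. Illustrative
// only; the attachment rule is simplified.
#include <cstddef>
#include <stack>
#include <vector>

using Graph = std::vector<std::vector<int>>;

void tree_grow(Graph& g, int root) {
  std::vector<bool> seen(g.size(), false);
  auto dfs = [&](int start) {
    std::stack<int> st;
    st.push(start);
    seen[start] = true;
    while (!st.empty()) {
      int u = st.top();
      st.pop();
      for (int v : g[u])
        if (!seen[v]) { seen[v] = true; st.push(v); }
    }
  };
  dfs(root);
  for (std::size_t i = 0; i < g.size(); ++i) {
    if (!seen[i]) {                            // found a disconnected node
      g[root].push_back(static_cast<int>(i));  // attach it (simplified choice)
      dfs(static_cast<int>(i));                // mark its whole component
    }
  }
}
```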

fc731097343 commented 5 years ago
  1. This implementation does follow the description in the paper: the DFS-like procedure is consistent with the main idea of the greedy best-first search there. The paper omits some small details to make the algorithm look more general and universally applicable; the version in the current code is actually an optimized greedy best-first search, which is comparatively faster and more accurate. It is not bounded by a farthest-distance radius. Rather, as the paper says, it starts from the navigating node, collects all potential candidates near the query point, and then selects among them for pruning. The number of traversed points has to be large to guarantee the properties described in the paper; this is done to approximate our proposed MRNG as closely as possible. (See the sketch after this list.)
  2. This part of the code is currently maintained by a junior labmate of mine. I have been asking him to push the latest version; in fact, the latest version on hand is the one used in our newest paper and can run on deep100M. He is still optimizing the tree_grow part, and he seems tied up lately (he is about to graduate), so please be patient; after the upcoming submission deadline he should clean it up. Essentially there is little difference: the latest version only fixes performance problems on very large datasets caused by frequent kernel traps and frequent allocation and freeing of large memory blocks; the underlying algorithm is unchanged.
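
To make the search-collect-select idea concrete, here is a minimal sketch of a pool-based best-first walk under my own simplified naming (not the repository's get_neighbors or search code): a sorted pool of up to L candidates is repeatedly expanded from its closest unexpanded entry. In construction mode every visited point is additionally collected for the later MRNG-style selection, and the real construction code also lets L itself grow (the ++L line discussed in the next comment), while the query-time search keeps L fixed:

```cpp
// Illustrative pool-based best-first walk; names and containers are mine.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

struct Neighbor { int id; float dist; bool expanded; };

// g: adjacency lists; nav: navigating node; L: pool size;
// dist(id): distance from the query to node id;
// fullset: if non-null (construction mode), collects every visited node.
std::vector<Neighbor> greedy_search(const std::vector<std::vector<int>>& g,
                                    int nav, std::size_t L,
                                    const std::function<float(int)>& dist,
                                    std::vector<Neighbor>* fullset) {
  std::vector<bool> visited(g.size(), false);
  std::vector<Neighbor> pool{{nav, dist(nav), false}};
  visited[nav] = true;
  bool progress = true;
  while (progress) {
    progress = false;
    for (std::size_t i = 0; i < pool.size(); ++i) {
      if (pool[i].expanded) continue;
      pool[i].expanded = true;                // expand closest unexpanded entry
      for (int v : g[pool[i].id]) {
        if (visited[v]) continue;
        visited[v] = true;
        Neighbor nb{v, dist(v), false};
        if (fullset) fullset->push_back(nb);  // construction: collect everything
        pool.push_back(nb);
      }
      std::sort(pool.begin(), pool.end(),
                [](const Neighbor& a, const Neighbor& b) { return a.dist < b.dist; });
      if (pool.size() > L) pool.resize(L);    // search keeps L fixed; construction
                                              // additionally grows L itself here
      progress = true;
      break;                                  // rescan from the front of the pool
    }
  }
  return pool;
}
```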
half-empty commented 5 years ago

Thanks for the explanation. Earlier, when I saw that the search and get_neighbors code were very similar, I did not look closely. Now I see that "if(L+1 < retset.size()) ++L;" has been removed in search: the number of traversed points is large at the graph-construction stage but not at the search stage, where L never grows, so search is highly efficient.