Since our work mainly focuses on high-dimensional, large-scale data, we didn't test the algorithm on any dataset whose dimension is less than 96 (Deep1B).
According to our experiments, the following variables influence the algorithm's performance:
I suggest trying a wider KNN graph, e.g. a 200-NN or 300-NN graph; a 100-NN graph may not be wide enough. A larger L may also help. BTW, we updated the code today; the old code is outdated, so please try the new version.
@DelightRun Thanks for the comments. To make sure I understand:
For reference, here are the NSG parameters we used in our experiment on a 1-million-point, 128-D random dataset (also generated uniformly):
A small trick: you can use any other algorithm to generate the KNN graph. You can even use a GPU to build an exact KNN graph instead of using NNDescent, which can save a lot of time. Here is my gist snippet for exact KNN graph construction using FAISS on GPU: https://gist.github.com/DelightRun/e8bc139bcd2c588bdaea6652da4ff4af
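For concreteness, here is a minimal sketch of that approach using the FAISS Python bindings (an illustration only, not the gist itself; `build_exact_knn_graph` is a made-up name):

```python
# Minimal sketch of exact KNN-graph construction with FAISS on GPU.
# Assumptions: FAISS built with GPU support, base vectors are float32.
# build_exact_knn_graph is an illustrative name, not part of the linked gist.
import numpy as np
import faiss

def build_exact_knn_graph(base: np.ndarray, k: int) -> np.ndarray:
    """Return an (n, k) array of each base point's k nearest neighbors."""
    index = faiss.IndexFlatL2(base.shape[1])      # brute-force (exact) L2 search
    index = faiss.index_cpu_to_all_gpus(index)    # shard the search across GPUs
    index.add(base)
    # Ask for k+1 neighbors and drop column 0: in an exact search,
    # each point's nearest neighbor is itself (distance 0).
    _, ids = index.search(base, k + 1)
    return ids[:, 1:]

# e.g. knng = build_exact_knn_graph(base.astype('float32'), 300)
```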
OK, I will tune with these parameters.
I am starting with a 99% accurate kNN graph from NN-descent. Do you think 99% vs 100% KNN start graph makes a difference?
In our experiments, almost no difference.
I started with a 500-NN graph for a 30-dim dataset of 100K random points on the unit sphere. With L=2000, R=200, C=20000, and a query set of 1K random points on the unit sphere (Ls=500), recall@20 is 12.4, which is not much better than previous configurations. I got similar results (recall@100 saturating at 82) for another 100-dimensional dataset with 2.5M points; HNSW and FAISS get to 98 recall@100 on that dataset. Is it possible that the NSG graph is disconnected or not navigable for some datasets and dimensions?
It's very hard to say, but it is possible that the graph is disconnected. You can send me your dataset (including base, query, and ground truth) and let me have a try, if that is convenient for you.
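One quick way to check the disconnection hypothesis is a reachability test from the search entry point. A minimal sketch, assuming the graph has been loaded as adjacency lists (`reachable_from` is an illustrative helper, not part of the repo):

```python
# Minimal sketch: count how many nodes are reachable from the entry point
# of a directed graph stored as adjacency lists (e.g. an NSG loaded into
# memory). reachable_from is an illustrative name, not part of the repo.
from collections import deque

def reachable_from(entry: int, neighbors: list) -> int:
    seen = {entry}
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)

# If reachable_from(entry, graph) < len(graph), some base points can
# never be returned by greedy search, capping recall below 100%.
```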
Here are the random 30-dim base (100k) and query (1k) datasets in a zipfile. Thanks for offering to look at it. The 100-dim dataset is private.
I repeated the same experiment for random points in 32 dimensions. The recall@20 is 20. I think there is a power-of-2 bug in your code. The code seems to work well when the number of dimensions is a power of 2, but is broken for other dimensions.
Oh, I forgot to mention that. Since the distance computation uses AVX, you need to align the memory manually, i.e. pad the dimension to a multiple of 8 with zeros. I will add the related memory-alignment code to this repo soon.
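Until that lands, the padding itself is simple in numpy. A minimal sketch, assuming float32 data (`pad_to_multiple_of_8` is an illustrative name):

```python
# Minimal sketch, assuming float32 numpy arrays: pad each vector with
# zeros up to a multiple of 8 floats so the AVX code reads whole lanes.
# Zero padding leaves L2 distances between padded vectors unchanged.
import numpy as np

def pad_to_multiple_of_8(data: np.ndarray) -> np.ndarray:
    n, dim = data.shape
    padded_dim = (dim + 7) // 8 * 8               # e.g. 30 -> 32
    if padded_dim == dim:
        return data
    out = np.zeros((n, padded_dim), dtype=np.float32)
    out[:, :dim] = data
    return out
```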
I generated a set of n points drawn uniformly at random from the unit D-dim sphere, and a query dataset of n/10 points from the same distribution. I used NN-descent from efanna to generate an approximate kNN graph (with 98% recall@100), then ran NSG index construction with L=200, R=200, C=600, and NSG search with Ls=500, K=100.
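For reference, a minimal sketch of that data generation, assuming numpy (`random_unit_sphere` is an illustrative name; normalizing standard Gaussian samples yields the uniform distribution on the sphere):

```python
# Sketch of the data generation above: points uniform on the unit
# D-dim sphere via normalized Gaussian samples. Writing the arrays
# out (e.g. in fvecs format) is omitted.
import numpy as np

def random_unit_sphere(n: int, dim: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    points = rng.standard_normal((n, dim)).astype(np.float32)
    points /= np.linalg.norm(points, axis=1, keepdims=True)
    return points

base = random_unit_sphere(100_000, 30)    # n base points
queries = random_unit_sphere(10_000, 30)  # n/10 query points
```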
The recall@100 for D=128 dimensions is in the high 90s, which is good. However, recall@100 for D=30 dimensions does not go past 67%. I have tuned most of the parameters exposed on the command line. This problem also arises with grid graphs in low dimensions and with other d<100-dimensional datasets we have internally.
Do you face similar problems, or are you able to get high recall for random D<50-dimensional base and query datasets?
Thanks!