Recall for rand dataset with d=30

harsha-simhadri commented 5 years ago

I generated a set of n points which are drawn uniformly at random from the unit D-dim sphere. I generated a query dataset from the same distribution with n/10 points. I used nndescent from efanna to generate approx-NN graph (with 98%recall@100) and then used NSG index construction with L=200, R=200, C=600. I used NSG search with Ls=500, K=100.

The recall@100 for D=128 dimensions is in the high 90s, which is good. However, recall@100 for D=30 dimensions does not go past 67%. I have tuned most of the parameters exposed in the command line. This problem also arises with grid graphs in low dimensions and other d<100 dimension datasets we have internally.

Do you face similar problems or are you able to get high recall for random D<50 dimensional base and query datasets?

Thanks!

DelightRun commented 5 years ago

Since our work is mainly focus on high-dimensional large-scale data, we didn't test the algorithm on any datasets who's dimension is less than 96(Deep1E).

According to our experiments, there will be those variables that influence the algorithm's performance:

the width(the number of neighbors) and recall of KNN graph, especially, a very small width may break the connectivity of KNNG, which will cause recall's upper bound is very low.
the max out-degree limit R, which should be large enough to guarantee a high recall, but too large R will decrease the search performance.

I suggest you trying a wider KNN graph, i.e. 200-NN or 300-NN graph, 100-NN graph maybe not wide enough. Also a larger L may help. BTW, we update the code today, the old code is outdated, you can have a try with this new version.

harsha-simhadri commented 5 years ago

@DelightRun Thanks for the comments. To make sure I understand your comments:

I start with 300-NN approx-NN graph from efanna's nndescent
For nsg_index, I set L larger than 200. How large do you suggest? 500? 1000? What about R and C?

DelightRun commented 5 years ago

For L, you can do some grid search to get a good value.
For R, actually it depends on the intrinsic dimension of the dataset. I think you can try 30, 50, 100, 200 respectively.
For C, in most case it is useless for million-scale dataset, it is designed for very very large dataset such as Deep1B. You can just use a large number such as 10000.

For reference, here's the NSG parameters we used in our 1-million 128-D random dataset(also generate uniformly) experiment:

L = 2000
R = 100
C = 20000

A small trick is that, you can use any other algorithms to generate KNN graph. You can even use GPU to generate accurate KNN graph instead of NNDescent, this is can save a lot of time. Here is my gist snippet for accurate KNN graph construction using FAISS and GPU: https://gist.github.com/DelightRun/e8bc139bcd2c588bdaea6652da4ff4af

harsha-simhadri commented 5 years ago

OK, I will tune with these parameters.

I am starting with a 99% accurate kNN graph from NN-descent. Do you think 99% vs 100% KNN start graph makes a difference?

DelightRun commented 5 years ago

In our experiments, almost no difference.

harsha-simhadri commented 5 years ago

I started with 500-NN for a 30-dim dataset of 100K random points on unit sphere. With L=2000, R=200, C=20000, and query set of 1K random points on unit sphere (Ls=500), recall@20 is 12.4, which is not much better than previous configurations. I got similar results -- recall@100 saturating at 82 -- for another 100 dimensional dataset with 2.5M points; hnsw and FAISS get to 98 recall@100 for this dataset. Is it possible that the NSG graph is disconnected or not navigable for some datasets and dimensions?

DelightRun commented 5 years ago

It's very hard to say, but it does possible that the graph is disconnected. You can send me your dataset (including base, query and groundtruth) and let me have a try, if it is convenient for you.

harsha-simhadri commented 5 years ago

Here are the random 30-dim base (100k) and query (1k) datasets in a zipfile. Thanks for offering to look at it. The 100-dim dataset is private.

rand_d30.zip

harsha-simhadri commented 5 years ago

I repeated the same experiment for random points in 32 dimensions. The recall@20 is 20. I think there is a power of 2 bug in your code. The code seems to work well when number of dimension is 2^d, but is broken for other dimensions

DelightRun commented 5 years ago

Oh, I forgot to talk about that. Since the distance computation is using AVX, so you need to align the memory manually, i.e. align the dimension to multiple of 8 by padding zeros. I will add the related memory align code to this repo soon.

ZJULearning / nsg

Recall for rand dataset with d=30 #15