HuangQiang / BC-Tree

BC-Tree and Ball-Tree for Point-to-Hyperplane NNS (ICDE 2023)
MIT License

about query #1

Open Sunshine066 opened 1 year ago

Sunshine066 commented 1 year ago

Excuse me, is there no query function in ball_tree? How do I get the index of a query's nearest neighbor vector? This is my code, where XX is a double* of size N*D, X_test is a double**, N is 10, and D is 512:

int leaf = 40;
p2h::Ball_Tree<double>* tree = new p2h::Ball_Tree<double>(N, D, leaf, XX);
tree->display();
p2h::MinK_List* list = new p2h::MinK_List(10);
std::cout << "   nns   " << tree->nns(10, 10, 1, X_test[0], list) << std::endl;

How do I get X_test[0]'s nearest neighbor index? Thanks!

HuangQiang commented 1 year ago

Thanks for your interest!

nns() only returns the number of candidates checked for the query; it does not return the NN indices.

In your case, the NN indices are stored in list. See methods/pri_queue.h and use the function ith_id() to retrieve the indices one by one.

Please let me know if you run into other issues. Thanks!

Sunshine066 commented 1 year ago
int leaf = 40;
p2h::Ball_Tree<double>* tree = new p2h::Ball_Tree<double>(N, D, leaf, XX);
tree->display();
p2h::MinK_List* list = new p2h::MinK_List(10);
std::cout << "   nns   " << tree->nns(10, 10, 1, X_test[0], list) << std::endl;
std::cout << list->max_key() << std::endl;
std::cout << list->min_key() << std::endl;
std::cout << list->ith_id(1) << std::endl;

I used ith_id(), but it does not return the value I expected; the correct nearest index for my X_test[0] is the fifth vector. The input of ith_id() is an int, so what should I pass? How should I use ith_id() to get the nearest index of X_test[0] (4 if indices start at 0, or 5 if they start at 1)?

Sunshine066 commented 1 year ago

The problem I want to solve: the dataset has 10 vectors of dimension 512. I want to build a ball tree and use it to query a vector and get its index (0~9 or 1~10).

HuangQiang commented 1 year ago

First of all, the nns() of the ball tree returns the approximate results, not the exact ones.

The input parameter of ith_id() (an int) starts from 0: call ith_id(0) if you want the nearest neighbor that the ball tree found, and similarly ith_id(1) for the 2nd nearest neighbor. The valid range is [0, k-1].

Moreover, the returned value starts from 0, and the returned range is [0, n-1].

Besides, max_key() and min_key() return the largest and smallest distances the ball tree found, and ith_key(i) returns the (i+1)-th nearest distance.

By the way, I am not clear about the dimensionality of your test queries. For the problem of P2HNNS, please ensure that the query is one dimension higher than the data. Thanks!

Sunshine066 commented 1 year ago

> For the problem of P2HNNS, please ensure that the query should be one dimension higher than the data.

Why? The data used to create the ball tree is 10 vectors of 512 dimensions; should the query then be 1 vector of 513 dimensions?

Sunshine066 commented 1 year ago

Now, in my code, XX is a double* of size N*D and is used to create the ball tree; X_test is a double* of size 1*D. N is the number of vectors and D is the dimension.

The tree-creation code:

p2h::Ball_Tree<double>* tree = new p2h::Ball_Tree<double>(N, D, leaf, XX);

The code to find the minimum-distance index:

p2h::MinK_List* list = new p2h::MinK_List(1);
std::cout << "   nns   " << tree->nns(10, 10, 1, X_test[0], list) << std::endl;
std::cout << list->ith_id(0) << std::endl;

The last line should print the minimum-distance index, but my code gives an incorrect result.

HuangQiang commented 1 year ago

Before answering your questions, I should let you know that this repo targets the problem of Point-to-Hyperplane Nearest Neighbor Search (P2HNNS). For P2HNNS, the input datasets are data points in $R^{D-1}$ space. In this repo, I add one more dimension after loading the data, so in ap2h.h the datasets also contain $D$ dimensions. Different from the traditional point-to-point NNS problem, the queries are NOT data points but hyperplane queries in $R^D$ space. Moreover, the distance is not the commonly used Euclidean distance but the point-to-hyperplane distance.

So for your issue, you might first check that the distance function we use is the one your case requires; otherwise, you indeed cannot get correct results. Second, check that the query (X_test) is a hyperplane query, not a plain data point. Third, nns() only accepts float* for the hyperplane query, so you may need to change double* to float*.

Besides, although this is not your test case, I also want to clarify that this repo provides approximate (not exact) methods for P2HNNS. Even if you do everything correctly, the ball tree we provide does not check all branches and allows an approximation factor, so it only gives approximate results. Thus, it is reasonable if you cannot get the exact results.

Sunshine066 commented 1 year ago

Thank you very much for your reply. I have realized where my problem lies. I wish you success in your work, life, and studies!

Sunshine066 commented 1 year ago

Ha ha, following your hint, I modified the distance method, and your method is now adapted to my point-to-point problem using your code architecture. Compared with the ball tree implementation I used before: construction time 91 ms vs 1140 ms, nns time <1 ms vs 23 ms. Your method wins clearly. Thank you!

Sunshine066 commented 1 year ago

Hello, referring to your project I have implemented the functionality I wanted, thank you. I would like to ask: in your understanding of the ball tree, can it support incremental learning? My understanding is that to add one data point incrementally, I only need to insert it into the appropriate leaf node and update the node counts, rather than reconstruct the ball tree from the full (N+1)*D data.

HuangQiang commented 1 year ago

Thank you for your assessments and your appreciation.

For the point-to-point problem you aim to deal with, you might also need to carefully set up the new lower bound (or upper bound) if you consider the approximate case, as the current lower bound I used in ball_node.h is designed for P2H distance.

For your follow-up question, my suggestion is that the ball tree can support incremental updating (or learning) if the data distribution does not change too much; otherwise, the ball tree might not be the best choice because the pivot points we have chosen for data splitting when building the ball tree will become less effective, and the ball tree might be extremely imbalanced. As such, the nns() will also become less efficient.

An alternative option is that you apply the ball tree for incremental updating (or learning) for only a short period (e.g., one week or 10 days). Then, you might reconstruct the ball tree for the whole dataset (offline) so that we could choose more promising pivots for data splitting.

Sunshine066 commented 1 year ago

Thank you very much for your advice; it has clarified my thinking for future projects. Best wishes to you!

Sunshine066 commented 1 year ago

Hello, I have a new problem with nns():

p2h::Ball_Tree<double>* tree = new p2h::Ball_Tree<double>(N, D, leaf, XX);
// p2h::BC_Tree<double>* tree = new p2h::BC_Tree<double>(N, D, leaf, XX);
tree->display();
p2h::MinK_List* list = new p2h::MinK_List(1);
tree->nns(1, cand, 1, X_test[i], list);

The cand parameter of nns() has a large influence on the result. For example, when the tree is built with N = 20000 and I set cand = int(0.2 * N) = 4000, the result is wrong whenever the correct answer for my test data has index > 4000. I have to enlarge cand from int(0.2 * N) to N, but that would be a brute-force search. So where is my problem?

Sunshine066 commented 1 year ago

I changed the Ball_Tree leaf size to match int(0.2 * N) or int(0.5 * N), which may work around the problem, but the accuracy declines seriously.

HuangQiang commented 1 year ago

Yes, that can happen. The cand parameter is the number of candidates you aim to check in nns(). Basically, if we are only interested in the k = 100 NNs of the query, then cand only needs to be 2 to 5 times k, e.g., in [2k, 5k]. That way nns() stays efficient because it does not check too many candidates. However, if you set this parameter too large, e.g., cand = 0.2*n or n, the efficiency degrades rapidly. You might consider setting this parameter as a multiple of k rather than as a fraction of n.

Sunshine066 commented 1 year ago

But in my tests, for example, I built the ball tree with 20000 x 512 data. If cand (the number of candidates) is 4000 when I run nns(), correctness is 100% only when the test data index is in 0~4000; when the index is > 4000, the accuracy is 0%. So I guess cand must be >= leaf. Based on this, I tested again with leaf = 10000 and cand = 5000, and the accuracy was less than 50%.

HuangQiang commented 1 year ago

I suppose it is not a problem of the cand setting. Basically, the sweet spot for the leaf size is between 20 and 500. In most cases, we do not suggest setting a very large value for leaf.

I suppose the reason is that the ball tree we built is designed for the point-to-hyperplane distance, which might NOT be the distance (e.g., Euclidean distance) you consider in your problem. Thus, when you build a ball tree with more than one leaf, since the lower bound is not correct for your problem, you may descend into the wrong branch, and the accuracy degrades rapidly.

The construction of the ball tree I designed can also support Euclidean distance, but to perform nns() with your own distance, you should carefully modify the est_lower_bound function in ball_node.h and the center preference in Lines 153-162 of ball_node.h (there, you should descend into the branch whose center is closer to the query under your distance, not under the inner product) when traversing the ball tree from root to leaves.

I hope my suggestions are useful, but directly solving your problem is out of the scope of this repo, and I do not have much free time for unrelated problems. I wish you great success!

Sunshine066 commented 1 year ago

Thank you very much for your suggestions. I hope my questions have not taken too much of your time; I am grateful that you answered them. Thank you!