Open Sunshine066 opened 1 year ago
Thanks for your interest!
nns() only returns the number of candidates checked for the query; it is not the list of NNS indices. In your case, the NN indices are stored in the list. You can look at methods/pri_queue.h and use the function ith_id() to get the indices one by one.
Please let me know if you run into other issues. Thanks!
int leaf = 40;
p2h::Ball_Tree<double>* tree = new p2h::Ball_Tree<double>(N, D, leaf, XX);
tree->display();
p2h::MinK_List* list = new p2h::MinK_List(10);
std::cout << " nns " << tree->nns(10, 10, 1, X_test[0], list) << std::endl;
std::cout << list->max_key() << std::endl;
std::cout << list->min_key() << std::endl;
std::cout << list->ith_id(1) << std::endl;
I used the function ith_id(), but it did not return the value I was expecting: the correct nearest index for my X_test[0] is the fifth. The input of ith_id() is an int. What should I pass to ith_id()? How should I use ith_id() to get the nearest index of X_test[0] (4 if indexing starts at 0, or 5 if it starts at 1)?
The problem I want to solve: my dataset has 10 vectors with dimension 512. I want to build a ball tree and then use it to query with a vector and get that vector's index (0~9, or 1~10).
First of all, the nns() of the ball tree returns the approximate results, not the exact ones.
The input parameter (i.e., int) of ith_id() starts from 0, which means that you should call ith_id(0) if you want to find the nearest neighbor that the ball tree found; similarly, you can call ith_id(1) if you aim to find the 2nd nearest neighbor that the ball tree found. The valid range is [0, k-1].
Moreover, the returned value starts from 0, and the returned range is [0, n-1].
Besides, max_key() and min_key() mean the largest and smallest distances the ball tree found. ith_key(i) returns the (i+1)-th nearest distance.
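To illustrate these semantics, here is a minimal standalone sketch of a bounded result list with the same ith_id()/ith_key()/min_key()/max_key() interface described above. This is not the repo's actual MinK_List (that lives in methods/pri_queue.h and may be implemented differently); it only mirrors the described behavior.

```cpp
#include <algorithm>
#include <vector>

// Sketch of a bounded k-NN result list. Interface names mirror the ones
// discussed above; the repo's real MinK_List may differ internally.
class MinKListSketch {
public:
    explicit MinKListSketch(int k) : k_(k) {}

    void insert(float key, int id) {      // key = distance, id = point index
        items_.push_back({key, id});
        std::sort(items_.begin(), items_.end());
        if ((int)items_.size() > k_) items_.pop_back();  // keep k smallest
    }
    int   ith_id(int i)  const { return items_[i].id; }   // i in [0, k-1]
    float ith_key(int i) const { return items_[i].key; }  // (i+1)-th NN distance
    float min_key() const { return items_.front().key; }  // nearest distance
    float max_key() const { return items_.back().key; }   // k-th nearest distance

private:
    struct Item {
        float key; int id;
        bool operator<(const Item& o) const { return key < o.key; }
    };
    int k_;
    std::vector<Item> items_;
};
```

After a search has inserted its candidates, ith_id(0) is the index of the nearest neighbor found and ith_id(1) the second nearest, exactly as described above.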
BTW, I am not clear about the dimension information of your test queries. For the problem of P2HNNS, please ensure that the query is one dimension higher than the data. Thanks!
For the problem of P2HNNS, please ensure that the query should be one dimension higher than the data
Why??? The data used to create the ball tree is 10 vectors of 512 dimensions; should the query data then be 1 vector of 513 dimensions?
Now, in my code:
XX is a double* (size N*D); it is used to create the ball tree.
X_test is a double* of size 1*D.
N is the number of vectors and D is the dimension.
The tree is created with the p2h::Ball_Tree constructor shown above.
Before answering your questions, I suppose I should let you know that this repo targets the problem of Point-to-Hyperplane Nearest Neighbor Search (P2HNNS). For P2HNNS, the input datasets are data points in $R^{D-1}$ space. In this repo, I add one more dimension after loading the data, so in ap2h.h the datasets also contain $D$ dimensions. Different from the traditional point-to-point NNS problem, the queries are NOT data points but hyperplane queries, which are in $R^D$ space. Moreover, the distance is not the commonly used Euclidean distance but the point-to-hyperplane distance.
So for your issue, you might first check that the distance function we use is the one in your considered case; otherwise, you indeed cannot get the correct results. Second, you might check that the query (X_test) is a hyperplane query, not a direct data point. Third, nns() only accepts float* for the hyperplane query, so you might need to change double* to float*.
Besides, although this is not your test case, I also want to clarify that this repo provides approximate (not exact) methods for the problem of P2HNNS. Even if you did everything correctly, the ball tree we provide does not check all branches and allows an approximation factor, so it only returns approximate results. Thus, it is reasonable if you cannot get the exact results.
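For reference, the point-to-hyperplane distance mentioned above can be sketched as follows. Here a hyperplane query q has D = d+1 coefficients (d normal coordinates plus an offset term), which is why the query is one dimension higher than the data; the exact storage convention in this repo may differ.

```cpp
#include <cmath>

// Distance from point x (d coordinates) to the hyperplane
// q[0]*x[0] + ... + q[d-1]*x[d-1] + q[d] = 0, where q has d+1 coefficients.
// This is the standard |<q, x> + b| / ||q|| formula; the repo's own
// convention for where the offset is stored may differ.
float p2h_dist(const float* q, const float* x, int d) {
    float ip = q[d];          // start with the offset term b
    float norm_sq = 0.0f;     // squared norm of the hyperplane normal
    for (int i = 0; i < d; ++i) {
        ip += q[i] * x[i];
        norm_sq += q[i] * q[i];
    }
    return std::fabs(ip) / std::sqrt(norm_sq);
}
```

For example, with d = 2 and q = {1, 0, -1} (the plane x0 = 1), the point (3, 0) is at distance 2 from the hyperplane.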
Thank you very much for your reply. I have realized where my problem lies. I wish you success in your work, life, and study.
Ha ha, following your hints, I just modified the distance method, and your method is now adapted to my point-to-point problem, using your code architecture. Compared with the ball tree method I used before: ball tree construction time, 91 ms vs 1140 ms; nns time, <1 ms vs 23 ms. Your method wins. Thank you!
Hello, referring to your project, I have implemented the functionality I wanted, thank you. I would like to ask: based on your understanding of the ball tree, can it support incremental learning? My understanding is that for incremental learning, if I add one piece of data, I only need to add it to the corresponding leaf node and update the node's counts; it should not be necessary to reconstruct the ball tree from the full (N+1)*D data.
Thank you for your assessments and your appreciation.
For the point-to-point problem you aim to deal with, you might also need to carefully set up the new lower bound (or upper bound) if you consider the approximate case, as the current lower bound I used in ball_node.h is designed for the P2H distance.
For your follow-up question, my suggestion is that the ball tree can support incremental updating (or learning) if the data distribution does not change too much; otherwise, the ball tree might not be the best choice because the pivot points we have chosen for data splitting when building the ball tree will become less effective, and the ball tree might be extremely imbalanced. As such, the nns() will also become less efficient.
An alternative option is that you apply the ball tree for incremental updating (or learning) for only a short period (e.g., one week or 10 days). Then, you might reconstruct the ball tree for the whole dataset (offline) so that we could choose more promising pivots for data splitting.
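The buffer-then-rebuild idea above could be sketched like this. IncrementalIndex, the rebuild fraction, and the rebuild hook are all hypothetical names for illustration, not part of this repo.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical wrapper: absorb single insertions into a side buffer and
// trigger a full offline rebuild once the buffer grows past a fraction of
// the indexed data, so fresh pivots can be chosen for data splitting.
class IncrementalIndex {
public:
    explicit IncrementalIndex(std::size_t n_indexed, double rebuild_frac = 0.1)
        : n_indexed_(n_indexed), rebuild_frac_(rebuild_frac) {}

    // Returns true when a full offline rebuild was triggered.
    bool insert(const std::vector<float>& point) {
        buffer_.push_back(point);   // would also go into a leaf of the live tree
        if (buffer_.size() > rebuild_frac_ * n_indexed_) {
            n_indexed_ += buffer_.size();
            buffer_.clear();        // here you would reconstruct the ball tree
            ++rebuild_count_;
            return true;
        }
        return false;
    }
    int rebuild_count() const { return rebuild_count_; }

private:
    std::size_t n_indexed_;
    double rebuild_frac_;
    std::vector<std::vector<float>> buffer_;
    int rebuild_count_ = 0;
};
```

In practice the trigger could just as well be a wall-clock period (e.g., one week), as suggested above, instead of a buffer-size threshold.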
Thank you very much for your advice. It has clarified my thinking for future projects. Best wishes to you!
Hello, I have a new problem with nns().
p2h::Ball_Tree<double>* tree = new p2h::Ball_Tree<double>(N, D, leaf, XX);
// p2h::BC_Tree<double>* tree = new p2h::BC_Tree<double>(N, D, leaf, XX);
tree->display();
p2h::MinK_List* list = new p2h::MinK_List(1);
tree->nns(1, cand, 1, X_test[i], list);
The candidate parameter cand of nns() has a lot of influence on the result. For example, when N = 20000 at tree construction and I set cand = int(0.2*N) = 4000, a wrong result appears whenever the correct answer for my test data has index > 4000 (0.2 * 20000 = 4000). I would have to enlarge cand all the way to N, but that would be a brute-force search, so where is my problem?
If I set the Ball_Tree leaf to int(N*0.2) or int(N*0.5), the problem may be solved, but the accuracy rate declines seriously.
Yes, that can happen. The parameter cand represents the number of candidates you aim to check in nns(). Basically, if we are only interested in the k = 100 NNs of the query, then cand only needs to be 2 to 5 times k, e.g., in [2k, 5k]. That way nns() stays efficient because it does not need to check too many candidates. However, if you set this parameter too large, e.g., cand = 0.2*n or n, the efficiency will degrade rapidly. You might consider setting this parameter as a multiple of k rather than as a fraction of n.
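That rule of thumb can be written as a one-line helper; choose_cand is a hypothetical name, and the default factor of 5 (the top of the suggested [2k, 5k] range) and the clamp to n are assumptions.

```cpp
#include <algorithm>

// Hypothetical helper: pick the candidate budget for nns() as a small
// multiple of k rather than a fraction of n, clamped to the dataset size.
int choose_cand(int k, int n, int factor = 5) {
    return std::min(n, factor * k);
}
```

For instance, with k = 100 and n = 20000 this yields 500 candidates, far fewer than the 4000 of cand = 0.2*n.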
But in my tests, for example, the data used to build the ball tree was 20,000 * 512. If int cand (the number of candidates) is 4000 when I run nns(), correctness is guaranteed only when the test data index is in 0~4000; when the test data index is > 4000, the accuracy is 0%. So I guess cand must be >= leaf. Based on this, I tested again with leaf = 10,000 and cand = 5000, and the accuracy was less than 50%.
I suppose it is not a problem of the cand setting. Basically, the sweet spot for the leaf size is between 20 and 500; in most cases, we do not suggest setting a very large value for leaf.
I suppose the reason is that the ball tree we built is designed for the point-to-hyperplane distance, which might NOT be the distance (e.g., Euclidean distance) you consider for your problem. Thus, when you build a ball tree with more than one leaf, since the lower bound is not correct for your problem, you might choose a wrong branch, and the accuracy degrades rapidly.
The ball tree construction I currently designed can also support Euclidean distance, but if you perform nns() with your own distance, you should carefully modify the est_lower_bound function in ball_node.h, as well as the center preference in Lines 153-162 of ball_node.h (there, you might descend into the branch whose center is closer to the query under your own distance, not under the inner product) when traversing the ball tree from root to leaves.
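For the Euclidean case, the lower bound for a ball node is the classic one below. The name echoes the est_lower_bound function mentioned above, but the body is a generic sketch, not the repo's implementation.

```cpp
#include <algorithm>
#include <cmath>

// Generic Euclidean lower bound for a ball node: any point inside a ball
// with the given center and radius is at least ||q - c|| - r away from the
// query q, clamped at 0 when the query lies inside the ball.
float est_lower_bound_euclid(const float* query, const float* center,
                             float radius, int d) {
    float dist_sq = 0.0f;
    for (int i = 0; i < d; ++i) {
        float diff = query[i] - center[i];
        dist_sq += diff * diff;
    }
    return std::max(0.0f, std::sqrt(dist_sq) - radius);
}
```

With a correct lower bound like this, a branch can be safely pruned whenever the bound already exceeds the current k-th nearest distance in the result list.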
I hope my suggestions are useful for you, but directly solving your problem is out of the scope of this repo, and I do not have much free time for this indirect problem. I wish you great success!
Thank you very much for your suggestions. I hope my questions have not affected your work. I am grateful that you answered them, thank you.
Excuse me, is there no query function in ball_tree? How do I get a query's nearest-neighbor vector index? This is my code: XX is a double* (size N*D), X_test is a double**, N is 10, and D is 512.
How do I get X_test[0]'s nearest neighbor index? Thanks!