Closed wangyuran closed 7 years ago
Hello @wangyuran - This is a great request - thanks for the feedback!
The requested change is now landed. Please let me know if there is anything else! https://github.com/facebookresearch/pysparnn/commit/b726a229b2374724ceeddce6ea124c18b1bd8d15
I'll leave this request open for another week, then close it if I don't hear back.
@spencebeecher, for the index in the example, the doc_index covers the whole dataset; shouldn't it cover only the training dataset?
@spencebeecher - thank you very much. The output is k neighbors now. However, there is a new issue: the top-1 NN accuracy in my case drops by more than half (68% to 28%). Since the change only affects the stopping criteria, I am not sure why this happens.
@wangyuran - Oh no! Can you increase the number of k_clusters you search as a quick patch? If you send me a notebook / script I can try to debug it with you. I'll look at this more later tonight. You can always go back one revision in GitHub to get the old behavior. Let me know what you discover!
Adding info - I just re-ran this example - https://github.com/facebookresearch/pysparnn/blob/master/examples/sparse_search_comparison.ipynb
I get very similar results to before (~60% recall for pysparnn)
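For anyone who wants to reproduce this kind of number outside the notebook, here is a minimal sketch of a recall@1 check: compare an index's top-1 answers against an exact brute-force cosine search. The data is random and the "approximate index" is a stand-in (it just searches a random half of the corpus), purely for illustration; in practice you would plug in the pysparnn search results instead.

```python
import numpy as np
from scipy.sparse import random as sparse_random

def normalize_rows(m):
    """L2-normalize rows of a sparse matrix so that dot product = cosine."""
    norms = np.sqrt(np.asarray(m.multiply(m).sum(axis=1))).ravel()
    norms[norms == 0] = 1.0  # guard against all-zero rows
    return m.multiply(1.0 / norms[:, None]).tocsr()

# Made-up sparse data, just to have something to search.
train = normalize_rows(sparse_random(1000, 500, density=0.02,
                                     format="csr", random_state=0))
queries = normalize_rows(sparse_random(100, 500, density=0.02,
                                       format="csr", random_state=1))

# Exact brute-force cosine: ground truth top-1 per query.
sims = (queries @ train.T).toarray()
exact_top1 = sims.argmax(axis=1)

# Stand-in for an approximate index: only consider a random half of the
# corpus. Replace approx_top1 with your index's answers in practice.
rng = np.random.default_rng(2)
subset = np.sort(rng.choice(train.shape[0], train.shape[0] // 2,
                            replace=False))
approx_top1 = subset[sims[:, subset].argmax(axis=1)]

recall = float(np.mean(approx_top1 == exact_top1))
print(f"recall@1 = {recall:.2f}")
```

Comparing against exact cosine like this is what makes a number like "~60% recall" well-defined: it is the fraction of queries where the approximate top-1 matches the brute-force top-1.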
Hi Spence, I tried rerunning with the earlier version (Mar 19), and it did give similar results. Maybe on my system the results are just not very stable: over a few runs the values fluctuate quite a bit (5% to 68%), but most are around 30%.
Anyway, thanks for fixing the issue.
Thanks, Yuran
Hi Yuran - to improve recall you might try MultiClusterIndex(num_indexes=2) or 3. You can also increase the k_clusters parameter for more recall. I would also check whether the k results you get back still have reasonable distances. It could be that the 2nd-best item still isn't so far off.
Finally - if your space is very, very sparse you can try bumping the matrix_size param when creating the indexes. Note - increasing this param makes shallower trees and eventually becomes brute-force search.
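To build intuition for why num_indexes helps, here is a toy cluster-pruning sketch (this is not pysparnn's actual code; the data, leader counts, and seeds are all made up). Each index picks random cluster "leaders" and assigns every point to its nearest leader; a query only searches the clusters of its k_clusters nearest leaders, so it can miss the true neighbor. Merging candidates from several independently built indexes raises the chance the true neighbor lands in at least one searched cluster:

```python
import numpy as np

rng = np.random.default_rng(7)
data = rng.normal(size=(2000, 32))
data /= np.linalg.norm(data, axis=1, keepdims=True)  # unit vectors: cosine = dot
query = rng.normal(size=32)
query /= np.linalg.norm(query)
true_nn = int((data @ query).argmax())               # exact cosine NN

def build_index(points, n_leaders, seed):
    """Toy cluster-pruning index: pick random leader points and assign
    every point to its nearest leader."""
    r = np.random.default_rng(seed)
    leaders = points[r.choice(len(points), n_leaders, replace=False)]
    return leaders, (points @ leaders.T).argmax(axis=1)

def search(points, index, q, k_clusters):
    """Search only the clusters of the k_clusters leaders nearest to q."""
    leaders, assign = index
    top = np.argsort(leaders @ q)[-k_clusters:]
    cand = np.flatnonzero(np.isin(assign, top))
    return int(cand[(points[cand] @ q).argmax()])

recalls = {}
for num_indexes in (1, 4):
    found = 0
    for trial in range(50):
        # Merge candidates from num_indexes independently built indexes
        # and keep the best-scoring one.
        results = [search(data,
                          build_index(data, 100, seed=trial * 10 + i),
                          query, k_clusters=1)
                   for i in range(num_indexes)]
        best = max(results, key=lambda j: float(data[j] @ query))
        found += (best == true_nn)
    recalls[num_indexes] = found / 50
print(recalls)
```

Because the 4-index run reuses the 1-index run's seed plus three more, its candidate set is a superset, so its recall can only be equal or higher. The same intuition applies to increasing k_clusters: both widen the searched region at the cost of more distance computations.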
Hi Spence, Thanks a lot for the suggestions. I started with num_indexes=2. The results only become comparable to a cosine brute-force method when num_indexes is about 10.
Thanks, Yuran
I use the MultiClusterIndex class. With the search method, I only changed the parameter k to 10, but fewer than 10 nearest neighbors are found: only 80% of the examples returned 10 NNs. What can I adjust to make sure I get 10 NNs in all cases?
Another question: in the examples, the doc_index covers the whole dataset. Shouldn't it cover only the training dataset?