facebookresearch / pysparnn

Approximate Nearest Neighbor Search for Sparse Data in Python!
Other
915 stars 146 forks source link

refactor cluster selection so it can be replaced, add dbscan based clustering #28

Open kchaliki opened 4 years ago

kchaliki commented 4 years ago

Sending this here in case it is useful for anyone else, I found that for my use case of searching through a large body of music tracks (converted to 3-grams), trying to find the same track in different parts of the library I was getting bad results even with k_clusters=500 (and the obvious speed disadvantage). This change seems to make it behave as good as the greedy knn in scikit with k_clusters=5 hence many orders of magnitude faster.

facebook-github-bot commented 4 years ago

Hi kchaliki! Thank you for your pull request and welcome to our community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file.In order for us to review and merge your code, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

If you have received this in error or have any questions, please contact us at cla@fb.com. Thanks!

facebook-github-bot commented 4 years ago

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Facebook open source project. Thanks!

spencebeecher commented 4 years ago

This is great! I like the architecture change to the code.

One thing I would like to add is an example in one of the ipython notebooks - specifically this one - https://github.com/facebookresearch/pysparnn/blob/master/examples/sparse_search_comparison.ipynb

The other question that I have - is this only intended to be used with ClusterIndex and not MultiClusterIndex?

kchaliki commented 4 years ago

Thanks for looking at this @spencebeecher - to be honest I didn't really look at what MultiClusterIndex is about in a lot of detail, do you think it should also be used there? I will definitely add to the notebook in the coming days.

kchaliki commented 4 years ago

So it looks like LSHForest which you are comparing against in the notebook has been deprecated and in fact removed in later versions of sklearn in favour of other ANN libraries like this one :joy: - discussion is here

I tried going back to the original requirements versions in the package but can no longer get numpy 1.11.2 to pip install due to header changes in Ubuntu. I can most likely get it to work by juggling versions but perhaps it doesn't make sense to compare to LSHForest any more anyway?

kchaliki commented 4 years ago

Hey @spencebeecher - have you decided what you want to do about the notebook comparisons?