facebookresearch / pysparnn

Approximate Nearest Neighbor Search for Sparse Data in Python!

Improve top-level item selection with k-means++ initialisation #24

Open lsorber opened 5 years ago

lsorber commented 5 years ago

Currently, the top-level items are selected randomly [1], and are then pruned by removing items that are too similar [2]. This procedure might result in suboptimal top-level items being selected.

One trick for selecting cluster centroids that has worked exceptionally well for k-means is the k-means++ initialisation [3]. The idea is to start with a randomly chosen first centroid, and then to select each subsequent centroid with probability proportional to its squared distance to the nearest already selected centroid.
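As a rough illustration, a k-means++-style seeding step for sparse feature matrices could look like the sketch below. This is not PySparNN code: the function name, the use of cosine distance, and the scipy/scikit-learn helpers are all assumptions made for the example.

```python
# Hypothetical sketch of k-means++ seeding over sparse vectors.
# Not PySparNN's API; cosine distance is assumed here because it is
# a common choice for sparse text features.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_distances

def kmeanspp_select(features, k, seed=0):
    """Pick k row indices from a sparse matrix via k-means++ seeding."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    # Start with one uniformly random seed point.
    chosen = [int(rng.integers(n))]
    # Distance of every point to its nearest chosen seed so far.
    d = cosine_distances(features, features[chosen[0]]).ravel()
    for _ in range(k - 1):
        # Sample the next seed with probability proportional to D(x)^2,
        # the squared distance to the nearest already selected seed.
        probs = d ** 2
        probs /= probs.sum()
        nxt = int(rng.choice(n, p=probs))
        chosen.append(nxt)
        # Update nearest-seed distances with the newly chosen point.
        d = np.minimum(d, cosine_distances(features, features[nxt]).ravel())
    return chosen

# Usage: 5 mutually orthogonal sparse points; pick 3 seeds.
X = csr_matrix(np.eye(5))
seeds = kmeanspp_select(X, 3)
```

Because already-selected points have distance zero to themselves, they get zero sampling probability, so the returned indices are always distinct.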

An implementation of this algorithm should be relatively straightforward, with potentially large benefits. Would there be any interest in adding such an initialisation to PySparNN?

[1] https://github.com/facebookresearch/pysparnn/blob/master/pysparnn/cluster_index.py#L127
[2] https://github.com/facebookresearch/pysparnn/blob/master/pysparnn/cluster_index.py#L135
[3] https://en.wikipedia.org/wiki/K-means%2B%2B#Improved_initialization_algorithm

kchaliki commented 4 years ago

@lsorber does this help your use case?

https://github.com/facebookresearch/pysparnn/pull/28

I realize it's not the same approach but hopefully achieves similar results in practice.

spencebeecher commented 4 years ago

@lsorber - apologies for missing this. If you still have interest in submitting a PR, I would happily integrate it. I think @kchaliki has a very good alternative, which I plan on getting integrated over the next few days.