ll931110 opened this issue 1 year ago
Interesting. Our findings are that kmeans++ has the following properties:
I could try to send a PR if it is of interest.
It's true that kmeans++ is a feature request we hear often. To minimize the effort, you could implement it in Python outside Faiss as a first stage, before moving to C++ within Faiss. This way we can assess whether it is worth the effort.
My prototype in Python seems straightforward (for both the regular K-means++ and fast K-means++), and the performance is as follows:
n = 20000 (number of points)
D = 300 (dimension of the points)
K = 100 (number of clusters)
m = 30 (dimension of the points after projecting to the lower-dimensional subspace)
with vanilla K-means++: 30 s
with fast K-means++: 20 s
Implementation (I haven't rigorously checked it for correctness): https://gist.github.com/ll931110/2b558dd334be68a7382e88b297a64e5f
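For reference, here is a minimal sketch of the projected k-means++ seeding idea (an illustrative reconstruction, not the gist's code; it assumes a Gaussian random projection and standard D²-sampling in the projected space):

```python
import numpy as np

def fast_kmeanspp_seeds(X, k, m, rng=None):
    """Pick k seed indices with k-means++ (D^2) sampling, run in an
    m-dimensional random projection of X (n x D). Sketch only."""
    rng = np.random.default_rng(rng)
    n, D = X.shape
    # Gaussian random projection to m dimensions (JL-style scaling).
    P = rng.standard_normal((D, m)) / np.sqrt(m)
    Y = X @ P
    seeds = [int(rng.integers(n))]
    # Squared distance from each projected point to its nearest chosen seed.
    d2 = ((Y - Y[seeds[0]]) ** 2).sum(axis=1)
    for _ in range(1, k):
        # Sample the next seed proportionally to the squared distances.
        s = int(rng.choice(n, p=d2 / d2.sum()))
        seeds.append(s)
        d2 = np.minimum(d2, ((Y - Y[s]) ** 2).sum(axis=1))
    # Index back into the full-dimensional X for the actual centroids.
    return np.asarray(seeds)
```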
Could you:
On the SIFT1M dataset, the improved K-means clustering gets about 20% worse in terms of mean squared error, while the running time is substantially lower:
https://gist.github.com/ll931110/3af3252fe8934caf759047b1593efe2a
| Setting | Final objective (found centroids) | Elapsed (s) |
|---|---|---|
| full dimension, D = 960 | 1030820.2507730512 | 888.4 |
| projected, m = 120 | 1233842.2162609184 | 561.8 |
| projected, m = 30 | 1236227.681807482 | 323.0 |
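For context, the reported objective is presumably the sum of squared distances from each point to its nearest centroid; here is a minimal way to compute it (a hypothetical helper, chunked to bound memory on datasets of this size):

```python
import numpy as np

def kmeans_objective(X, centroids, chunk=4096):
    """Sum of squared L2 distances from each point in X (n x D)
    to its nearest centroid (k x D), computed chunk by chunk."""
    total = 0.0
    c2 = (centroids ** 2).sum(axis=1)  # squared norms of centroids
    for i in range(0, len(X), chunk):
        Xc = X[i:i + chunk]
        # Squared distances via ||x||^2 - 2 x.c + ||c||^2.
        d2 = (Xc ** 2).sum(axis=1, keepdims=True) - 2.0 * (Xc @ centroids.T) + c2
        total += np.maximum(d2, 0.0).min(axis=1).sum()
    return total
```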
@mdouze Any thoughts regarding this?
Summary
https://arxiv.org/abs/2211.15118 proposes a faster K-means++ algorithm via random projection onto a lower-dimensional subspace. Should we consider integrating this algorithm into Faiss?
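If it proves worthwhile, one plausible integration path on the Python side would be to use the fast seeding only for initialization and let Faiss do the refinement (hypothetical glue code; it assumes the init_centroids argument of faiss.Kmeans.train, available in recent Faiss versions, and reuses the fast_kmeanspp_seeds sketch above):

```python
import numpy as np
import faiss  # assumes a recent Faiss build

# Toy data matching the sizes reported earlier in this thread.
rng = np.random.default_rng(0)
X = rng.random((20000, 300), dtype=np.float32)
k, m = 100, 30

# Seed in the projected space, then refine at full dimension with Faiss.
seeds = fast_kmeanspp_seeds(X, k, m, rng=0)
km = faiss.Kmeans(X.shape[1], k, niter=20, verbose=True)
km.train(X, init_centroids=X[seeds])  # bypass Faiss's random init
print("final objective:", km.obj[-1])
```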