ll931110 opened this issue 1 year ago
Interesting. Our findings are that kmeans++ has the following properties:
I could try to send a PR if it is of interest.
It's true that kmeans++ is a feature request we hear often. To minimize the effort, you could implement it in Python outside Faiss as a first stage, before moving to C++ within Faiss. This way we can assess whether it is worth the effort.
My prototype in Python seems straightforward (for both the regular K-means++ and fast K-means++), and the performance is as follows:
n = 20000 (number of points)
D = 300 (dimension of the points)
K = 100 (number of clusters)
m = 30 (dimension of the points after projecting to the lower-dimensional subspace)
with vanilla K-means++: 30 s
with fast K-means++: 20 s
Implementation (I haven't rigorously checked it for correctness): https://gist.github.com/ll931110/2b558dd334be68a7382e88b297a64e5f
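For reference, here is a minimal sketch of the projected k-means++ seeding idea (an illustrative reconstruction, not the gist's code; it assumes a Gaussian random projection and standard D²-sampling in the projected space):

```python
import numpy as np

def fast_kmeanspp_seeds(X, k, m, rng=None):
    """Pick k seed indices with k-means++ (D^2) sampling, run in an
    m-dimensional random projection of X (n x D). Sketch only."""
    rng = np.random.default_rng(rng)
    n, D = X.shape
    # Gaussian random projection to m dimensions (JL-style scaling).
    P = rng.standard_normal((D, m)) / np.sqrt(m)
    Y = X @ P
    seeds = [int(rng.integers(n))]
    # Squared distance from each projected point to its nearest chosen seed.
    d2 = ((Y - Y[seeds[0]]) ** 2).sum(axis=1)
    for _ in range(1, k):
        # Sample the next seed proportionally to the squared distances.
        s = int(rng.choice(n, p=d2 / d2.sum()))
        seeds.append(s)
        d2 = np.minimum(d2, ((Y - Y[s]) ** 2).sum(axis=1))
    # Index back into the full-dimensional X for the actual centroids.
    return np.asarray(seeds)
```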
Could you:
On the SIFT1M dataset, the improved K-means clustering gets about 20% worse in terms of mean squared error, while the running time is substantially lower:
https://gist.github.com/ll931110/3af3252fe8934caf759047b1593efe2a
| Setting | Final objective (found centroids) | Elapsed (s) |
|---|---|---|
| full dimension, D = 960 | 1030820.2507730512 | 888.4 |
| projected, m = 120 | 1233842.2162609184 | 561.8 |
| projected, m = 30 | 1236227.681807482 | 323.0 |
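For context, the reported objective is presumably the sum of squared distances from each point to its nearest centroid; here is a minimal way to compute it (a hypothetical helper, chunked to bound memory on datasets of this size):

```python
import numpy as np

def kmeans_objective(X, centroids, chunk=4096):
    """Sum of squared L2 distances from each point in X (n x D)
    to its nearest centroid (k x D), computed chunk by chunk."""
    total = 0.0
    c2 = (centroids ** 2).sum(axis=1)  # squared norms of centroids
    for i in range(0, len(X), chunk):
        Xc = X[i:i + chunk]
        # Squared distances via ||x||^2 - 2 x.c + ||c||^2.
        d2 = (Xc ** 2).sum(axis=1, keepdims=True) - 2.0 * (Xc @ centroids.T) + c2
        total += np.maximum(d2, 0.0).min(axis=1).sum()
    return total
```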
@mdouze Any thoughts regarding this?
Summary
https://arxiv.org/abs/2211.15118 proposes a faster K-means++ algorithm via random projection onto a lower-dimensional subspace. Should we consider integrating this algorithm into Faiss?
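If it proves worthwhile, one plausible integration path on the Python side would be to use the fast seeding only for initialization and let Faiss do the refinement (hypothetical glue code; it assumes the init_centroids argument of faiss.Kmeans.train, available in recent Faiss versions, and reuses the fast_kmeanspp_seeds sketch above):

```python
import numpy as np
import faiss  # assumes a recent Faiss build

# Toy data matching the sizes reported earlier in this thread.
rng = np.random.default_rng(0)
X = rng.random((20000, 300), dtype=np.float32)
k, m = 100, 30

# Seed in the projected space, then refine at full dimension with Faiss.
seeds = fast_kmeanspp_seeds(X, k, m, rng=0)
km = faiss.Kmeans(X.shape[1], k, niter=20, verbose=True)
km.train(X, init_centroids=X[seeds])  # bypass Faiss's random init
print("final objective:", km.obj[-1])
```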