facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License
31.58k stars 3.65k forks source link

Potentially confusing behavior/documentation when using Kmeans clustering with spherical=True #2466

Open FrankPortman opened 2 years ago

FrankPortman commented 2 years ago

When using spherical=True during Kmeans clustering, the centroids are normalized after each iteration. The documentation can be found here: https://github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization

spherical: perform spherical k-means -- the centroids are L2 normalized after each iteration

The same page also says:

The values of the objective function (total square error in case of k-means) along iterations is stored in the variable kmeans.obj, and more extensive statistics in kmeans.iteration_stats.

However, under the hood KMeans with spherical=True we see it setting an InnerProduct index in the Python bindings: https://github.com/facebookresearch/faiss/blob/1e4586a5a0d7ceb8a538c4878d08b0623a676a08/faiss/python/__init__.py#L1661

and later maximizing the sum of inner products (where the code correctly handles that a higher inner product is better) here: https://github.com/facebookresearch/faiss/blob/26abede81245800c026bd96a386c51430c1f4170/faiss/Clustering.cpp#L537

To summarize the behavior, if you set spherical=True the objective that faiss Kmeans is optimizing and printing out is the sum of inner products of vectors to centroids (would be a sum of cosine sims if the vectors themselves were normed too).

I think Spherical KMeans is potentially synonymous with clustering via cosine/dot similarity as the distance metric, but it is ambiguous enough within this library that several in my org have been tripped up by this. In particular the objective going up for spherical=True could stand to be more clear in documentation.

Would you be open to some documentation and/or code (perhaps printing something?) changes to make this more clear? If you give me a few pointers on how you might prefer this to be cleared up I can make the change and put up a PR.

mdouze commented 2 years ago

Hmm right, the problem is that the behaviour is different between the Python Kmeans object and the C++ Clustering object (that does not automatically set the INNER_PRODUCT index). Since I think it would be quite disruptive to change the behavior, we should indeed make this very clear in the docstrings and wiki. Marking this as an enhancement, feel free to submit a PR if you wish.