Open FrankPortman opened 2 years ago
Hmm right, the problem is that the behaviour is different between the Python Kmeans
object and the C++ Clustering
object (that does not automatically set the INNER_PRODUCT index).
Since I think it would be quite disruptive to change the behavior, we should indeed make this very clear in the docstrings and wiki. Marking this as an enhancement, feel free to submit a PR if you wish.
When using spherical=True during Kmeans clustering, the centroids are normalized after each iteration. The documentation can be found here: https://github.com/facebookresearch/faiss/wiki/Faiss-building-blocks:-clustering,-PCA,-quantization
The same page also says:
However, under the hood KMeans with spherical=True we see it setting an InnerProduct index in the Python bindings: https://github.com/facebookresearch/faiss/blob/1e4586a5a0d7ceb8a538c4878d08b0623a676a08/faiss/python/__init__.py#L1661
and later maximizing the sum of inner products (where the code correctly handles that a higher inner product is better) here: https://github.com/facebookresearch/faiss/blob/26abede81245800c026bd96a386c51430c1f4170/faiss/Clustering.cpp#L537
To summarize the behavior, if you set spherical=True the objective that faiss Kmeans is optimizing and printing out is the sum of inner products of vectors to centroids (would be a sum of cosine sims if the vectors themselves were normed too).
I think Spherical KMeans is potentially synonymous with clustering via cosine/dot similarity as the distance metric, but it is ambiguous enough within this library that several in my org have been tripped up by this. In particular the objective going up for spherical=True could stand to be more clear in documentation.
Would you be open to some documentation and/or code (perhaps printing something?) changes to make this more clear? If you give me a few pointers on how you might prefer this to be cleared up I can make the change and put up a PR.