facebookresearch / faiss

A library for efficient similarity search and clustering of dense vectors.
https://faiss.ai
MIT License
31.78k stars 3.66k forks source link

Question: kmean - inertia is not decreasing #1887

Closed nickstr15 closed 3 years ago

nickstr15 commented 3 years ago

Running on:

Interface:

I am trying to use the KMeans clustering from faiss on a human pose dataset of body joints. I have 16 body parts so a dimension of 32. The joints are scaled in a range between 0 and 1. My dataset consists of ~ 900.000 instances. As mentioned by faiss (faiss_FAQ):

As a rule of thumb there is no consistent improvement of the k-means quantizer beyond 20 iterations and 1000 * k training points

Applying this to my problem I randomly select 50000 instances for training. As I want to check for a number of clusters k between 1 and 30.

Now to my "problem":

The inertia is increasing directly as the number of cluster increases (n_cluster on the x-axis):

grafik

I tried out varying the number of iterations, number of redos, verbose and spherical, but the results stay the same or get worse. I do not think that it is a problem of my implementation (I tested it on a small example with 2D data and very clear clusters and it worked.

Does somebody is familier with this problem and maybe knows what the problem is? Is it that the data is just bad clustered or is there another problem/mistake I miss? Maybe the sclaing of the vlaues between 0 and 1? Should I try an other approach?

mdouze commented 3 years ago

The inertia is the sum of square distances to centroids? This is the k-means clustering objective, so it should become smaller with more clusters. If this does not happen, i suspect there is some bug in the calling code. To debug, you could monitor the clustering error (see Kmeans iteration_stats).

nickstr15 commented 3 years ago

I found my mistake. I had to increase the parameter max_points_per_centroid. As I have so many data points it sampled a sub-batch for the fit. For a larger number of clusters this sub-batch is larger. See FAQ:

(n > max_points_per_centroid * k: there are too many points, making k-means unnecessarily slow. Then the training set is sampled.)

The larger subbatch of course has a larger inertia as there are more points in total.