Closed nickstr15 closed 3 years ago
The inertia is the sum of square distances to centroids? This is the k-means clustering objective, so it should become smaller with more clusters. If this does not happen, i suspect there is some bug in the calling code. To debug, you could monitor the clustering error (see Kmeans iteration_stats).
I found my mistake. I had to increase the parameter max_points_per_centroid
. As I have so many data points it sampled a sub-batch for the fit. For a larger number of clusters this sub-batch is larger. See FAQ:
(n > max_points_per_centroid * k: there are too many points, making k-means unnecessarily slow. Then the training set is sampled.)
The larger subbatch of course has a larger inertia as there are more points in total.
Running on:
Interface:
I am trying to use the KMeans clustering from faiss on a human pose dataset of body joints. I have 16 body parts so a dimension of 32. The joints are scaled in a range between 0 and 1. My dataset consists of ~ 900.000 instances. As mentioned by faiss (faiss_FAQ):
Applying this to my problem I randomly select 50000 instances for training. As I want to check for a number of clusters k between 1 and 30.
Now to my "problem":
The inertia is increasing directly as the number of cluster increases (n_cluster on the x-axis):
I tried out varying the number of iterations, number of redos, verbose and spherical, but the results stay the same or get worse. I do not think that it is a problem of my implementation (I tested it on a small example with 2D data and very clear clusters and it worked.
Does somebody is familier with this problem and maybe knows what the problem is? Is it that the data is just bad clustered or is there another problem/mistake I miss? Maybe the sclaing of the vlaues between 0 and 1? Should I try an other approach?