Closed: annoviko closed this issue 6 years ago
The number of clusters I get from pyclustering.cluster.xmeans.xmeans.get_centers is always equal to the value of kmax. I checked this by collecting the clusters while iterating kmax over a range of values.
Thanks
@himanshu94,
Formally, it means that the cluster splitting process was stopped when kmax was reached.
Have you tried increasing kmax? Could you please provide a data sample and a code example that reproduces the issue?
Here is an example of xmeans usage where the number of allocated clusters is less than kmax:
from pyclustering.cluster.xmeans import xmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.utils import read_sample
from pyclustering.samples.definitions import SIMPLE_SAMPLES
# Read dataset 'SAMPLE_SIMPLE2'
sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE2)
# Start from 3 centers chosen by k-means++; kmax keeps its default value
initial_centers = kmeans_plusplus_initializer(sample, 3).initialize()
# Use Python implementation (ccore passed explicitly so the default does not matter)
xmeans_instance = xmeans(sample, initial_centers, ccore=False)
xmeans_instance.process()
clusters = xmeans_instance.get_clusters()
# Display allocated clusters
print(clusters)
# Use C/C++ implementation
xmeans_instance = xmeans(sample, initial_centers, ccore=True)
xmeans_instance.process()
clusters = xmeans_instance.get_clusters()
# Display allocated clusters
print(clusters)
Output:
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [15, 16, 17, 18, 19, 20, 21, 22], [10, 11, 12, 13, 14]]
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [15, 16, 17, 18, 19, 20, 21, 22], [10, 11, 12, 13, 14]]
Here is an example of clustering that started from 2 clusters and finished with 4 clusters, with kmax = 20.
Actually, I can't share the data. However, I ran K-Means and evaluated its clusters by silhouette value over different iterations, so I can say that at some point the number of clusters formed should be less than kmax. To verify this, I ran X-Means while iterating kmax over a range of values, but the number of clusters produced was always the same as kmax. When I ran X-Means from the previous version (before updating to 0.7), the number of clusters was not equal to kmax.
Thanks
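For readers who want to repeat the silhouette check described above, here is a minimal plain-Python sketch. It is not the code used in this report; `silhouette_score` is a hypothetical helper name, and clusters are assumed to be lists of point indices, the same shape that `get_clusters()` prints in the example earlier in the thread.

```python
from math import dist
from statistics import mean


def silhouette_score(points, clusters):
    """Mean silhouette over all points; `clusters` is a list of index lists."""
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for own_id, own in enumerate(clusters):
        for i in own:
            if len(own) == 1:
                # Convention: a singleton cluster contributes a score of 0.
                scores.append(0.0)
                continue
            # a: mean distance to the other members of the same cluster.
            a = mean(dist(points[i], points[j]) for j in own if j != i)
            # b: smallest mean distance to the members of any other cluster.
            b = min(
                mean(dist(points[i], points[j]) for j in other)
                for other_id, other in enumerate(clusters)
                if other_id != own_id
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


points = [[0, 0], [0, 1], [10, 10], [10, 11]]
print(silhouette_score(points, [[0, 1], [2, 3]]))  # compact, well-separated grouping
print(silhouette_score(points, [[0, 2], [1, 3]]))  # grouping that mixes the two blobs
```

A score near 1 suggests compact, well-separated clusters, while a negative score suggests points assigned to the wrong cluster. Comparing the score across several cluster counts gives a rough, independent estimate to hold against what X-Means returns.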
I verified it using the old version of pyclustering. With everything else held constant, the number of clusters we get is not always equal to kmax. I used the same dataset.
Thanks @annoviko
@himanshu94, versions before 0.7 had two bugs (#326, #328) that were fixed in 0.7:
- Bug in the calculation of the BIC splitting criterion for the X-Means algorithm (pyclustering.cluster.xmeans).
See: https://github.com/annoviko/pyclustering/issues/326
- Bug in the calculation of the MNDL splitting criterion for the X-Means algorithm (pyclustering.cluster.xmeans).
See: https://github.com/annoviko/pyclustering/issues/328
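To illustrate why the BIC calculation matters for the final cluster count, here is a rough sketch of the BIC score as formulated in the original X-Means paper (Pelleg and Moore). It is not pyclustering's code, and the library's implementation may differ in detail; the helper names (`bic`, `centroid`, `squared_distance`) are hypothetical. Clusters are assumed to be non-empty lists of points.

```python
from math import log, pi


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def bic(clusters):
    """BIC of a clustering given as a list of non-empty point lists (higher is better)."""
    k = len(clusters)                    # number of clusters
    r = sum(len(c) for c in clusters)    # total number of points
    m = len(clusters[0][0])              # dimension of the data
    if r <= k:
        raise ValueError("need more points than clusters")
    # Pooled variance estimate under the identical spherical Gaussian assumption.
    sse = sum(squared_distance(p, centroid(c)) for c in clusters for p in c)
    variance = sse / (r - k)
    if variance <= 0.0:
        raise ValueError("degenerate clustering: zero variance")
    log_likelihood = 0.0
    for c in clusters:
        rn = len(c)
        log_likelihood += (
            rn * log(rn) - rn * log(r)
            - rn / 2 * log(2 * pi)
            - rn * m / 2 * log(variance)
            - (rn - k) / 2
        )
    free_params = (k - 1) + m * k + 1
    return log_likelihood - free_params / 2 * log(r)


left = [[0.0], [0.1], [0.2]]
right = [[10.0], [10.1], [10.2]]
print(bic([left + right]))   # parent: one cluster covering both groups
print(bic([left, right]))    # children: the two groups kept separate
```

A split is kept when the children score higher than the parent. An error in this score can therefore make the algorithm accept or reject splits incorrectly, which changes how quickly the cluster count approaches kmax.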
I will try to verify the implementation and add more tests to find out what could be wrong, but without the data it is not a trivial problem.
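In the meantime, a small harness along these lines could help narrow the problem down without sharing the data. This is only a sketch: `kmax_report` and `run_pyclustering_xmeans` are hypothetical helper names, and starting from 2 initial centers is an assumption. The calls to `xmeans`, `process`, `get_clusters` and `get_centers` are the ones already used or mentioned in this thread.

```python
def kmax_report(cluster_fn, sample, kmax_values):
    """Run cluster_fn(sample, kmax) for each kmax and summarise the outcome."""
    rows = []
    for kmax in kmax_values:
        clusters, centers = cluster_fn(sample, kmax)
        rows.append({
            "kmax": kmax,
            "clusters": len(clusters),
            "centers": len(centers),
            # True when splitting only stopped because the limit was reached.
            "hit_kmax": len(centers) == kmax,
            # True when centers and clusters disagree in count.
            "mismatch": len(centers) != len(clusters),
        })
    return rows


def run_pyclustering_xmeans(sample, kmax):
    # Imported lazily so kmax_report stays usable without pyclustering installed.
    from pyclustering.cluster.xmeans import xmeans
    from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer

    # Assumption: start from 2 centers and let X-Means split from there.
    initial_centers = kmeans_plusplus_initializer(sample, 2).initialize()
    instance = xmeans(sample, initial_centers, kmax=kmax)
    instance.process()
    return instance.get_clusters(), instance.get_centers()
```

Calling `kmax_report(run_pyclustering_xmeans, sample, range(2, 21))` on the private dataset and posting only the resulting rows would show whether every run stops at kmax and whether the center and cluster counts ever disagree.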
Introduction
The number of allocated centers does not match the number of allocated clusters. This bug was not observed in the Python part because the centers were calculated by the Python implementation.
For some tests, a similar problem is observed in the Python implementation: