annoviko / pyclustering

pyclustering is a Python, C++ data mining library.
https://pyclustering.github.io/
BSD 3-Clause "New" or "Revised" License

[ccore.xmeans][pyclustering.cluster.xmeans] Amount of centers and amount of clusters not matched #389

Closed annoviko closed 6 years ago

annoviko commented 6 years ago

Introduction

The number of allocated centers does not match the number of allocated clusters. This bug was not observed in the Python part because the centers were calculated by the Python implementation.

======================================================================
FAIL: testMndlWrongStartClusterAllocationSampleSimple2ByCore (__main__.XmeansIntegrationTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\workspace\pyclustering\pyclustering\cluster\tests\integration\it_xmeans.py", line 74, in testMndlWrongStartClusterAllocationSampleSimple2ByCore
    XmeansTestTemplates.templateLengthProcessData(SIMPLE_SAMPLES.SAMPLE_SIMPLE2, [[3.5, 4.8], [6.9, 7]], [10, 5, 8], splitting_type.MINIMUM_NOISELESS_DESCRIPTION_LENGTH, 20, True);
  File "D:\workspace\pyclustering\pyclustering\cluster\tests\xmeans_templates.py", line 49, in templateLengthProcessData
    assert len(clusters) == len(centers);
AssertionError

For some tests, a similar problem is observed in the Python implementation:

======================================================================
FAIL: testBicClusterAllocationMaxLessRealSampleSimple4 (__main__.XmeansUnitTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\workspace\pyclustering\pyclustering\cluster\tests\unit\ut_xmeans.py", line 92, in testBicClusterAllocationMaxLessRealSampleSimple4
    XmeansTestTemplates.templateLengthProcessData(SIMPLE_SAMPLES.SAMPLE_SIMPLE4, [[1.5, 4.0]], None, splitting_type.BAYESIAN_INFORMATION_CRITERION, 2, False);
  File "D:\workspace\pyclustering\pyclustering\cluster\tests\xmeans_templates.py", line 49, in templateLengthProcessData
    assert len(clusters) == len(centers);
AssertionError
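
The failing assertion checks that every allocated cluster has exactly one center. As a rough cross-check, centers can be recomputed directly from the clusters, which yields one center per cluster by construction. The sketch below assumes clusters are lists of point indices into the sample (the format shown in the output later in this thread); `centers_from_clusters` is a hypothetical helper, not part of pyclustering.

```python
def centers_from_clusters(sample, clusters):
    """Return one centroid per cluster, computed as the mean of its points.

    Hypothetical helper for illustration only (not a pyclustering function).
    """
    centers = []
    for cluster in clusters:
        if not cluster:
            raise ValueError("an empty cluster has no centroid")
        dimension = len(sample[cluster[0]])
        # Average each coordinate over the points that belong to the cluster.
        center = [
            sum(sample[index][d] for index in cluster) / len(cluster)
            for d in range(dimension)
        ]
        centers.append(center)
    return centers


# Toy data: four 2-D points split into two clusters of point indices.
sample = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
clusters = [[0, 1], [2, 3]]

centers = centers_from_clusters(sample, clusters)
print(centers)                        # [[0.0, 1.0], [5.0, 4.0]]
print(len(centers) == len(clusters))  # True
```

If the centers reported by the algorithm differ in count from the ones recomputed this way, the mismatch lies in how the centers are reported rather than in the cluster allocation itself.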

himanshu94 commented 6 years ago

The number of clusters I get from pyclustering.cluster.xmeans.xmeans.get_centers is always equal to the value of kmax. I checked this by getting the clusters while iterating kmax over a range of values.

Thanks

annoviko commented 6 years ago

@himanshu94,

Formally, it means that the cluster separation process stopped when kmax was reached.

Have you tried to increase kmax? Could you please provide the data sample and code example that you are using to reproduce the issue?

Here is an example of xmeans usage where the number of allocated clusters is less than kmax:

from pyclustering.cluster.xmeans import xmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer

from pyclustering.utils import read_sample

from pyclustering.samples.definitions import SIMPLE_SAMPLES

# Read dataset 'SAMPLE_SIMPLE2'
sample = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE2)
initial_centers = kmeans_plusplus_initializer(sample, 3).initialize()

# Use Python implementation
xmeans_instance = xmeans(sample, initial_centers)
xmeans_instance.process()
clusters = xmeans_instance.get_clusters()

# Display allocated clusters
print(clusters)

# Use C/C++ implementation
xmeans_instance = xmeans(sample, initial_centers, ccore=True)
xmeans_instance.process()
clusters = xmeans_instance.get_clusters()

# Display allocated clusters
print(clusters)

Output:

[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [15, 16, 17, 18, 19, 20, 21, 22], [10, 11, 12, 13, 14]]
[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [15, 16, 17, 18, 19, 20, 21, 22], [10, 11, 12, 13, 14]]

Here is an example of clustering that started from 2 clusters and finished at 4 clusters with kmax = 20. (Image: elongate_four_clusters)

himanshu94 commented 6 years ago

Actually, I can't share the data. However, I have run K-Means and evaluated its clusters by silhouette value for different iterations, so I can say that at some point the number of clusters formed should be less than kmax. To verify this, I ran X-Means while iterating kmax over a range of values, but the number of clusters produced is always the same as the kmax value. When I ran X-Means from the previous version (before updating to 0.7), the number of clusters was not equal to kmax.

Thanks
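
The kmax sweep described above can be sketched as a small harness that records, for each kmax, how many clusters came back and whether the limit was hit. This is only an illustration: `sweep_kmax` and `fake_run` are hypothetical names, and `fake_run` is a stand-in that pretends the data has four natural clusters so the sketch runs without any dataset.

```python
def sweep_kmax(run_clustering, kmax_values):
    """Run clustering once per kmax and report (kmax, clusters found, limit hit).

    `run_clustering` is any callable that takes kmax and returns a list of clusters.
    """
    report = []
    for kmax in kmax_values:
        clusters = run_clustering(kmax)
        found = len(clusters)
        # If the count equals kmax, splitting may have stopped only because of the limit.
        report.append((kmax, found, found >= kmax))
    return report


def fake_run(kmax):
    """Stand-in for a real clustering call: the 'data' has 4 natural clusters."""
    natural_clusters = 4
    return [[index] for index in range(min(natural_clusters, kmax))]


report = sweep_kmax(fake_run, [2, 3, 4, 8, 20])
for kmax, found, hit_limit in report:
    print(kmax, found, hit_limit)
```

With real data, `run_clustering` could wrap the calls shown earlier in this thread (`xmeans(...)`, `process()`, `get_clusters()`), assuming kmax is passed as the `kmax` keyword argument of the `xmeans` constructor. A cluster count that tracks kmax across the whole sweep, as reported here, means the limit was hit on every run.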

himanshu94 commented 6 years ago

I verified it using the old version of pyclustering. With everything else held constant, the number of clusters we get is not always equal to kmax. I used the same dataset.

Thanks @annoviko

annoviko commented 6 years ago

@himanshu94, the previous version (before 0.7) had two bugs (#326, #328) that were fixed in 0.7:

- Bug in the calculation of the BIC splitting criterion for the X-Means algorithm (pyclustering.cluster.xmeans).
  See: https://github.com/annoviko/pyclustering/issues/326

- Bug in the calculation of the MNDL splitting criterion for the X-Means algorithm (pyclustering.cluster.xmeans).
  See: https://github.com/annoviko/pyclustering/issues/328

I will try to verify the implementation and add more tests to find out what could be wrong, but without the data it is not a trivial problem.
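
For context on why a fix to the splitting criterion can change the number of clusters: X-Means keeps a split only when the criterion scores the split model better than the unsplit one. The sketch below computes a BIC score in the commonly cited form from the original X-Means paper (spherical Gaussians with a pooled variance). It is an assumption-laden illustration, not pyclustering's actual code, and `bic_score` is a hypothetical name. It also assumes non-empty clusters, more points than clusters, and a non-zero variance.

```python
import math


def bic_score(sample, clusters, centers):
    """BIC of a clustering under a spherical Gaussian model (higher is better).

    Illustrative only: pyclustering's own calculation may differ in details.
    """
    total = sum(len(cluster) for cluster in clusters)  # number of points R
    k = len(clusters)                                  # number of clusters K
    dimension = len(sample[0])                         # dimension M
    if total <= k:
        raise ValueError("need more points than clusters")

    # Pooled variance estimate: squared distances to own center over (R - K).
    squared_error = 0.0
    for cluster, center in zip(clusters, centers):
        for index in cluster:
            squared_error += sum((x - c) ** 2 for x, c in zip(sample[index], center))
    variance = squared_error / (total - k)
    if variance <= 0.0:
        raise ValueError("variance must be positive")

    # Sum of per-cluster log-likelihood terms.
    log_likelihood = 0.0
    for cluster in clusters:
        n = len(cluster)
        log_likelihood += (
            n * math.log(n) - n * math.log(total)
            - n / 2.0 * math.log(2.0 * math.pi)
            - n * dimension / 2.0 * math.log(variance)
            - (n - k) / 2.0
        )

    # Penalty: (K - 1) mixing weights, M * K center coordinates, 1 variance.
    free_parameters = (k - 1) + dimension * k + 1
    return log_likelihood - free_parameters / 2.0 * math.log(total)


# Two well-separated square blobs of four points each.
sample = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
one_cluster = bic_score(sample, [list(range(8))], [[5.5, 5.5]])
two_clusters = bic_score(sample, [[0, 1, 2, 3], [4, 5, 6, 7]], [[0.5, 0.5], [10.5, 10.5]])
print(two_clusters > one_cluster)  # True: the split is preferred
```

Because the split decision is a direct comparison of two such scores, even a small error in one term can flip decisions and change the final cluster count between versions.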