annoviko / pyclustering

pyclustering is a Python, C++ data mining library.
https://pyclustering.github.io/
BSD 3-Clause "New" or "Revised" License
1.17k stars 249 forks source link

[pyclustering.cluster.xmeans] bic calculation #326

Closed tilmanbeck closed 7 years ago

tilmanbeck commented 7 years ago

according to the xmeans paper and the discussion here the term: p_j / 2 log(R) has to be subtracted from the sum of the log-likelihoods. So, shouldn't be line 383 in xmeans.py be replaced by: `return sum(scores) - ( K - 1 + dimension K + 1) * math.log(N)`

annoviko commented 7 years ago

Hello, @tilmanbeck

Thank you for the findings, I have applied this correction and two tests are failed:

....................
Obtained: [15, 15, 15, 30] Expected: [15, 15, 15, 15, 15]
F...
Obtained: [800] Expected: [400, 400]
F.............................
======================================================================
FAIL: testBicWrongStartClusterAllocationSampleSimple4 (__main__.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/andrei/workspace/pyclustering/pyclustering/cluster/tests/xmeans_tests.py", line 145, in testBicWrongStartClusterAllocationSampleSimple4
    self.templateLengthProcessData(SIMPLE_SAMPLES.SAMPLE_SIMPLE4, [[1.5, 0.0], [1.5, 2.0], [1.5, 4.0], [1.5, 6.0]], [15, 15, 15, 15, 15], splitting_type.BAYESIAN_INFORMATION_CRITERION);
  File "/home/andrei/workspace/pyclustering/pyclustering/cluster/tests/xmeans_tests.py", line 52, in templateLengthProcessData
    assert obtained_cluster_sizes == expected_cluster_length;
AssertionError

======================================================================
FAIL: testBicWrongStartClusterAllocationSampleTwoDiamonds (__main__.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/andrei/workspace/pyclustering/pyclustering/cluster/tests/xmeans_tests.py", line 193, in testBicWrongStartClusterAllocationSampleTwoDiamonds
    self.templateLengthProcessData(FCPS_SAMPLES.SAMPLE_TWO_DIAMONDS, [[0.8, 0.2]], [400, 400], splitting_type.BAYESIAN_INFORMATION_CRITERION);
  File "/home/andrei/workspace/pyclustering/pyclustering/cluster/tests/xmeans_tests.py", line 52, in templateLengthProcessData
    assert obtained_cluster_sizes == expected_cluster_length;
AssertionError

----------------------------------------------------------------------
Ran 54 tests in 0.170s

FAILED (failures=2)

I agree that at the same time other similar tests are passed (when we have wrong initial centers), and agree about the paper. I will investigate failures with these two tests and then I will apply your correction.

Or probably you have already got comments to these failures?

# The first failed test:
self.templateLengthProcessData(SIMPLE_SAMPLES.SAMPLE_SIMPLE4, [[1.5, 0.0], [1.5, 2.0], [1.5, 4.0], [1.5, 6.0]], [15, 15, 15, 15, 15], splitting_type.BAYESIAN_INFORMATION_CRITERION);

# The second failed test:
self.templateLengthProcessData(FCPS_SAMPLES.SAMPLE_TWO_DIAMONDS, [[0.8, 0.2]], [400, 400], splitting_type.BAYESIAN_INFORMATION_CRITERION);
tilmanbeck commented 7 years ago

Hi @annoviko , sorry, in my proposed correction I myself forgot the division by 2. I also simplified the math, should be: return sum(scores) - ((K * (dimension + 1)) / 2.0) * math.log(N);

I applied your tests and now only the second test still fails. I will look deeper into the test.

tilmanbeck commented 7 years ago

okay, I replaced in line 369:

sigma += (euclidean_distance_sqrt(self.__pointer_data[index_object], centers[index_cluster]));

In the paper, they are also using the squared difference. That made it work, however now 4 other tests are failing:

======================================================================
FAIL: testBicClusterAllocationSampleSimple2 (__main__.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "xmeans_tests.py", line 89, in testBicClusterAllocationSampleSimple2
    self.templateLengthProcessData(SIMPLE_SAMPLES.SAMPLE_SIMPLE2, [[3.5, 4.8], [6.9, 7], [7.5, 0.5]], [10, 5, 8], splitting_type.BAYESIAN_INFORMATION_CRITERION);
  File "xmeans_tests.py", line 50, in templateLengthProcessData
    assert obtained_cluster_sizes == expected_cluster_length;
AssertionError

======================================================================
FAIL: testBicClusterAllocationSampleSimple5 (__main__.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "xmeans_tests.py", line 161, in testBicClusterAllocationSampleSimple5
    self.templateLengthProcessData(SIMPLE_SAMPLES.SAMPLE_SIMPLE5, [[0.0, 1.0], [0.0, 0.0], [1.0, 1.0], [1.0, 0.0]], [15, 15, 15, 15], splitting_type.BAYESIAN_INFORMATION_CRITERION);
  File "xmeans_tests.py", line 50, in templateLengthProcessData
    assert obtained_cluster_sizes == expected_cluster_length;
AssertionError

======================================================================
FAIL: testBicWrongStartClusterAllocationSampleSimple2 (__main__.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "xmeans_tests.py", line 95, in testBicWrongStartClusterAllocationSampleSimple2
    self.templateLengthProcessData(SIMPLE_SAMPLES.SAMPLE_SIMPLE2, [[3.5, 4.8], [6.9, 7]], [10, 5, 8], splitting_type.BAYESIAN_INFORMATION_CRITERION);
  File "xmeans_tests.py", line 50, in templateLengthProcessData
    assert obtained_cluster_sizes == expected_cluster_length;
AssertionError

======================================================================
FAIL: testBicWrongStartClusterAllocationSampleSimple5 (__main__.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "xmeans_tests.py", line 167, in testBicWrongStartClusterAllocationSampleSimple5
    self.templateLengthProcessData(SIMPLE_SAMPLES.SAMPLE_SIMPLE5, [[0.0, 1.0], [0.0, 0.0]], [15, 15, 15, 15], splitting_type.BAYESIAN_INFORMATION_CRITERION);
  File "xmeans_tests.py", line 50, in templateLengthProcessData
    assert obtained_cluster_sizes == expected_cluster_length;
AssertionError

----------------------------------------------------------------------
Ran 54 tests in 0.171s

are these simple_samples self-generated? I was also wondering (before when just one test was faling) why the same tests using ccore worked properly even though they used same BIC calculation.

annoviko commented 7 years ago

@tilmanbeck, these "simple samples" are very simple samples, they should be handled by every clustering algorithm because they are spherical and they have gaussian distribution. Anyway clustering result of 'SimpleSample2' seems very wierd.

At the beginning we have two centers [blue stars] and at the end we have got five centers [grey stars] and three of them are related to only one (true) cluster and as a result we have five clusters instead of three.

I agree that X-Means is not the best algorithm and it is not able to allocate clusters with complex forms, but this one is pretty easy. I think it should be able to handle it from my point of view. I am going to investigate this problem more deeply tomorrow.

If you get any findings, please, do not hesitate to report about them.

xmeans_simple_sample_2

annoviko commented 7 years ago

Have done at the revision https://github.com/annoviko/pyclustering/commit/46d90cc3d49ba0a14edb5ea5b8aa3cc32f692043 .