Closed tilmanbeck closed 7 years ago
Hello, @tilmanbeck
Thank you for the findings, I have applied this correction and two tests are failed:
....................
Obtained: [15, 15, 15, 30] Expected: [15, 15, 15, 15, 15]
F...
Obtained: [800] Expected: [400, 400]
F.............................
======================================================================
FAIL: testBicWrongStartClusterAllocationSampleSimple4 (__main__.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/andrei/workspace/pyclustering/pyclustering/cluster/tests/xmeans_tests.py", line 145, in testBicWrongStartClusterAllocationSampleSimple4
self.templateLengthProcessData(SIMPLE_SAMPLES.SAMPLE_SIMPLE4, [[1.5, 0.0], [1.5, 2.0], [1.5, 4.0], [1.5, 6.0]], [15, 15, 15, 15, 15], splitting_type.BAYESIAN_INFORMATION_CRITERION);
File "/home/andrei/workspace/pyclustering/pyclustering/cluster/tests/xmeans_tests.py", line 52, in templateLengthProcessData
assert obtained_cluster_sizes == expected_cluster_length;
AssertionError
======================================================================
FAIL: testBicWrongStartClusterAllocationSampleTwoDiamonds (__main__.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/andrei/workspace/pyclustering/pyclustering/cluster/tests/xmeans_tests.py", line 193, in testBicWrongStartClusterAllocationSampleTwoDiamonds
self.templateLengthProcessData(FCPS_SAMPLES.SAMPLE_TWO_DIAMONDS, [[0.8, 0.2]], [400, 400], splitting_type.BAYESIAN_INFORMATION_CRITERION);
File "/home/andrei/workspace/pyclustering/pyclustering/cluster/tests/xmeans_tests.py", line 52, in templateLengthProcessData
assert obtained_cluster_sizes == expected_cluster_length;
AssertionError
----------------------------------------------------------------------
Ran 54 tests in 0.170s
FAILED (failures=2)
I agree that at the same time other similar tests are passed (when we have wrong initial centers), and agree about the paper. I will investigate failures with these two tests and then I will apply your correction.
Or probably you have already got comments to these failures?
# The first failed test:
self.templateLengthProcessData(SIMPLE_SAMPLES.SAMPLE_SIMPLE4, [[1.5, 0.0], [1.5, 2.0], [1.5, 4.0], [1.5, 6.0]], [15, 15, 15, 15, 15], splitting_type.BAYESIAN_INFORMATION_CRITERION);
# The second failed test:
self.templateLengthProcessData(FCPS_SAMPLES.SAMPLE_TWO_DIAMONDS, [[0.8, 0.2]], [400, 400], splitting_type.BAYESIAN_INFORMATION_CRITERION);
Hi @annoviko ,
sorry, in my proposed correction I myself forgot the division by 2. I also simplified the math, should be:
return sum(scores) - ((K * (dimension + 1)) / 2.0) * math.log(N);
I applied your tests and now only the second test still fails. I will look deeper into the test.
okay, I replaced in line 369:
sigma += (euclidean_distance_sqrt(self.__pointer_data[index_object], centers[index_cluster]));
In the paper, they are also using the squared difference. That made it work, however now 4 other tests are failing:
======================================================================
FAIL: testBicClusterAllocationSampleSimple2 (__main__.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
File "xmeans_tests.py", line 89, in testBicClusterAllocationSampleSimple2
self.templateLengthProcessData(SIMPLE_SAMPLES.SAMPLE_SIMPLE2, [[3.5, 4.8], [6.9, 7], [7.5, 0.5]], [10, 5, 8], splitting_type.BAYESIAN_INFORMATION_CRITERION);
File "xmeans_tests.py", line 50, in templateLengthProcessData
assert obtained_cluster_sizes == expected_cluster_length;
AssertionError
======================================================================
FAIL: testBicClusterAllocationSampleSimple5 (__main__.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
File "xmeans_tests.py", line 161, in testBicClusterAllocationSampleSimple5
self.templateLengthProcessData(SIMPLE_SAMPLES.SAMPLE_SIMPLE5, [[0.0, 1.0], [0.0, 0.0], [1.0, 1.0], [1.0, 0.0]], [15, 15, 15, 15], splitting_type.BAYESIAN_INFORMATION_CRITERION);
File "xmeans_tests.py", line 50, in templateLengthProcessData
assert obtained_cluster_sizes == expected_cluster_length;
AssertionError
======================================================================
FAIL: testBicWrongStartClusterAllocationSampleSimple2 (__main__.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
File "xmeans_tests.py", line 95, in testBicWrongStartClusterAllocationSampleSimple2
self.templateLengthProcessData(SIMPLE_SAMPLES.SAMPLE_SIMPLE2, [[3.5, 4.8], [6.9, 7]], [10, 5, 8], splitting_type.BAYESIAN_INFORMATION_CRITERION);
File "xmeans_tests.py", line 50, in templateLengthProcessData
assert obtained_cluster_sizes == expected_cluster_length;
AssertionError
======================================================================
FAIL: testBicWrongStartClusterAllocationSampleSimple5 (__main__.Test)
----------------------------------------------------------------------
Traceback (most recent call last):
File "xmeans_tests.py", line 167, in testBicWrongStartClusterAllocationSampleSimple5
self.templateLengthProcessData(SIMPLE_SAMPLES.SAMPLE_SIMPLE5, [[0.0, 1.0], [0.0, 0.0]], [15, 15, 15, 15], splitting_type.BAYESIAN_INFORMATION_CRITERION);
File "xmeans_tests.py", line 50, in templateLengthProcessData
assert obtained_cluster_sizes == expected_cluster_length;
AssertionError
----------------------------------------------------------------------
Ran 54 tests in 0.171s
are these simple_samples self-generated? I was also wondering (before when just one test was faling) why the same tests using ccore worked properly even though they used same BIC calculation.
@tilmanbeck, these "simple samples" are very simple samples, they should be handled by every clustering algorithm because they are spherical and they have gaussian distribution. Anyway clustering result of 'SimpleSample2' seems very wierd.
At the beginning we have two centers [blue stars] and at the end we have got five centers [grey stars] and three of them are related to only one (true) cluster and as a result we have five clusters instead of three.
I agree that X-Means is not the best algorithm and it is not able to allocate clusters with complex forms, but this one is pretty easy. I think it should be able to handle it from my point of view. I am going to investigate this problem more deeply tomorrow.
If you get any findings, please, do not hesitate to report about them.
Have done at the revision https://github.com/annoviko/pyclustering/commit/46d90cc3d49ba0a14edb5ea5b8aa3cc32f692043 .
according to the xmeans paper and the discussion here the term: p_j / 2 log(R) has to be subtracted from the sum of the log-likelihoods. So, shouldn't be line 383 in xmeans.py be replaced by: `return sum(scores) - ( K - 1 + dimension K + 1) * math.log(N)`