annoviko / pyclustering

pyclustering is a Python/C++ data mining library.
https://pyclustering.github.io/
BSD 3-Clause "New" or "Revised" License

G-Means: Setting maximum number of clusters like for X-Means #602

Closed. tschechlovdev closed this issue 4 years ago.

tschechlovdev commented 4 years ago

Hi,

I wanted to ask if it is possible to add a k_max parameter to the call to gmeans, similar to xmeans, which supports this parameter? The reason is that gmeans returns a really large number of clusters for some datasets (sometimes it is even the same as the size of the dataset, which is the worst case). I do not know the reason behind this, but it would be nice if I could limit the number of clusters as I can do for xmeans.
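For comparison, here is roughly how the cap is used with xmeans today (a sketch based on pyclustering's documented xmeans API, with a built-in sample and a placeholder kmax value):

from pyclustering.cluster.xmeans import xmeans
from pyclustering.cluster.center_initializer import kmeans_plusplus_initializer
from pyclustering.samples.definitions import SIMPLE_SAMPLES
from pyclustering.utils import read_sample

data = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE1)

# start from two k-means++ centers; kmax caps how many clusters X-Means may produce
initial_centers = kmeans_plusplus_initializer(data, 2).initialize()
xmeans_instance = xmeans(data, initial_centers, kmax=20)
xmeans_instance.process()

print("Number of clusters:", len(xmeans_instance.get_clusters()))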

annoviko commented 4 years ago

Hi @tschechlovdev ,

The changes are introduced and available on the master branch. Be aware that G-Means tries to use the C++ implementation by default, so you have to build C++ pyclustering yourself if you are going to use the version from the master branch.
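If you cannot build the C++ library right away, one possible workaround is to force the pure Python implementation via the ccore flag (a sketch; ccore is part of the released gmeans signature, while k_max here assumes the new parameter from this issue):

from pyclustering.cluster.gmeans import gmeans
from pyclustering.samples.definitions import SIMPLE_SAMPLES
from pyclustering.utils import read_sample

data = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE1)

# ccore=False selects the pure Python implementation, so no C++ build is required;
# k_max is the new limit parameter discussed in this issue
gmeans_instance = gmeans(data, ccore=False, k_max=200).process()
print("Number of clusters:", len(gmeans_instance.get_clusters()))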

tschechlovdev commented 4 years ago

Hi @annoviko,

thank you for your fast support! That helps me a lot!

However, I think it does not work 100% correctly. I am specifying k_max=200 for some datasets, but I am getting results such as Number of clusters: 226, Number of clusters: 223, ....

I am running the following code:


from pyclustering.cluster.gmeans import gmeans
from sklearn.datasets import make_blobs
import numpy as np

# data is a synthetically generated dataset created with make_blobs from sklearn
# https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html
# (the exact generation parameters were not given; these are illustrative)
data, _ = make_blobs(n_samples=5000, centers=25, n_features=10, random_state=0)

gmeans_instance = gmeans(data, k_max=200)

labels = gmeans_instance.process().predict(data)

predicted_k = len(np.unique(labels))
print("Number of clusters: {}".format(predicted_k))

I am also a bit concerned because I am running gmeans on synthetic datasets with Gaussian distributions that have, for example, 25 clusters, generated using the make_blobs function from sklearn. Actually, I thought gmeans should be well suited for these datasets. Yet, gmeans always runs into k_max (or even a higher value) for me. Do you have any idea why this happens?

annoviko commented 4 years ago

Hi @tschechlovdev , Yes, that is possible: I've allowed G-Means to perform one last full statistical optimization pass before stopping. Suppose we have 150 clusters at step N, and suppose that the statistical optimization requires splitting 76 of them. As a result we have 226 clusters and we stop processing due to the limit kmax=200. Otherwise it would only be a partial statistical optimization.

But I agree this is confusing behavior. I will interrupt the statistical optimization as well.
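A toy sketch of that check order (not the real implementation): the cluster count grows by a whole split pass before the limit is tested, which is how kmax=200 can end at 226:

# toy illustration only: each entry is the number of clusters split in one full pass
def run_passes(split_counts, k_max):
    k = 1
    for splits in split_counts:
        k += splits              # a complete split pass is applied first...
        if k >= k_max:           # ...and only then is the limit checked
            break
    return k

print(run_passes([9, 40, 100, 76], k_max=200))  # -> 226, as in the example above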

annoviko commented 4 years ago

I am also a bit concerned because I am running gmeans on synthetic datasets with Gaussian distributions that have, for example, 25 clusters, generated using the make_blobs function from sklearn. Actually, I thought gmeans should be well suited for these datasets. Yet, gmeans always runs into k_max (or even a higher value) for me. Do you have any idea why this happens?

It is possible due to k-means++: this algorithm has a randomized mechanism to find optimal centers for a particular number of clusters, thus there is a probability that the number of clusters is going to differ between runs. There is an example with the simple data set SAMPLE03: [two result images attached in the original comment]
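A quick way to observe this variability (a sketch using a built-in sample; without a fixed seed, the reported number of clusters can change between runs):

from pyclustering.cluster.gmeans import gmeans
from pyclustering.samples.definitions import SIMPLE_SAMPLES
from pyclustering.utils import read_sample

data = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)

# k-means++ initialization is randomized, so repeated runs may find
# a different number of clusters on the same data
for _ in range(5):
    clusters = gmeans(data).process().get_clusters()
    print("Number of clusters:", len(clusters))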

annoviko commented 4 years ago

@tschechlovdev , As I can see, you want to have labels instead of CLUSTER_INDEX_LIST_SEPARATION. There is a way to do that:

from pyclustering.cluster.gmeans import gmeans
from pyclustering.cluster.encoder import type_encoding, cluster_encoder
from pyclustering.samples.definitions import SIMPLE_SAMPLES
from pyclustering.utils import read_sample

# read one of the built-in simple samples
data = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE1)

# run the cluster analysis once
gmeans_instance = gmeans(data).process()
clusters = gmeans_instance.get_clusters()

# convert from the index-list representation to per-point labels
encoder = cluster_encoder(type_encoding.CLUSTER_INDEX_LIST_SEPARATION, clusters, data)
encoder.set_encoding(type_encoding.CLUSTER_INDEX_LABELING)
labels = encoder.get_clusters()

print(labels)

Output:

[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

This is worth doing because currently you perform cluster analysis of your data twice.

tschechlovdev commented 4 years ago

Hi @tschechlovdev , Yes, that is possible: I've allowed G-Means to perform one last full statistical optimization pass before stopping. Suppose we have 150 clusters at step N, and suppose that the statistical optimization requires splitting 76 of them. As a result we have 226 clusters and we stop processing due to the limit kmax=200. Otherwise it would only be a partial statistical optimization.

But I agree this is confusing behavior. I will interrupt the statistical optimization as well.

Ah, I see, that makes sense.

It is possible due to k-means++: this algorithm has a randomized mechanism to find optimal centers for a particular number of clusters, thus there is a probability that the number of clusters is going to differ between runs. There is an example with the simple data set SAMPLE03:

I see. That's why I get different results when trying multiple times. However, I tried with different datasets, and it seems that GMeans has a problem with high-dimensional (d > 25) datasets. For these, it always runs into k_max (or even higher). For datasets with d < 20 it seems to work fine, but I think that is rather a problem with the algorithm in general.

@tschechlovdev , As I can see, you want to have labels instead of CLUSTER_INDEX_LIST_SEPARATION. There is a way to do that: ...

Thanks, didn't know about that way :)

annoviko commented 4 years ago

@tschechlovdev , I have already implemented an additional parameter for g-means in scope of #578. It means that you can set your own random state as a random seed, in other words, the ability to control functions like random.seed(random_state) and numpy.random.seed(random_state) that are used inside algorithms with random functionality, such as G-Means, X-Means, SyncNet, etc. Here is an example:

gmeans_instance = gmeans(data, random_state=1, kmax=200).process()

I think tomorrow I will be able to deliver these changes to the master branch, including the update for k_max, and you will be able to try these parameters.
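Once that lands, fixing the seed should make runs reproducible, for example (a sketch assuming the random_state kwarg behaves as described above):

from pyclustering.cluster.gmeans import gmeans
from pyclustering.samples.definitions import SIMPLE_SAMPLES
from pyclustering.utils import read_sample

data = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)

# the same random_state should yield the same clustering on both runs
first = gmeans(data, random_state=1).process().get_clusters()
second = gmeans(data, random_state=1).process().get_clusters()
print(len(first) == len(second))  # expected: True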

annoviko commented 4 years ago

@tschechlovdev ,

The changes regarding gmeans are on the master branch. You can use random_state to specify a seed for the random engine and kmax to limit the number of clusters. The C++ pyclustering library has to be built in order to use the changes.
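Putting the thread together, usage on master should look roughly like this (a sketch; the limit keyword is written both as kmax and k_max in this thread, so check the signature of your build):

from pyclustering.cluster.gmeans import gmeans
from pyclustering.cluster.encoder import type_encoding, cluster_encoder
from pyclustering.samples.definitions import SIMPLE_SAMPLES
from pyclustering.utils import read_sample

data = read_sample(SIMPLE_SAMPLES.SAMPLE_SIMPLE3)

# fixed seed for reproducibility, k_max to cap the number of clusters
gmeans_instance = gmeans(data, random_state=1, k_max=200).process()
clusters = gmeans_instance.get_clusters()

# convert the index-list representation into per-point labels in a single pass
encoder = cluster_encoder(type_encoding.CLUSTER_INDEX_LIST_SEPARATION, clusters, data)
encoder.set_encoding(type_encoding.CLUSTER_INDEX_LABELING)
labels = encoder.get_clusters()

print("Number of clusters:", len(clusters))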