elki-project / elki

ELKI Data Mining Toolkit
https://elki-project.github.io/
GNU Affero General Public License v3.0
780 stars 321 forks source link

CLIQUE - Connected dense units #53

Closed georgekatona closed 5 years ago

georgekatona commented 5 years ago

Inspecting results of elki's CLIQUE clustering implementations revealed an issue of the implementation. The original paper AGGR98 at paragraph 2.1 states that two dense units are connected if they have a common face (they are identical in n-1 dimensions and in 1 dimension they are neighboring).

According to this the following output should be impossible:

# Cluster: cluster_1
# Cluster name: cluster_1
# Cluster noise flag: false
# Cluster size: 500
# Model class: de.lmu.ifi.dbs.elki.data.model.SubspaceModel
# Cluster Mean: 0.5058922207711539, 0.5997058865585326
# Subspace: Dimensions: [1]
# Coverage: 500
# Units: 
#    d1-[0.04; 0.33[    127 objects
#    d1-[0.33; 0.62[    207 objects
#    d1-[0.62; 0.92[    166 objects

In this case all dense units are neighboring in both of the dimensions, yet the current implementation considers it one cluster.

kno10 commented 5 years ago

Can you please try to find the error and provide a patch?

There is noone else working on CLIQUE these days anymore.

Maybe it will then eventually produce useful results...

georgekatona commented 5 years ago

Unfortunately I don't have the time to debug it completely but as I briefly looked into it, I would start debugging around here.

kno10 commented 5 years ago

Actually I do not see what is wrong with above output. It is one cluster; it consists of three units. The cluster is n=1 dimensional. The three units are identical in n-1=0 dimensions, and they are neighbors in 1. So this seems to agree with the definitions.

georgekatona commented 5 years ago

Oh, you are right, I copied the wrong cluster. I experienced the error in the two dimensional feature space:

# Cluster: cluster_3
# Cluster name: cluster_3
# Cluster noise flag: false
# Cluster size: 333
# Model class: de.lmu.ifi.dbs.elki.data.model.SubspaceModel
# Cluster Mean: 0.4916644248110665, 0.6574075443435105
# Subspace: Dimensions: [1, 2]
# Coverage: 333
# Units: 
#    d1-[0.04; 0.33[ d2-[0.64; 0.90[    101 objects
#    d1-[0.33; 0.62[ d2-[0.39; 0.64[    127 objects
#    d1-[0.62; 0.92[ d2-[0.64; 0.90[    105 objects
kno10 commented 5 years ago

Yes, that does not look quite right. From a quick look, it may simply be missing the check for the other dimensions to be identical.

Can you share a test data set to reproduce the problem?

kno10 commented 5 years ago

I may have a fixed version ready to go, but I don't have a test case to verify this. Overall, the CLIQUE code is really old, dating back to ELKI 0.1. It also is not well optimized. But so is the algorithm: old, and not working too well on real problems.

georgekatona commented 5 years ago

I have run it on the mouse sample data set with xsi = 3, tau = 0.1.

kno10 commented 5 years ago

Thanks. It seems that the cleanup of CLIQUE in commit 474f81f resolves this. It should also be faster, but this is untested - the code is still not optimized much (but it probably also is not worth the time doing so, because of the method's limitations. P3C generally is much better.).