yuliaUU commented 6 months ago

I am testing KValid with the following dataset: https://www.kaggle.com/datasets/crawford/gene-expression

The output with my settings:

KValid
======

=== Clustering validation, using: Elbow method (SSE) ===

For k = 3
SSE: 307911.9827729013

For k = 4
SSE: 284224.5253161817

For k = 5
SSE: 265002.0829083235

For k = 6
SSE: 271168.4306434722

For k = 7
SSE: 255451.7226422783

For k = 8
SSE: 248347.4649960015

For k = 9
SSE: 245387.35604437318

For k = 10
SSE: 240754.6914872431

So based on this output, k should be 5, but what I see at the end is 3 clusters:

=== Model and evaluation on training set ===

Clustered Instances

0      520 ( 65%)
1      145 ( 18%)
2      136 ( 17%)

Class attribute: Class
Classes to Clusters:

   0   1   2  <-- assigned to cluster
   0   0 136 | PRAD
 141   0   0 | LUAD
 300   0   0 | BRCA
   1 145   0 | KIRC
  78   0   0 | COAD

Cluster 0 <-- BRCA
Cluster 1 <-- KIRC
Cluster 2 <-- PRAD

Incorrectly clustered instances :   220.0    27.4657 %

Elbow plot also shows optimal number of clusters to be 5.

Also, when I set cascade=false, no graph shows up, but the optimal number of clusters is determined appropriately:

=== Model and evaluation on training set ===

Clustered Instances

0      187 ( 23%)
1      145 ( 18%)
2      136 ( 17%)
3       74 (  9%)
4      259 ( 32%)

Class attribute: Class
Classes to Clusters:

   0   1   2   3   4  <-- assigned to cluster
   0   0 136   0   0 | PRAD
 141   0   0   0   0 | LUAD
  41   0   0   0 259 | BRCA
   1 145   0   0   0 | KIRC
   4   0   0  74   0 | COAD

Cluster 0 <-- LUAD
Cluster 1 <-- KIRC
Cluster 2 <-- PRAD
Cluster 3 <-- COAD
Cluster 4 <-- BRCA

Incorrectly clustered instances :   46.0      5.7428 %

yuliaUU commented 6 months ago

I think I am not interpreting correctly what cascade does and what numClasses does. If cascade is False and I set numClass to 5, then the method just computes SSE for k=5.

However, when cascade is True and I set numClass to 5, numClass is ignored, and SSE is calculated for every k in the range minimumK to maximumK. But in this case I don't understand how the optimal number of clusters is chosen (the chosen k had the highest SSE).

More digging from me:

Cascade method: selects the best k according to the Calinski-Harabasz (CH) criterion, sometimes called the variance ratio criterion (VRC). Well-defined clusters have a large between-cluster variance and a small within-cluster variance. The optimal number of clusters corresponds to the solution with the highest Calinski-Harabasz index value.

In contrast, the SSE (Sum of Squared Errors) that we calculate for the Elbow plot, also known as intra-cluster variation or within-cluster sum of squares, is the sum of the squared distances between each data point and the centroid of its cluster. SSE quantifies the compactness of clusters; lower SSE values indicate tighter, more compact clusters.

So these two metrics measure different aspects:

CH Criterion (VRC):

- Focuses on the overall separation and distinctiveness of clusters.
- Higher CH index values indicate better-defined and well-separated clusters.
- It considers both within-cluster variance (the dispersion of points within a cluster) and between-cluster variance (the dispersion between clusters).
- A higher CH index suggests a better clustering solution.

SSE:

- Measures the compactness or tightness of clusters.
- Lower SSE values indicate that data points within clusters are closer to their centroids.
- It only considers within-cluster variance; it does not take between-cluster variance into account.
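
To make the difference concrete, here is a minimal sketch that computes both quantities for a hard clustering. It is not taken from Weka or KValid; the function names and the toy points are made up for illustration, and it uses the usual textbook form of the CH index, (between / (k - 1)) / (within / (n - k)).

```python
from math import dist  # Python 3.8+


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def sse_and_ch(points, labels):
    """Return (SSE, Calinski-Harabasz index) for a hard clustering.

    Assumes at least two clusters and a non-zero SSE.
    """
    n = len(points)
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    overall = centroid(points)
    within = 0.0   # within-cluster sum of squares (this is the SSE)
    between = 0.0  # between-cluster sum of squares
    for members in clusters.values():
        c = centroid(members)
        within += sum(dist(p, c) ** 2 for p in members)
        between += len(members) * dist(c, overall) ** 2
    ch = (between / (k - 1)) / (within / (n - k))
    return within, ch


# Hypothetical toy data: two tight pairs that are far apart.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
sse, ch = sse_and_ch(points, [0, 0, 1, 1])          # sensible grouping
bad_sse, bad_ch = sse_and_ch(points, [0, 1, 0, 1])  # grouping that splits the pairs
print(sse, ch, bad_sse, bad_ch)
```

With the same number of clusters, the sensible grouping has both the lower SSE and the higher CH index. The two criteria diverge when k changes: SSE tends to keep falling as k grows, while CH divides by k - 1 and n - k and so does not reward extra clusters automatically.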

Theldus commented 6 months ago

Hello @yuliaUU, Thank you for giving KValid a chance...

However... it's been 6/7 years since I last touched this code, and I've basically forgotten even how to use Weka. I'm trying to replicate your steps using the dataset you mentioned, but I can't even import the CSV; I'm getting `Attribute names are not unique! Causes: 'call', 'call', 'call'...`

Please tell me:

- Which version of Weka are you using?
- What is your Java version?
- If possible, attach the .csv file you're using in your response (just drag-and-drop), and let me know if any special procedure is needed to open it.

yuliaUU commented 6 months ago

The file is too big to upload here, so I made a zip file with a subset of the data: https://github.com/yuliaUU/data/blob/main/data.zip

yuliaUU commented 6 months ago

Weka 3.8.6 (the newest stable version); Java was downloaded automatically with it.

Theldus commented 6 months ago

Hi @yuliaUU, thanks for the additional data,

This plugin is based on SimpleKMeans, and when cascade is disabled, you are basically only using Weka's SimpleKMeans, so in general: if you want to use this plugin, you need to keep cascade enabled.

When cascade is activated, SimpleKMeans is then run for each k value configured in the range, and the Elbow or Silhouette is calculated for each of them.
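
Conceptually, the cascade is just a loop over k. The sketch below is not KValid's code and does not call Weka; it is a plain Lloyd's k-means in Python with hypothetical names (`kmeans_sse`, `cascade`) and toy data. One deliberate simplification: centroids start at the first k points so the run is deterministic, whereas the run shown in this thread reports random starting points.

```python
from math import dist  # Python 3.8+


def kmeans_sse(points, k, max_iter=100):
    """Run a plain Lloyd's k-means and return the final SSE.

    Assumption for this sketch: initial centroids are the first k points.
    """
    centroids = [tuple(map(float, p)) for p in points[:k]]
    assign = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assign = [min(range(k), key=lambda j: dist(p, centroids[j]))
                      for p in points]
        if new_assign == assign:
            break  # converged: assignments did not change
        assign = new_assign
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(dist(p, centroids[a]) ** 2 for p, a in zip(points, assign))


def cascade(points, min_k, max_k):
    """Cluster once per k in [min_k, max_k] and collect the SSE of each run."""
    return {k: kmeans_sse(points, k) for k in range(min_k, max_k + 1)}


# Hypothetical toy data: three tight groups, listed interleaved.
points = [(0, 0), (10, 10), (20, 0),
          (1, 0), (11, 10), (21, 0),
          (0, 1), (10, 11), (20, 1)]
sse_by_k = cascade(points, 1, 4)
for k, sse in sse_by_k.items():
    print(f"For k = {k}\nSSE: {sse}")
```

On this toy data the SSE falls sharply up to k = 3 and only slightly afterwards, which is the 'elbow' shape the graph is meant to reveal.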

Now, there's a set of things going on, some you might consider bugs and some you might not:

1. KValid only explicitly suggests the best K for Silhouette Index, as can be seen here:
```
=== Run information ===

Scheme:       weka.clusterers.KValid -init 0 -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -validation 0 -cascade -minK 3 -maxK 10 -show-graph -S 10
Relation:     data
Instances:    208
Attributes:   16384
              [list of attributes omitted]
Test mode:    evaluate on training data

=== Clustering model (full training set) ===

KValid

=== Clustering validation, using: Silhouette Index ===

For k = 3
Cluster 0: 0.1282, veredict: a non substancial structure was found!
Cluster 1: 0.0427, veredict: a non substancial structure was found!
Cluster 2: 0.1309, veredict: a non substancial structure was found!
Mean: 0.1006, veredict: a non substancial structure was found!

For k = 4
Cluster 0: 0.2968, veredict: weak structure!
[...]
Cluster 3: 0.2723, veredict: weak structure!
Mean: 0.1861, veredict: a non substancial structure was found!

For k = 5
Cluster 0: 0.1401, veredict: a non substancial structure was found!
[...]
Cluster 4: 0.2885, veredict: weak structure!
Mean: 0.1559, veredict: a non substancial structure was found!

For k = 6
Cluster 0: 0.1401, veredict: a non substancial structure was found!
[...]
Cluster 4: 0.2875, veredict: weak structure!
Cluster 5: 0.1340, veredict: a non substancial structure was found!
Mean: 0.1842, veredict: a non substancial structure was found!

For k = 7
Cluster 0: 0.1213, veredict: a non substancial structure was found!
[...]
Cluster 5: 0.1128, veredict: a non substancial structure was found!
Cluster 6: 0.0512, veredict: a non substancial structure was found!
Mean: 0.1795, veredict: a non substancial structure was found!

For k = 8
Cluster 0: 0.1002, veredict: a non substancial structure was found!
[...]
Cluster 6: 0.0512, veredict: a non substancial structure was found!
Cluster 7: -0.0444, veredict: a non substancial structure was found!
Mean: 0.1572, veredict: a non substancial structure was found!

For k = 9
Cluster 0: 0.0998, veredict: a non substancial structure was found!
[...]
Cluster 7: -0.0444, veredict: a non substancial structure was found!
Cluster 8: -0.0471, veredict: a non substancial structure was found!
Mean: 0.1443, veredict: a non substancial structure was found!

For k = 10
Cluster 0: 0.1886, veredict: a non substancial structure was found!
[...]
Cluster 8: 0.0169, veredict: a non substancial structure was found!
Cluster 9: 0.1013, veredict: a non substancial structure was found!
Mean: 0.1218, veredict: a non substancial structure was found!

~~ Best K: 4 ~~     <<<<<<<<<<<<<<<<<<<< here
Please manually check your dataset to figure out if this is really the best K

kMeans

Number of iterations: 7
Within cluster sum of squared errors: 85198.41166791727

Initial starting points (random):

Missing values globally replaced with mean/mode

Final cluster centroids:
                          Cluster#
Attribute    Full Data          0          1          2
               (208.0)     (75.0)     (32.0)    (101.0)
Samples       sample_0   sample_0   sample_2   sample_1
gene_0          0.0261     0.0173     0.0336     0.0303
gene_1            2.89     2.9179     2.2744     3.0644
[etc]

Time taken to build model (full training data) : 21.64 seconds

=== Model and evaluation on training set ===

Clustered Instances

0       37 ( 18%)
1       21 ( 10%)
2      109 ( 52%)
3       41 ( 20%)
```


In this scenario, the number of clustered instances will match the best calculated K, in this case, 4.

2. For the Elbow, the best K is not suggested, and this is intentional: since the 'elbow' of the graph is a visual inspection, I cannot determine its best value via code.

In the current KValid code, I use the highest SSE value for Elbow's 'best K', but this is wrong, and this is reflected in the final number of clustered instances, which you should ignore.

In general:
- Silhouette Index: You can more or less trust the best suggested K and the number of clustered instances at the end of the test.
- Elbow: trust only the graph, and re-cluster again according to the best visually inspected K, the 'elbow' of the graph.
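
Both selection rules can be replayed on the numbers posted in this thread. The snippet below is only an illustration of the behaviour described above, not KValid's source: it takes the mean Silhouette per k from the 208-instance run and the SSE per k from the original report (two different runs on different data), then applies "highest mean Silhouette" and "highest SSE" respectively.

```python
# Mean Silhouette per k, copied from the Silhouette run above.
mean_silhouette_by_k = {
    3: 0.1006, 4: 0.1861, 5: 0.1559, 6: 0.1842,
    7: 0.1795, 8: 0.1572, 9: 0.1443, 10: 0.1218,
}

# SSE per k, copied from the Elbow run in the original report.
sse_by_k = {
    3: 307911.9827729013, 4: 284224.5253161817,
    5: 265002.0829083235, 6: 271168.4306434722,
    7: 255451.7226422783, 8: 248347.4649960015,
    9: 245387.35604437318, 10: 240754.6914872431,
}

# Silhouette: the k with the highest mean is a reasonable suggestion.
best_k_silhouette = max(mean_silhouette_by_k, key=mean_silhouette_by_k.get)

# Elbow: picking the highest SSE just favours the smallest k in the range,
# which is why the final model in the report had 3 clusters.
k_with_highest_sse = max(sse_by_k, key=sse_by_k.get)

# What the graph actually shows: how much SSE drops at each step.
drops = {k: sse_by_k[k - 1] - sse_by_k[k] for k in sorted(sse_by_k) if k - 1 in sse_by_k}

print("Silhouette suggestion:", best_k_silhouette)  # 4
print("k with highest SSE:", k_with_highest_sse)    # 3
for k, drop in drops.items():
    print(f"SSE drop from k={k - 1} to k={k}: {drop:.1f}")
```

The drop from k=5 to k=6 is negative, meaning SSE went up there, which is consistent with reading the elbow at k=5 from the graph.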

I hope this helps to clarify your questions.

yuliaUU commented 6 months ago

Yes! Thank you a lot for the explanations!

Theldus / KValid

output of k-means and SSE does not match final number of clusters chosen #3
