Open yuliaUU opened 6 months ago
I think I am not correctly interpreting what `cascade` does and what `numClasses` does:
if `cascade` is False and I set `numClass` to 5, then the method just computes SSE for k=5.
However, when `cascade` is True and I set `numClass` to 5, `numClass` is ignored, and SSE is calculated for all k in the range `minimumK` - `maximumK`. But in this case I don't understand how the optimal number of clusters is chosen (as the chosen k had the highest SSE).
more digging from me:
So these two metrics measure different aspects:
CH Criterion (VRC):
SSE:
Hello @yuliaUU, Thank you for giving KValid a chance...
However... it's been 6 or 7 years since I last touched this code, and I've basically forgotten even how to use Weka. I'm trying to replicate your steps using the dataset you mentioned, but I can't even import the CSV; I'm getting `Attribute names are not unique! Causes: 'call', 'call', 'call'`
...
Please, tell me:
The file is too big to upload here, so I made a zip file with a subset of the data: https://github.com/yuliaUU/data/blob/main/data.zip
Weka 3.8.6 (the newest stable version); Java was downloaded automatically with it.
Hi @yuliaUU, thanks for the additional data,
This plugin is based on SimpleKMeans, and when `cascade` is disabled, you are basically only using Weka's SimpleKMeans. So in general, if you want to use this plugin, you need to keep `cascade` enabled.
When `cascade` is activated, SimpleKMeans is run for each `k` value in the configured range, and the Elbow or Silhouette is calculated for each of them.
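As a rough illustration of that cascade loop (not KValid's actual Java code), here is a minimal pure-Python sketch: a plain Lloyd's k-means is run once per k in the range, and the SSE is recorded for each run. The function names, the toy data, and the seed are all assumptions made for this example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=10, max_iter=500):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = None
    for _ in range(max_iter):
        new_assign = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_assign == assign:  # converged: assignments stopped changing
            break
        assign = new_assign
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign

def sse(points, centroids, assign):
    """Within-cluster sum of squared errors."""
    return sum(sq_dist(p, centroids[a]) for p, a in zip(points, assign))

def cascade(points, min_k, max_k, seed=10):
    """Run k-means for every k in [min_k, max_k] and record the SSE of each run."""
    results = {}
    for k in range(min_k, max_k + 1):
        centroids, assign = kmeans(points, k, seed)
        results[k] = sse(points, centroids, assign)
    return results

# Toy data (invented): three noisy 2-D blobs of 20 points each.
rng = random.Random(0)
centers = [(0, 0), (5, 5), (0, 5)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in centers for _ in range(20)]

scores = cascade(points, 2, 6)
for k, value in scores.items():
    print(k, round(value, 2))
```

With `cascade` disabled, the equivalent would be a single `kmeans(points, k)` call for the one k you set.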
Now, there's a set of things going on, some you might consider bugs and some you might not:
=== Run information ===
Scheme: weka.clusterers.KValid -init 0 -N 2 -A "weka.core.EuclideanDistance -R first-last" -I 500 -validation 0 -cascade -minK 3 -maxK 10 -show-graph -S 10
Relation: data
Instances: 208
Attributes: 16384
[list of attributes omitted]
Test mode: evaluate on training data
=== Clustering model (full training set) ===
=== Clustering validation, using: Silhouette Index ===
For k = 3
  Cluster 0: 0.1282, veredict: a non substancial structure was found!
  Cluster 1: 0.0427, veredict: a non substancial structure was found!
  Cluster 2: 0.1309, veredict: a non substancial structure was found!
  Mean: 0.1006, veredict: a non substancial structure was found!

For k = 4
  Cluster 0: 0.2968, veredict: weak structure!
  [...]
  Cluster 3: 0.2723, veredict: weak structure!
  Mean: 0.1861, veredict: a non substancial structure was found!

For k = 5
  Cluster 0: 0.1401, veredict: a non substancial structure was found!
  [...]
  Cluster 4: 0.2885, veredict: weak structure!
  Mean: 0.1559, veredict: a non substancial structure was found!

For k = 6
  Cluster 0: 0.1401, veredict: a non substancial structure was found!
  [...]
  Cluster 4: 0.2875, veredict: weak structure!
  Cluster 5: 0.1340, veredict: a non substancial structure was found!
  Mean: 0.1842, veredict: a non substancial structure was found!

For k = 7
  Cluster 0: 0.1213, veredict: a non substancial structure was found!
  [...]
  Cluster 5: 0.1128, veredict: a non substancial structure was found!
  Cluster 6: 0.0512, veredict: a non substancial structure was found!
  Mean: 0.1795, veredict: a non substancial structure was found!

For k = 8
  Cluster 0: 0.1002, veredict: a non substancial structure was found!
  [...]
  Cluster 6: 0.0512, veredict: a non substancial structure was found!
  Cluster 7: -0.0444, veredict: a non substancial structure was found!
  Mean: 0.1572, veredict: a non substancial structure was found!

For k = 9
  Cluster 0: 0.0998, veredict: a non substancial structure was found!
  [...]
  Cluster 7: -0.0444, veredict: a non substancial structure was found!
  Cluster 8: -0.0471, veredict: a non substancial structure was found!
  Mean: 0.1443, veredict: a non substancial structure was found!

For k = 10
  Cluster 0: 0.1886, veredict: a non substancial structure was found!
  [...]
  Cluster 8: 0.0169, veredict: a non substancial structure was found!
  Cluster 9: 0.1013, veredict: a non substancial structure was found!
  Mean: 0.1218, veredict: a non substancial structure was found!
~~ Best K: 4 ~~ <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< here Please manually check your dataset to figure out if this is really the best K
Number of iterations: 7
Within cluster sum of squared errors: 85198.41166791727
Initial starting points (random):
Missing values globally replaced with mean/mode
Samples  sample_0  sample_0  sample_2  sample_1
gene_0   0.0261    0.0173    0.0336    0.0303
gene_1   2.89      2.9179    2.2744    3.0644
[etc]
Time taken to build model (full training data) : 21.64 seconds
=== Model and evaluation on training set ===
Clustered Instances
0   37 ( 18%)
1   21 ( 10%)
2  109 ( 52%)
3   41 ( 20%)
In this scenario, the number of clustered instances will match the best calculated K, in this case, 4.
2. For the Elbow, the best K is not suggested, and this is intentional: since the 'elbow' of the graph is found by visual inspection, I cannot determine its best value via code.
In the current KValid code, I use the highest SSE value for Elbow's 'best K', but this is wrong, and this is reflected in the final number of clustered instances, which you should ignore.
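To see why "highest SSE" is the wrong rule, here is a small sketch with invented SSE values (they are not taken from this issue). SSE normally falls as k grows, so the highest SSE simply lands on the smallest k in the range. The second-difference heuristic shown next to it is only one common approximation of the elbow, an assumption of this example rather than anything KValid does, and it is no substitute for looking at the graph.

```python
# Hypothetical SSE values per k, invented purely for illustration.
sse_by_k = {3: 980.0, 4: 610.0, 5: 340.0, 6: 310.0, 7: 290.0, 8: 275.0}

# Rule described above: take the k with the highest SSE.
# On a curve that decreases with k, this is always the smallest k tested.
k_max_sse = max(sse_by_k, key=sse_by_k.get)

def elbow_by_second_difference(sse):
    """Heuristic elbow: the k where the SSE drop slows down the most.

    Assumes at least three k values; the first and last k can never be chosen.
    """
    ks = sorted(sse)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        # Drop gained on reaching k, minus the drop gained by going one step further.
        bend = (sse[prev_k] - sse[k]) - (sse[k] - sse[next_k])
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

print("highest-SSE rule picks k =", k_max_sse)                            # 3
print("second-difference elbow  k =", elbow_by_second_difference(sse_by_k))  # 5
```

With these made-up numbers the highest-SSE rule returns the minimum k (3), while the bend in the curve sits at k = 5.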
In general:
- Silhouette Index: You can more or less trust the best suggested K and the number of clustered instances at the end of the test.
- Elbow: trust only the graph, and re-cluster again according to the best visually inspected K, the 'elbow' of the graph.
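For the Silhouette run shown earlier, the suggested K is consistent with simply taking the k whose mean silhouette is highest. A short sketch using the "Mean" values copied from that log (the variable names are my own):

```python
# Mean silhouette per k, copied from the "Mean:" lines of the run above.
mean_silhouette = {
    3: 0.1006, 4: 0.1861, 5: 0.1559, 6: 0.1842,
    7: 0.1795, 8: 0.1572, 9: 0.1443, 10: 0.1218,
}

# The best K is the one with the highest mean silhouette.
best_k = max(mean_silhouette, key=mean_silhouette.get)
print(best_k)  # 4

# Rank the candidates to see how close the runners-up are.
ranked = sorted(mean_silhouette, key=mean_silhouette.get, reverse=True)
print(ranked[:3])  # [4, 6, 7]
```

Note that k = 4 (0.1861) and k = 6 (0.1842) are nearly tied, which is one reason the output asks you to check the dataset manually.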
I hope this helps to clarify your questions.
Yes! Thanks a lot for the explanations!
I am testing the KValid with the following dataset: https://www.kaggle.com/datasets/crawford/gene-expression
My settings: the output:
So based on that, k should be 5, but what I see at the end is 3 clusters.
Elbow plot also shows optimal number of clusters to be 5.
Also, when I set `cascade=false`, no graph shows up, but the optimal number of clusters is determined appropriately: