RedHatInsights / aiops-insights-clustering

Clustering of systems
GNU General Public License v3.0

Evaluation of k-means for clustering #18

Open Ladas opened 5 years ago

Ladas commented 5 years ago

Plotting SSE and the silhouette coefficient (the silhouette coefficient is multiplied by 3M so both curves fit in one chart):

image

image
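For reference, a minimal sketch of how curves like these can be produced with scikit-learn (the feature matrix `X`, the range of k values, and the plotting details are illustrative, not taken from the actual notebook):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# X is the feature matrix of systems (assumed to exist, e.g. the encoded rule data)
ks = range(2, 50)
sse, silhouette = [], []
for k in ks:
    km = KMeans(n_clusters=k, random_state=42).fit(X)
    sse.append(km.inertia_)                            # sum of squared distances to cluster centers
    silhouette.append(silhouette_score(X, km.labels_))

# scale the silhouette coefficient so both curves fit in one chart, as in the plots above
plt.plot(ks, sse, label="SSE")
plt.plot(ks, [s * 3e6 for s in silhouette], label="silhouette coefficient * 3M")
plt.xlabel("number of clusters")
plt.legend()
plt.show()
```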


Looking at how much variance is in the features, we can see that around 245 principal components hold 99% of the variance.

image
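A minimal sketch of how that 99% threshold can be found (again assuming scikit-learn and the same feature matrix `X`):

```python
import numpy as np
from sklearn.decomposition import PCA

# X is the same (assumed) feature matrix as above
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components_99 = int(np.argmax(cumulative >= 0.99)) + 1  # first component count reaching 99%
print(n_components_99)  # ~245 on this data set, according to the chart above
```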


Let's look at SSE and the silhouette coefficient when keeping only 245 components after the PCA transformation:

image

image
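A sketch of the same evaluation run on the PCA-reduced data (the variable names and the k range are illustrative):

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

# reduce the (assumed) feature matrix X to the 245 components holding 99% of the variance
X_reduced = PCA(n_components=245).fit_transform(X)

for k in range(2, 50):
    km = KMeans(n_clusters=k, random_state=42).fit(X_reduced)
    print(k, km.inertia_, silhouette_score(X_reduced, km.labels_))
```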


Result: the maximum silhouette coefficient is still only around 0.43, reached with just 4 clusters, which has a bad SSE. For larger cluster counts it stays around 0.3.

What sources say about the silhouette coefficient:

| Silhouette coefficient | Interpretation |
| --- | --- |
| 0.71-1.00 | A strong structure has been found |
| 0.51-0.70 | A reasonable structure has been found |
| 0.26-0.50 | The structure is weak and could be artificial |
| < 0.25 | No substantial structure has been found |

This points to "The structure is weak and could be artificial", meaning we should try a different clustering method, since k-means seems to score poorly here.

Ladas commented 5 years ago

cc @durandom @tumido @MichaelClifford does this make sense to you? The silhouette coefficient suggests that k-means is probably not a good method in this case.

durandom commented 5 years ago

Maybe this is also because of the input data we are using. I've written up the proposed next steps here.

durandom commented 5 years ago

Is this a measurement you could integrate into the metric_tracking package? What's the input of the silhouette coefficient? Just the clusters? Could you add the notebook or code used to produce this?

Ladas commented 5 years ago

@durandom yes, I plan to add it in https://github.com/RedHatInsights/aicoe-insights-clustering/pull/14/files#diff-d0301332bd6fef353ec35837646aa49e once it is merged. Also, I'll need to figure out how to run it as separate jobs; right now it takes hours to compute one day of data on 4 cores.

I'm using the rule data from 2018-09-05 as input.

durandom commented 5 years ago

> Also, I'll need to figure out how to run it as separate jobs; right now it takes hours to compute one day of data on 4 cores.

That's what the upshift environment is meant for. Create a build config and let it run there.

tumido commented 5 years ago

@Ladas thank you for the metrics! It makes total sense and just proves that we all know, that we know nothing. :smile:

There are probably multiple factors in effect.

I think this is the kind of insight into the clustering Marcel was looking for and these graphs would make it easier to compare different solutions. I'd love to see them as a part of the metrics tracking thingy Marcel's team is working on. :+1:

Ladas commented 5 years ago

FYI, it seems like DBSCAN also performs badly; we'll need to munge the data before clustering.

image

image
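A sketch of how DBSCAN can be evaluated the same way (eps and min_samples are placeholder values that would need tuning for the rule data; noise points labelled -1 are excluded because the silhouette coefficient is undefined for them):

```python
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

# X_reduced is the PCA-reduced feature matrix from above (assumed);
# eps and min_samples are placeholders, not tuned values
db = DBSCAN(eps=0.5, min_samples=5).fit(X_reduced)
labels = db.labels_

mask = labels != -1                      # -1 marks noise points
if len(set(labels[mask])) > 1:           # silhouette needs at least 2 clusters
    print(silhouette_score(X_reduced[mask], labels[mask]))
```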

durandom commented 5 years ago

cc @TreeinRandomForest