bdelespierre / php-kmeans

PHP K-Means
MIT License
91 stars, 41 forks

Get total variation (Elbow method) #31

Open bdelespierre opened 3 years ago

bdelespierre commented 3 years ago

To help find the best value for K (the number of clusters), it would be nice to get the total variation of a result: a measure of how far clustered points lie from their cluster's centroid, typically the sum of squared distances.
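As a rough illustration of the quantity being asked for, here is a minimal sketch that sums the squared Euclidean distances from each point to the mean of its cluster. The function name `totalVariation()` and the plain nested-array input are assumptions made for this example; they are not part of php-kmeans, and the library's eventual `getTotalVariance()` may well be defined differently.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, NOT part of php-kmeans: sums the squared Euclidean
 * distances from every point to the mean (centroid) of its own cluster.
 *
 * @param array<int, array<int, array<int, float>>> $clusters list of clusters,
 *        each cluster being a 0-indexed list of points
 */
function totalVariation(array $clusters): float
{
    $total = 0.0;

    foreach ($clusters as $points) {
        if ($points === []) {
            continue; // an empty cluster contributes nothing
        }

        // Centroid = coordinate-wise mean of the cluster's points.
        $size = count($points);
        $centroid = array_fill(0, count($points[0]), 0.0);
        foreach ($points as $point) {
            foreach ($point as $i => $coordinate) {
                $centroid[$i] += $coordinate / $size;
            }
        }

        // Accumulate the squared distance of each point to that centroid.
        foreach ($points as $point) {
            foreach ($point as $i => $coordinate) {
                $total += ($coordinate - $centroid[$i]) ** 2;
            }
        }
    }

    return $total;
}

$clusters = [
    [[0.0, 0.0], [2.0, 0.0]],     // centroid (1, 0): contributes 1 + 1
    [[10.0, 10.0], [10.0, 14.0]], // centroid (10, 12): contributes 4 + 4
];

echo totalVariation($clusters), PHP_EOL; // prints 10
```

Plotting this value for increasing K is what the elbow method then inspects; the value normally shrinks as K grows, and the "elbow" is where the decrease flattens out.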

Inspired by https://www.youtube.com/watch?v=4b5d3muPQmA

Also see https://en.wikipedia.org/wiki/Elbow_method_(clustering)

I also believe the current v3 implementation of RandomInitialization is wrong :man_shrugging:

Proposed change

$result = (new Kmeans\Algorithm($init))->clusterize($points, $K);
echo $result->getTotalVariance();

bdelespierre commented 3 years ago

See also https://stackoverflow.com/questions/6645895/calculating-the-percentage-of-variance-measure-for-k-means

battlecook commented 3 years ago

The elbow method is quite expensive to implement. getTotalVariance() is the right building block for it, but a full implementation needs more than that. K-means produces different results depending on the initial centroid positions, so the elbow position can differ from run to run. We would also need a policy for averaging the elbow across runs.
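One possible policy for the run-to-run variability described above is to repeat the clustering several times per K and average the total variation before looking for an elbow. The sketch below assumes a hypothetical `averageVariationPerK()` helper and takes the clustering step as a callable, so it does not rely on any particular php-kmeans API; the `$fakeRun` closure is a made-up stand-in, not real clustering output.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper, NOT part of php-kmeans.
 *
 * @param callable(int): float $variationForK runs one clustering with K clusters
 *                                            and returns its total variation
 * @param int[]                $candidates    values of K to try
 * @param int                  $runs          number of repetitions per K
 *
 * @return array<int, float> mean total variation, keyed by K
 */
function averageVariationPerK(callable $variationForK, array $candidates, int $runs): array
{
    if ($runs < 1) {
        throw new InvalidArgumentException('$runs must be at least 1');
    }

    $averages = [];
    foreach ($candidates as $k) {
        $sum = 0.0;
        for ($run = 0; $run < $runs; $run++) {
            // Each call is expected to start from a fresh random initialization.
            $sum += $variationForK($k);
        }
        $averages[$k] = $sum / $runs;
    }

    return $averages;
}

// Stand-in for a real clustering run: an invented curve plus random noise.
$fakeRun = static fn (int $k): float => 100.0 / $k + mt_rand(0, 100) / 100;

foreach (averageVariationPerK($fakeRun, [1, 2, 3, 4, 5], 20) as $k => $mean) {
    printf("K=%d mean variation=%.2f\n", $k, $mean);
}
```

Averaging the curve is only one option; taking the minimum per K, or locating an elbow per run and then aggregating those positions, would be equally defensible policies.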

bdelespierre commented 3 years ago

Within this ticket's scope, calculating the elbow point is someone else's problem. We're just providing the variance here :wink:

battlecook commented 3 years ago

Oh, that's right. I understand now. Great.