PyDataBlog / ParallelKMeans.jl

Parallel & lightning fast implementation of available classic and contemporary variants of the KMeans clustering algorithm
MIT License

Smart init needs to be thoroughly vetted and tested. Currently buggy and unstable #7

PyDataBlog closed this issue 4 years ago

PyDataBlog commented 4 years ago

Convergence seems to be unstable at different tolerance levels, and the sum-of-squares values jump around a lot across different values of K.

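using Plots
using Clustering
using ParallelKMeans

# 10,000 samples with 30 features each
X = rand(10000, 30);

# Final cost (total sum of squares) for K = 2:10 from each package.
# Clustering.kmeans expects one sample per column, hence the transpose.
@time a = [Clustering.kmeans(X', i).totalcost for i = 2:10];
@time b = [ParallelKMeans.kmeans(X, i, tol=1e-6, verbose=false)[end] for i = 2:10];

plot(a, label="Clustering.jl")
plot!(b, label="ParallelKMeans.jl")

Arkoniak commented 4 years ago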

I think we should seek inspiration here: https://github.com/JuliaStats/Clustering.jl/blob/master/src/seeding.jl#L143
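
For readers following along: assuming the linked lines are the k-means++ seeding in Clustering.jl, the idea can be sketched roughly as below. This is not the Clustering.jl or ParallelKMeans.jl code. The name `kmeanspp_seed_indices` is a hypothetical helper made up for illustration, and it assumes one point per column (a d×n matrix).

```julia
using Random

# Hypothetical helper (not part of either package): returns the column indices
# of k seeds chosen with k-means++ style D^2 sampling. X is d×n.
function kmeanspp_seed_indices(rng::AbstractRNG, X::AbstractMatrix{<:Real}, k::Integer)
    n = size(X, 2)
    1 <= k <= n || throw(ArgumentError("k must be between 1 and the number of points"))

    seeds = Vector{Int}(undef, k)
    seeds[1] = rand(rng, 1:n)  # first seed: uniform at random

    # mindist[j] = squared distance from point j to its nearest chosen seed
    mindist = [sum(abs2, view(X, :, j) .- view(X, :, seeds[1])) for j in 1:n]

    for s in 2:k
        total = sum(mindist)
        if total == 0
            # every point coincides with a seed; fall back to a uniform draw
            seeds[s] = rand(rng, 1:n)
        else
            # draw index j with probability mindist[j] / total
            r = rand(rng) * total
            acc = 0.0
            idx = n  # fallback if rounding keeps acc from exceeding r
            for j in 1:n
                acc += mindist[j]
                if acc > r
                    idx = j
                    break
                end
            end
            seeds[s] = idx
        end
        # refresh nearest-seed distances using the newly added seed
        c = view(X, :, seeds[s])
        for j in 1:n
            d = sum(abs2, view(X, :, j) .- c)
            d < mindist[j] && (mindist[j] = d)
        end
    end
    return seeds
end

# Usage: 500 points in 30 dimensions, 5 seeds, fixed RNG for reproducibility
X = rand(MersenneTwister(42), 30, 500)
seeds = kmeanspp_seed_indices(MersenneTwister(1), X, 5)
```

Passing the RNG explicitly keeps runs repeatable, which makes it easier to tell an unstable init apart from ordinary run-to-run noise.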

PyDataBlog commented 4 years ago

Extensively testing #13 to verify whether it finally fixes this issue.

PyDataBlog commented 4 years ago

#16 should give a more stable init. Further testing is needed to verify.
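
One possible way to do that testing, sketched under assumptions: measure how much the initial cost varies across RNG seeds for a given seeding function. Nothing here calls the ParallelKMeans.jl API. `total_cost`, `init_cost_spread` and `uniform_seeds` are hypothetical names made up for this sketch, and the data is assumed to be d×n (one point per column). Any seeding routine with the signature `seedfn(rng, X, k)` that returns k column indices could be plugged in; uniform random seeding is used below only as a baseline.

```julia
using Random, Statistics

# Sum of squared distances from each column of X to its nearest center column.
function total_cost(X::AbstractMatrix, centers::AbstractMatrix)
    return sum(minimum(sum(abs2, view(X, :, j) .- view(centers, :, c))
                       for c in axes(centers, 2))
               for j in axes(X, 2))
end

# Hypothetical harness: runs `seedfn` with `trials` different RNG seeds and
# reports the mean and standard deviation of the resulting initial cost.
function init_cost_spread(seedfn, X, k; trials = 20)
    costs = [total_cost(X, X[:, seedfn(MersenneTwister(t), X, k)]) for t in 1:trials]
    return (avg = mean(costs), spread = std(costs))
end

# Baseline for comparison: k distinct columns chosen uniformly at random.
uniform_seeds(rng, X, k) = randperm(rng, size(X, 2))[1:k]

X = rand(MersenneTwister(0), 30, 2_000)
for k in 2:5
    println(k, " => ", init_cost_spread(uniform_seeds, X, k))
end
```

A seeding scheme that is more stable than the baseline should show a smaller `spread` at the same `k`, and `avg` should not increase as `k` grows.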