PyDataBlog / ParallelKMeans.jl

Parallel & lightning fast implementation of available classic and contemporary variants of the KMeans clustering algorithm
MIT License

Metric duck typing for Yin Yang #91

Closed. cstich closed this 4 years ago

cstich commented 4 years ago

Elkan, Lloyd, and Hamerly all have duck typing for their metric argument, whereas Yin Yang only accepts Euclidean as a distance metric. This tiny pull request brings the API for Yin Yang in line with Elkan and the others.

I am aware that strictly speaking the convergence of KMeans is only guaranteed for the Euclidean distance, but if there is a reason for only allowing Euclidean for Yin Yang and not the others, that is not clear to me.

Arkoniak commented 4 years ago

The main reason we have this limitation for YinYang is that, in its current state, it will produce wrong results for any non-Euclidean metric. The main internal functions, such as chunk_update_centroids or point_all_centers!, have squared Euclidean logic embedded in them; see, for example, https://github.com/PyDataBlog/ParallelKMeans.jl/blob/master/src/yinyang.jl#L268

The metric argument was added purely for compatibility with the other algorithms, but properly removing the Euclidean restriction would require substantial refactoring of the algorithm as well.
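To illustrate the kind of failure described above, here is a minimal sketch in Python (not the package's Julia code). It assumes a routine that accepts a metric argument for API compatibility but computes squared Euclidean distances internally. The function names are hypothetical and do not correspond to anything in ParallelKMeans.jl.

```python
def sq_euclidean(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cityblock(a, b):
    """Manhattan (city block) distance between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))


def nearest_center(point, centers, metric=sq_euclidean):
    """Duck-typed version: any callable metric(a, b) is honoured."""
    return min(range(len(centers)), key=lambda i: metric(point, centers[i]))


def nearest_center_hardcoded(point, centers, metric=sq_euclidean):
    """Hypothetical buggy version: accepts `metric` but ignores it,
    because squared Euclidean logic is embedded in the body."""
    return min(range(len(centers)), key=lambda i: sq_euclidean(point, centers[i]))


point = (0.0, 0.0)
centers = [(3.0, 0.0), (2.0, 2.0)]

# Squared Euclidean: 9 vs 8, so center 1 is nearest.
# City block:        3 vs 4, so center 0 is nearest.
generic = nearest_center(point, centers, cityblock)
hardcoded = nearest_center_hardcoded(point, centers, cityblock)
```

With the city block metric the two functions disagree on the assignment, which is why simply relaxing the type of the metric argument would not be enough without refactoring the internals.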

Thanks for bringing this to our attention. I opened #92.

cstich commented 4 years ago

That makes sense. Thanks for the explanation.