KMeans Optimization: Incorporating Weights into Re-Clustering Process

Hello,

As discussed in this topic on Dask's forum, my colleague and I compared the dask-ml implementation of the KMeans class with our own implementation in a distributed environment. During the comparison, we observed that the dask-ml initialization does not appear to use weights when re-clustering the candidate centroids.

In the current dask-ml KMeans implementation, the centroid re-clustering step uses the standard (unweighted) KMeans algorithm. In contrast, our implementation uses weights in two places:

1. KMeans++ initialization.
2. Weighted average during centroid re-clustering.
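
To make the two points above concrete, here is a minimal NumPy sketch of what we mean by weighted re-clustering. It is only an illustration: the function names `weighted_kmeans_pp` and `weighted_lloyd` are hypothetical, it is not the code from our repository, and it does not use or reproduce any dask-ml API. It assumes each candidate centroid carries a strictly positive weight (for example, the number of data points closest to it).

```python
import numpy as np


def weighted_kmeans_pp(points, weights, k, rng):
    """Pick k initial centers; sampling probability is weight * squared distance."""
    n = points.shape[0]
    # First center: sampled proportionally to the weights alone.
    first = rng.choice(n, p=weights / weights.sum())
    centers = [points[first]]
    d2 = np.sum((points - points[first]) ** 2, axis=1)
    for _ in range(1, k):
        probs = weights * d2
        total = probs.sum()
        if total == 0:
            # Every point coincides with a chosen center: fall back to weights.
            probs = weights / weights.sum()
        else:
            probs = probs / total
        idx = rng.choice(n, p=probs)
        centers.append(points[idx])
        # Keep, for each point, the squared distance to its nearest center.
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.array(centers, dtype=float)


def weighted_lloyd(points, weights, centers, n_iter=10):
    """Lloyd iterations where each center is the weighted mean of its points."""
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(centers.shape[0]):
            mask = labels == j
            if mask.any():  # leave a center unchanged if no point is assigned
                centers[j] = np.average(
                    points[mask], axis=0, weights=weights[mask]
                )
    return centers


# Toy candidates: two groups on the x-axis, with weights (hypothetical data).
candidates = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
weights = np.array([3.0, 1.0, 1.0, 3.0])

init = weighted_kmeans_pp(candidates, weights, k=2, rng=np.random.default_rng(0))

# Starting from one center per group, the weighted means are x = 0.25 and
# x = 10.75; an unweighted mean would give x = 0.5 and x = 10.5 instead.
refined = weighted_lloyd(candidates, weights, [[0.0, 0.0], [11.0, 0.0]])
```

The toy example shows why the weights matter: heavily weighted candidates pull the final centroids toward themselves, which an unweighted re-clustering step would ignore.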

Although our implementation is slower than dask-ml in terms of execution time, we achieved better results when clustering a blob dataset. We believe this is due to a reduction in the number of clustering iterations rather than to any direct code optimization.

If you're interested, feel free to review our repository for further details on our approach:
GitHub Repository.

Thank you for considering this issue.

Best regards,
Chiara

dask/dask-ml, issue #1001