dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

KMeans Optimization: Incorporating Weights into Re-Clustering Process #1001

Open ChiaTrama opened 5 days ago

ChiaTrama commented 5 days ago

Hello,

As discussed in this topic on Dask's forum, my colleague and I compared the dask-ml implementation of the KMeans class with our own implementation in a distributed environment. During the comparison, we observed that the dask-ml initialization doesn't appear to use weights during the centroid re-clustering phase.

In the current dask-ml KMeans implementation, the standard KMeans algorithm is used for centroid re-clustering. In contrast, we incorporated weights into two areas:

Our implementation is slower than dask-ml's in terms of execution time, but it achieved better results when clustering a blob dataset. This is likely due to a reduction in the number of clustering iterations rather than to direct code optimizations.
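To illustrate what weighted re-clustering means, here is a minimal pure-Python sketch. It is not taken from dask-ml or from the linked repository. The function names (`weighted_centroid`, `weighted_lloyd`), the toy data, and the assumption that each candidate's weight is the number of data points nearest to it are illustrative only.

```python
import math


def weighted_centroid(points, weights):
    """Weighted mean of a sequence of equal-length tuples."""
    total = sum(weights)
    dim = len(points[0])
    return tuple(
        sum(w * p[d] for p, w in zip(points, weights)) / total
        for d in range(dim)
    )


def weighted_lloyd(candidates, weights, centers, n_iter=10):
    """Re-cluster weighted candidate centroids with Lloyd iterations.

    candidates: list of tuples (hypothetical oversampled candidate centroids)
    weights:    one non-negative weight per candidate (assumed here to be
                the count of data points closest to that candidate)
    centers:    initial centers, list of tuples
    """
    centers = list(centers)
    for _ in range(n_iter):
        # Assignment step: each candidate goes to its nearest center.
        groups = [[] for _ in centers]
        for p, w in zip(candidates, weights):
            j = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            groups[j].append((p, w))

        # Update step: weighted mean of each group.
        # An empty group keeps its previous center.
        new_centers = []
        for c, g in zip(centers, groups):
            if g:
                pts, ws = zip(*g)
                new_centers.append(weighted_centroid(pts, ws))
            else:
                new_centers.append(c)

        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers


# Toy data: the candidate at 1.0 represents three data points.
candidates = [(0.0,), (1.0,), (10.0,), (11.0,)]
counts = [1, 3, 1, 1]
init = [(0.0,), (10.0,)]

weighted = weighted_lloyd(candidates, counts, init)
unweighted = weighted_lloyd(candidates, [1] * len(candidates), init)
print(weighted)    # [(0.75,), (10.5,)]
print(unweighted)  # [(0.5,), (10.5,)]
```

With weights, the first center is pulled toward the candidate that stands for more data points (0.75 instead of 0.5). Ignoring the weights treats every candidate as equally representative.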

If you're interested, feel free to review our repository for further details on our approach:
GitHub Repository.

Thank you for considering this issue.

Best regards,
Chiara

TomAugspurger commented 1 day ago

Thanks for sharing!