Hello,

As discussed in this topic on Dask's forum, my colleague and I compared the dask-ml implementation of the KMeans class with our own implementation in a distributed environment. During the comparison, we observed that the dask-ml initialization doesn't appear to use weights during the centroid re-clustering phase.
In the current dask-ml KMeans implementation, the standard, unweighted KMeans algorithm is used for centroid re-clustering. In contrast, we incorporated weights into two areas (a minimal sketch follows the list):

1. KMeans++ initialization.
2. A weighted average during centroid re-clustering.
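To make those two steps concrete, here is a minimal, non-distributed sketch in plain NumPy. The function names and the `sample_weight` argument (borrowed from scikit-learn's naming convention) are illustrative only; the actual distributed implementation is in the repository linked below.

```python
import numpy as np


def weighted_kmeans_plusplus(X, sample_weight, n_clusters, rng):
    """KMeans++ seeding where each draw is proportional to
    sample_weight * D(x)^2 (and to sample_weight alone for the
    first centroid)."""
    n_samples = X.shape[0]
    # First centroid: sampled proportionally to the weights alone.
    p = sample_weight / sample_weight.sum()
    centers = [X[rng.choice(n_samples, p=p)]]
    for _ in range(1, n_clusters):
        # Squared distance from each point to its nearest chosen centroid.
        diff = X[:, None, :] - np.asarray(centers)[None, :, :]
        d2 = (diff ** 2).sum(axis=-1).min(axis=1)
        p = sample_weight * d2
        p = p / p.sum()
        centers.append(X[rng.choice(n_samples, p=p)])
    return np.asarray(centers)


def weighted_centroid_update(X, sample_weight, labels, n_clusters):
    """Recompute each centroid as the weighted mean of its members."""
    centers = np.empty((n_clusters, X.shape[1]))
    for k in range(n_clusters):
        mask = labels == k
        # Empty clusters would need re-seeding; omitted for brevity.
        centers[k] = np.average(X[mask], axis=0, weights=sample_weight[mask])
    return centers
```

With `rng = np.random.default_rng(0)`, the first function produces the seeds and the second replaces the plain mean inside the usual Lloyd iteration.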
Although our implementation is less efficient than dask-ml in terms of per-iteration execution time, we achieved better results when clustering a blob dataset, likely due to a reduction in the number of clustering iterations rather than to any direct code optimization.
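For anyone who wants to reproduce the baseline side of the comparison, a rough sketch with the unweighted dask-ml KMeans on synthetic blobs is below. The dataset sizes are placeholders, not the ones from our experiments; comparing `inertia_` and `n_iter_` (attribute names dask-ml mirrors from scikit-learn) against a weighted variant is one way to quantify the difference.

```python
from dask_ml.cluster import KMeans
from dask_ml.datasets import make_blobs

# Synthetic blobs as a chunked dask array; sizes are illustrative.
X, _ = make_blobs(
    n_samples=100_000, n_features=10, centers=8,
    chunks=10_000, random_state=0,
)

# Default distributed initialization in dask-ml is k-means||.
km = KMeans(n_clusters=8, init="k-means||", random_state=0)
km.fit(X)

print("inertia:", km.inertia_, "iterations:", km.n_iter_)
```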
If you're interested, feel free to review our repository for further details on our approach: GitHub Repository.
Thank you for considering this issue.
Best regards,
Chiara