ThinkBigAnalytics / pyspark-distributed-kmodes

MIT License

A mistake in modes' initialisation leading to incorrect incremental k-modes implementation #1

Open asapegin opened 6 years ago

asapegin commented 6 years ago

https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes/blob/98b27d710380707983b3f57348b9255d5b33bb30/pyspark_kmodes/pyspark_kmodes.py#L253

On line 253, modes are initialised by randomly selecting records from the whole RDD (and not from the corresponding partition).

As a result, on line 96, the random modes initialised from the whole dataset are used instead of modes initialised for the particular partition.

In most cases, this results in undefined behaviour of the algorithm, i.e. in attempts to use records from one partition with a mode from another partition. In turn, this leads to empty clusters, which should never happen in incremental k-means/k-modes (see https://stackoverflow.com/questions/43445863/k-meansupdaing-centroids-successively-after-each-addition for details). In the code, this is worked around by the function "check_for_empty_cluster(clusters, rdd)", which replaces the mode of an empty cluster with a random one in the middle of the k-modes execution.

All in all, this leads to incorrect clustering results.
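For illustration only, here is a minimal sketch (not the project's code) of what per-partition initialisation could look like, assuming an RDD of categorical records and a hypothetical `init_modes_per_partition` helper:

```python
import random

# Hypothetical illustration: draw n_clusters initial modes from *each*
# partition, instead of sampling them from the whole RDD as on line 253.
def init_modes_per_partition(rdd, n_clusters, seed=42):
    def sample_partition(index, records):
        records = list(records)
        rng = random.Random(seed + index)
        # Guard against partitions that hold fewer records than n_clusters.
        k = min(n_clusters, len(records))
        # Keep the partition index so each partition gets its own modes.
        yield index, rng.sample(records, k)

    # Result: {partition_index: [mode_1, ..., mode_k]}
    return dict(rdd.mapPartitionsWithIndex(sample_partition).collect())
```

With such an initialisation, every partition is clustered against modes drawn from its own records, so no mode can "belong" to a partition that never sees it.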

cinqs commented 5 years ago

I don't see why this would lead to "the" wrong clustering results, though I agree that the initial centroids for each partition should be sampled from that partition.

I agree mainly because I think the dataset could be distributed abnormally / unevenly across partitions.

But think twice: random sampling means there can always be a sample that results in an empty cluster, even in k-means. check_for_empty_cluster is a fair enough mechanism for keeping enough clusters, and the updated clusters will find a better centroid for every record in every partition.

cinqs commented 5 years ago

This repo looks dead to me, but there is still more work to do: like you said, sampling from every partition separately. There is also a fatal bug in the code of check_for_empty_cluster. I have also worked on adding weights to features when computing the Hamming distance, and on using edit distance for features where needed, like strings.
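For reference, a rough sketch of the kind of weighted dissimilarity described above, assuming categorical records as tuples and caller-supplied per-feature weights; the function name and the use of difflib's similarity ratio as a stand-in for edit distance are illustrative assumptions, not code from this repo:

```python
import difflib

def weighted_dissimilarity(record, mode, weights, string_features=()):
    """Hypothetical weighted dissimilarity between a record and a mode.

    weights         -- one weight per feature
    string_features -- indices of features compared with a normalised string
                       similarity instead of a 0/1 Hamming mismatch
    """
    total = 0.0
    for j, (x, z, w) in enumerate(zip(record, mode, weights)):
        if j in string_features:
            # 0.0 for identical strings, approaching 1.0 for very different
            # ones (difflib's ratio used as a simple stand-in for edit distance).
            total += w * (1.0 - difflib.SequenceMatcher(None, str(x), str(z)).ratio())
        else:
            total += w * (0.0 if x == z else 1.0)
    return total
```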

You could fork it somewhere; maybe we could contribute something to it again.

asapegin commented 5 years ago

Well, basically, it may still return some clusters, but this will not be k-means or k-modes anymore. As far as I know, this new kind of algorithm was never scientifically proven / checked to produce reasonable clustering results, so I would treat it as wrong. At the very least, it should not be called k-modes anymore.

asapegin commented 5 years ago

Thank you for your suggestion to fork it. Actually, I have already refactored the code a lot and added more distance functions, for example a frequency-based dissimilarity as described in "Improving K-Modes Algorithm Considering Frequencies of Attribute Values in Mode" by Dong et al. However, as my work is subject to an NDA, unfortunately no decision has been made yet on whether I can publish it.
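For readers unfamiliar with that family of measures, a rough sketch of one way a frequency-based dissimilarity can be computed, where a match counts as "closer" when the mode's value is frequent in its cluster; the exact formula in Dong et al. may differ, and value_counts and cluster_size are hypothetical inputs:

```python
def frequency_based_dissimilarity(record, mode, value_counts, cluster_size):
    """Rough sketch of a frequency-based dissimilarity for k-modes.

    value_counts[j][v] -- how many records in the mode's cluster currently
                          have value v for attribute j (hypothetical input)
    cluster_size       -- number of records currently in that cluster
    """
    total = 0.0
    for j, (x, z) in enumerate(zip(record, mode)):
        if x == z:
            # A match on a frequent mode value contributes less than a match
            # on a rare one; a mismatch always contributes 1.
            total += 1.0 - value_counts[j].get(z, 0) / max(cluster_size, 1)
        else:
            total += 1.0
    return total
```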

cinqs commented 5 years ago

> Thank you for your suggestion to fork it. Actually, I have already refactored the code a lot and added more distance functions, for example a frequency-based dissimilarity as described in "Improving K-Modes Algorithm Considering Frequencies of Attribute Values in Mode" by Dong et al. However, as my work is subject to an NDA, unfortunately no decision has been made yet on whether I can publish it.

Well, I think I would read the paper you mentioned, and refactor and push...

hughshaoqz commented 5 years ago

Hi asapegin, thank you for pointing out the problem. I solved it by implementing a K_mode_partition method so that the model follows each step from the paper (Ensemble based Distributed K-Modes Clustering).

Briefly, what I did is: I first randomly selected initial clusters in each partition and then ran k-modes within each partition. After that, I ran a local k-modes on the master node over the cluster centres collected from each partition.
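A condensed sketch of that two-stage ensemble scheme (not the code from this repo), assuming a hypothetical local_kmodes(records, k) helper that runs plain k-modes on an in-memory list and returns its k modes:

```python
def ensemble_kmodes(rdd, n_clusters, local_kmodes):
    """Two-stage ensemble k-modes sketch.

    Stage 1: run k-modes independently inside each partition.
    Stage 2: run k-modes once more, on the driver, over all partition modes.
    """
    # Stage 1: each partition clusters its own records and emits its modes.
    def cluster_partition(records):
        records = list(records)
        if records:
            yield local_kmodes(records, min(n_clusters, len(records)))

    partition_modes = rdd.mapPartitions(cluster_partition).collect()

    # Stage 2: flatten the roughly n_clusters * num_partitions modes and
    # cluster them on the driver to obtain the final n_clusters modes.
    all_modes = [mode for modes in partition_modes for mode in modes]
    return local_kmodes(all_modes, n_clusters)
```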

The problem is that it works fine with one master node and two worker nodes (silhouette score above 0.7) but gets much worse if I choose more than two worker nodes (silhouette score below 0.3). Do you know why this happens? Thanks in advance.

asapegin commented 5 years ago

Do you also increase the number of partitions when you have more than 2 worker nodes? If yes, then you will end up with more "clusters" (n_clusters for each partition; e.g., if n_clusters = 10, with 2 partitions you will have 20 clusters before the local k-modes, and with 10 partitions you will have 100), and the silhouette score after the local k-modes may indeed become smaller, because the local k-modes has to find n_clusters clusters among n_clusters * partitions points. More partitions = more points = maybe(!) a lower silhouette score.
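In code terms, the number of points the final local k-modes stage has to cluster grows linearly with the partition count (a small illustration using the standard RDD API; the helper name is made up):

```python
def num_points_for_local_kmodes(rdd, n_clusters):
    # Each partition contributes n_clusters local modes, so the final local
    # k-modes stage has to cluster n_clusters * num_partitions points.
    return n_clusters * rdd.getNumPartitions()

# e.g. n_clusters = 10: 2 partitions -> 20 points, 32 partitions -> 320 points.
```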

hughshaoqz commented 5 years ago

Thank you so much for the reply! I did not increase the number of partitions; what I did is fix it to the same number (32) using the coalesce and repartition methods. Actually, I also tried increasing the number of partitions, and the result was again very bad. I also tried many different hardware settings and found: 1) with 1 worker node, no matter what hardware I choose, the result is always good; 2) with 2 worker nodes, only m4.2xlarge hardware gives the perfect result; 3) with more than 2 worker nodes, I never get the perfect result... Do you by any chance know why this happens?
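For context, a minimal illustration of pinning the partition count as described above, assuming an existing RDD named rdd:

```python
# Pin the RDD to a fixed number of partitions before clustering.
rdd_fixed = rdd.repartition(32)   # full shuffle; can increase or decrease partitions
# or, when only reducing the partition count, the cheaper variant without a shuffle:
rdd_fixed = rdd.coalesce(32)
```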

asapegin commented 5 years ago

Hi all! As I mentioned earlier, I have already performed a major refactoring of the code, fixing this and many other issues, as well as adding more distance functions. I have now received permission to publish the updated and fixed code; please feel free to look at https://github.com/asapegin/pyspark-kmetamodes. @hughshaoqz, you can now try my implementation and see if it solves your problem. @cinqs, you are welcome to join me as a co-developer of https://github.com/asapegin/pyspark-kmetamodes and contribute to it.