Hello, thank you for creating this repo.
When I used it, I found that the line linked below should be:
partition_index = int(np.floor(index/n_clusters))
because cluster indexes are contiguous within a single partition, so the partition index is obtained by dividing the cluster index by the number of clusters per partition:
cluster_index / n_cluster_in_one_partition
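To illustrate what I mean, here is a minimal sketch of the mapping. The helper name `partition_of` and the example values are mine, not from the repo, and I am assuming that every partition holds the same number of clusters and that cluster indexes start at 0.

```python
import numpy as np

def partition_of(cluster_index, n_clusters):
    # Hypothetical helper: n_clusters is the number of clusters per partition.
    # Cluster indexes are assumed contiguous within a partition, so dividing
    # by n_clusters and flooring gives the partition index.
    return int(np.floor(cluster_index / n_clusters))

# With 3 clusters per partition, clusters 0-2 fall in partition 0
# and clusters 3-5 fall in partition 1.
n_clusters = 3
mapping = [partition_of(i, n_clusters) for i in range(6)]
print(mapping)
```

For non-negative integer indexes, `cluster_index // n_clusters` should give the same result without going through floats.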
Thanks. Should I open a pull request for this?
https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes/blob/98b27d710380707983b3f57348b9255d5b33bb30/pyspark_kmodes/pyspark_kmodes.py#L314