ThinkBigAnalytics / pyspark-distributed-kmodes

MIT License

use n_clusters instead of n_partitions to locate the partition index #3

Open cinqs opened 6 years ago

cinqs commented 6 years ago

Hello, and thank you for creating this repo.

While using it, I found that the line linked at the end of this comment should be:

partition_index = int(np.floor(index/n_clusters))

The reason is that the cluster indexes within a single partition are contiguous, so the partition index should be computed as cluster_index / n_cluster_in_one_partition (rounded down), not by dividing by n_partitions.
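To illustrate the reasoning, here is a minimal sketch in plain Python. It assumes that every partition contributes exactly n_clusters clusters and that they are concatenated in partition order; the sizes and the helper name partition_of are hypothetical and not taken from the repo. For non-negative integers, index // n_clusters gives the same result as int(np.floor(index / n_clusters)).

```python
def partition_of(index, divisor):
    """Floor-divide a global cluster index to get a partition index."""
    return index // divisor

# Hypothetical sizes, chosen only for illustration.
n_partitions = 3
n_clusters = 4  # clusters produced by each partition

# Ground truth under the assumed layout: partition 0 owns global
# indexes 0..3, partition 1 owns 4..7, partition 2 owns 8..11.
expected = [p for p in range(n_partitions) for _ in range(n_clusters)]

total = n_partitions * n_clusters
by_clusters = [partition_of(i, n_clusters) for i in range(total)]
by_partitions = [partition_of(i, n_partitions) for i in range(total)]

print(expected)       # layout we want to recover
print(by_clusters)    # dividing by n_clusters
print(by_partitions)  # dividing by n_partitions
```

With these sizes, dividing by n_clusters reproduces the expected layout, while dividing by n_partitions assigns some clusters to the wrong partition and even yields a partition index of 3, which does not exist when there are only 3 partitions. The two divisors agree only when n_clusters happens to equal n_partitions.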

Thanks. Should I submit a merge request?

https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes/blob/98b27d710380707983b3f57348b9255d5b33bb30/pyspark_kmodes/pyspark_kmodes.py#L314