ThinkBigAnalytics / pyspark-distributed-kmodes


TypeError in check_for_empty_cluster #2

Open kopytin opened 6 years ago

kopytin commented 6 years ago

Hello,

I am getting a TypeError in the current version of this module. Whether it appears depends on the number of clusters I request: on the same dataset, I never see the error with 2 clusters, sometimes see it with 4, and always see it with 10.

File "/usr/local/lib/python3.5/dist-packages/pyspark_kmodes/pyspark_kmodes.py", line 430, in fit self.n_clusters,self.max_dist_iter) File "/usr/local/lib/python3.5/dist-packages/pyspark_kmodes/pyspark_kmodes.py", line 271, in k_modes_partitioned clusters = check_for_empty_cluster(clusters, rdd) File "/usr/local/lib/python3.5/dist-packages/pyspark_kmodes/pyspark_kmodes.py", line 315, in check_for_empty_cluster partition_sizes = cluster_sizes[n_clusters(partition_index):n_clusters(partition_index+1)] TypeError: slice indices must be integers or None or have an index method

This is Spark 2.2. Any ideas would be appreciated.
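
For reference, the failing slice can be reproduced outside the library: under Python 3, `/` is true division and always returns a float, and floats are not valid slice indices. A minimal standalone sketch (not the package's actual code):

```python
# Standalone reproduction of the error. Under Python 3, "/" is true
# division and returns a float, and floats cannot be used as slice indices.
cluster_sizes = [5, 3, 4, 6, 2, 7, 1, 8]  # hypothetical per-cluster counts

n_clusters = len(cluster_sizes) / 4       # -> 2.0, a float
partition_index = 1

try:
    cluster_sizes[n_clusters * partition_index:n_clusters * (partition_index + 1)]
except TypeError as e:
    print(e)  # slice indices must be integers or None or have an index method
```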

cinqs commented 6 years ago

Hey, did you try replacing n_partitions with n_clusters?

As I explained here: https://github.com/ThinkBigAnalytics/pyspark-distributed-kmodes/issues/3

It seems this repo is no longer maintained by anyone...
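
For concreteness, here is a sketch of the kind of change that avoids the float slice indices, assuming the `n_clusters` used at line 315 comes from a true division over the partition count (only the slicing line is taken from the traceback; the surrounding code and variable names are assumptions):

```python
# Sketch of a fix inside check_for_empty_cluster (assumed context).
n_partitions = rdd.getNumPartitions()

# Before (assumed): true division yields a float under Python 3, so the
# slice below raises the TypeError from the traceback.
# n_clusters = len(clusters) / n_partitions

# After: floor division keeps the slice indices integral.
n_clusters = len(clusters) // n_partitions

partition_sizes = cluster_sizes[n_clusters * partition_index:
                                n_clusters * (partition_index + 1)]
```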