ThinkBigAnalytics / pyspark-distributed-kmodes

MIT License

IndexError in pyspark_kmodes #6

Open supreetkt opened 5 years ago

supreetkt commented 5 years ago

I'm receiving an IndexError on line #317: `random_element = random.choice(clusters[biggest_cluster].members)`. I have a large DataFrame (10,000+ rows and 15+ columns). I first tried this with k=2. I debugged the program, and the error occurs because `cluster_sizes` contains 0 in two of its elements, but I'm not able to understand why.

If I limit my DataFrame to, say, 100 rows, this error goes away, but then after 3 iterations of the algorithm I get another error: 'More clusters than data points?'

Any ideas on how to solve this?
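For what it's worth, the crash happens when `random.choice()` is called on an empty `members` list, i.e. when the largest cluster itself has size 0. A minimal sketch of a defensive guard is below; note that `Cluster` and `reseed_empty_cluster` are hypothetical stand-ins for the library's internals, not the actual pyspark_kmodes code:

```python
import random

class Cluster:
    """Hypothetical stand-in for the library's cluster object."""
    def __init__(self, members):
        self.members = list(members)

def reseed_empty_cluster(clusters):
    """Pick a random member from the largest cluster to seed an empty one.

    The reported IndexError is raised when random.choice() receives an
    empty list; checking the size of the biggest cluster first surfaces
    the real problem (all clusters empty) instead of a bare IndexError.
    """
    cluster_sizes = [len(c.members) for c in clusters]
    biggest_cluster = cluster_sizes.index(max(cluster_sizes))
    if cluster_sizes[biggest_cluster] == 0:
        raise ValueError("all clusters are empty; cannot reseed")
    return random.choice(clusters[biggest_cluster].members)

clusters = [Cluster([1, 2, 3, 4]), Cluster([])]
moved = reseed_empty_cluster(clusters)
print(moved in clusters[0].members)  # True
```

This only turns the crash into a clearer error; the underlying question of why two clusters end up empty remains.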

asapegin commented 5 years ago

Yes, there can be problems. As I mentioned in the issue I created, this implementation of k-modes is incorrect, which leads to empty clusters. If you are interested, you can check my refactored version of k-modes.