Computational-Content-Analysis-2018 / 19-Jan-Flat-Clustering

Manning, Christopher, Prabhakar Raghavan and Hinrich Schütze. 2008. “Flat Clustering” and “Hierarchical Clustering.” Chapters 16 and 17 from Introduction to Information Retrieval.
https://github.com/Computational-Content-Analysis-2018

Flat & Hierarchical Clustering #4

Open sunnyJy opened 6 years ago

sunnyJy commented 6 years ago

Page 361: Questions about the flat clustering algorithm
1) Intuitively, how are the seeds moved around, and how is this done efficiently?
2) What if the seeds happen to come from the same class? In theory, they should be assigned to the same centroid and the same cluster, rather than serving as two distinct centroids of two clusters. Should we check the similarity of the two seeds at the beginning?

Page 364: Questions about how to pick the seeds efficiently
The third bullet point mentions the "lowest cost". How do we measure the cost of a cluster? What if the seed obtained according to this bullet point happens to be an outlier or has some other disadvantage?

Page 368: Questions about model-based clustering
The text says that the randomly selected seeds are regarded as centroids, i.e. as a model that generates the data, and that the documents are the noise. The model recovered from the data then defines the clusters and an assignment of documents to clusters.
(Q1) In "generates the data", does "data" refer to the numerical distance from each document to its centroid?
(Q2) How can we verify that the model "recovers the original one"?

Jane 2018.1.18

sunnyjooey commented 6 years ago

I think the problem of picking a bad (high-cost) seed can be somewhat remedied by running the clustering algorithm many times with different seeds. Some runs will produce clusters with higher cost (these you might want to ignore) and some with lower cost (maybe the clusters you want). Roughly, cost can be measured by finding the center of each cluster and adding up the distances from that center to each of the observations in the cluster. If you end up with an outlier as a center because you picked it as the seed, the cost will be very high because all the observations will be far away from the center. Then you might want to try again with a different seed :)
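
For anyone who wants to try this, here is a rough sketch of the "run it many times and keep the cheapest result" idea in plain NumPy. It is only an illustration under a few assumptions: the cost is taken to be the sum of squared distances from each point to its cluster center (the RSS from the chapter, not plain distance), the seeds are random rows of the data, and the function names `kmeans` and `best_of_restarts` and the toy matrix `X` are made up for this example.

```python
import numpy as np

def kmeans(X, k, seed_idx, n_iter=100):
    """Plain K-means started from the rows of X listed in seed_idx.

    Returns (centroids, labels, cost), where cost is the sum of squared
    distances from every point to its assigned centroid.
    """
    centroids = X[list(seed_idx)].astype(float)
    for _ in range(n_iter):
        # Squared distance from every point to every centroid, shape (n, k).
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid in place
                new[j] = members.mean(axis=0)
        if np.allclose(new, centroids):
            break  # centroids stopped moving
        centroids = new
    # Recompute assignments and cost against the final centroids.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    cost = float(d[np.arange(len(X)), labels].sum())
    return centroids, labels, cost

def best_of_restarts(X, k, n_restarts=10, seed=0):
    """Run K-means from several random seed sets; keep the lowest-cost run."""
    rng = np.random.default_rng(seed)
    runs = []
    for _ in range(n_restarts):
        seed_idx = rng.choice(len(X), size=k, replace=False)
        runs.append(kmeans(X, k, seed_idx))
    best = min(runs, key=lambda run: run[2])
    return best, [run[2] for run in runs]

# Toy data (hypothetical): two small, well-separated squares of points.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)

best, costs = best_of_restarts(X, k=2)
print("cost of each run:", costs)
print("lowest cost:", best[2])
```

The centroid update inside the loop is also what "moves" the seeds: after the first pass each center is replaced by the mean of the points assigned to it, so the original seed only matters as a starting point.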