Flat & Hierarchical Clustering

Computational-Content-Analysis-2018 / 19-Jan-Flat-Clustering

Manning, Christopher, Prabhakar Raghavan and Hinrich Schütze. 2008. “Flat Clustering” and “Hierarchical Clustering.” Chapters 16 and 17 from Introduction to Information Retrieval.


Page 361: Questions about the flat clustering algorithm. 1) Intuitively, how are the seeds moved around, and how can this be done efficiently? 2) What if the seeds happen to come from the same class? In theory, they should be assigned to the same centroid and the same cluster, rather than being two distinct centroids of two clusters. Should we check the similarity of the seeds at the beginning?
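To make question 1 concrete, here is a minimal sketch, assuming the algorithm on that page is K-means with Euclidean distance. The function names (`nearest`, `kmeans`) and the toy points are hypothetical, chosen only for illustration. The "move" is just a recomputation: each centroid jumps to the mean of the documents currently assigned to it, and the two steps repeat until nothing changes.

```python
import math

def nearest(points, centroids):
    """For each point, return the index of its closest centroid."""
    return [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points]

def kmeans(points, seeds, max_iter=100):
    """Plain K-means sketch: reassign points, then move each centroid to its mean."""
    centroids = [tuple(s) for s in seeds]
    for _ in range(max_iter):
        labels = nearest(points, centroids)
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                # The "move": the new centroid is the coordinate-wise mean.
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # Empty cluster: keep the old centroid (one possible choice).
                updated.append(old)
        if updated == centroids:
            break  # converged, no centroid moved
        centroids = updated
    return centroids, nearest(points, centroids)

# Toy data: two obvious groups, but BOTH seeds are taken from the first group.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, labels = kmeans(points, seeds=[(0, 0), (0, 1)])
print(centroids)  # [(0.0, 0.5), (10.0, 10.5)]
print(labels)     # [0, 0, 1, 1]
```

In this toy case the two seeds start in the same group, yet one centroid is pulled over to the second group after the first update. That is only one example, not a guarantee: with less well-separated data, poorly placed seeds can leave the algorithm in a worse local optimum.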

Page 364: Questions about how to pick the seeds efficiently. The third bullet point mentions the "lowest cost". How is the cost of the cluster measured? What if a seed obtained according to this bullet point happens to be an outlier or has other disadvantages?
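One possible reading, sketched below, assumes that "cost" means the residual sum of squares (RSS), i.e. the summed squared distance from each document to its assigned centroid, and that the lowest-cost rule means running K-means from several random seed sets and keeping the run with the smallest RSS. The helper names (`rss`, `best_of_restarts`) and the restart count of 10 are assumptions for illustration.

```python
import math
import random

def nearest(points, centroids):
    """For each point, return the index of its closest centroid."""
    return [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points]

def kmeans(points, seeds, max_iter=100):
    """Plain K-means sketch; returns (centroids, labels)."""
    centroids = [tuple(s) for s in seeds]
    for _ in range(max_iter):
        labels = nearest(points, centroids)
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(old)  # empty cluster: keep the old centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids, nearest(points, centroids)

def rss(points, centroids, labels):
    """Residual sum of squares: squared distance of each point to its centroid."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))

def best_of_restarts(points, k, restarts=10, seed=0):
    """Run K-means from several random seed sets and keep the lowest-RSS result."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        seeds = rng.sample(points, k)  # k distinct positions drawn from the data
        centroids, labels = kmeans(points, seeds)
        cost = rss(points, centroids, labels)
        if best is None or cost < best[0]:
            best = (cost, centroids, labels)
    return best

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
cost, centroids, labels = best_of_restarts(points, k=2)
print(cost)  # 1.0 (each point lies 0.5 from its centroid, so 4 * 0.25)
```

Note that the comparison is between finished clusterings, not between individual seeds. Under this reading, an outlier seed is penalised only indirectly: a run that spends a whole cluster on one outlier tends to leave the remaining documents with a higher RSS, so it loses to a better-seeded run.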

Page 368: Questions about model-based clustering. The text says that the randomly selected seeds are regarded as centroids, that is, as a model that generates the data, and that the documents are the noise. The model that generates the data, and that is then recovered, is defined as a set of clusters and an assignment of documents to clusters. (Q1) In "generates data", does "data" refer to the numerical distance from a document to its centroid? (Q2) How can we verify that the "model recovers the original one"?
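A small simulation may help with both questions. It assumes one reading of the passage: the centroids are the model, and a document is generated by picking a centroid at random and adding noise to it, so the "data" are the documents themselves rather than distances. The Gaussian noise, the names `generate` and `recovery_error`, and all numeric settings are assumptions for illustration.

```python
import math
import random

def generate(true_centroids, n_docs, noise, rng):
    """Generate documents: pick a centroid at random, then add Gaussian noise."""
    docs = []
    for _ in range(n_docs):
        center = rng.choice(true_centroids)
        docs.append(tuple(c + rng.gauss(0.0, noise) for c in center))
    return docs

def kmeans(points, seeds, max_iter=100):
    """Plain K-means sketch; returns the final centroids."""
    centroids = [tuple(s) for s in seeds]
    for _ in range(max_iter):
        labels = [min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
                  for p in points]
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, l in zip(points, labels) if l == k]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(old)  # empty cluster: keep the old centroid
        if updated == centroids:
            break
        centroids = updated
    return centroids

def recovery_error(true_centroids, found):
    """Largest distance from any true centroid to its closest recovered centroid."""
    return max(min(math.dist(t, f) for f in found) for t in true_centroids)

rng = random.Random(0)
true_centroids = [(0.0, 0.0), (10.0, 10.0)]        # the generating model
docs = generate(true_centroids, n_docs=200, noise=0.5, rng=rng)
found = kmeans(docs, rng.sample(docs, 2))          # the recovered model
error = recovery_error(true_centroids, found)
print(error)  # small when the clusters are well separated
```

This kind of check is only possible on synthetic data, where the generating centroids are known. For real documents the original model is unknown, so recovery cannot be verified directly and one would have to fall back on evaluation against labelled classes or a similar external criterion.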

Jane 2018.1.18

Flat & Hierarchical Clustering #4