Open jamesallenevans opened 4 years ago
Section 16.4.1 addresses how to choose the appropriate number of clusters K in flat clustering algorithms, stating that a naive approach would be minimizing the residual sum of squares (like OLS) and then brings up a second approach using , which mimics the ridge regression/Tikhonov regularization formula. Because ridge is generally used to reduce noisy data in supervised ML, how does this idea translate to choosing the best cluster size?
In section 17.7, the authors discuss some methods for cluster labeling, which is very important for interpreting model results. They discuss a few options (centroid words, mutual information, and document title closest to centroid) but these can be hard to interpret and/or misleading. Furthermore, when considering hierarchical clustering, there is added complexity when it comes to differentiating parent vs child nodes. Is there a common practice when it comes to flat or hierarchical cluster labeling? What are other ways clustering results can become more interpretable?
In chapters 16 and 17, Manning and Raghavan describe how flat and hierarchical clustering algorithms work. I’m excited about this method because I think that unsupervised learning can produce clusters that humans might not be able to recognize. Humans always have certain concepts in mind – our brains are pre-conditioned, no matter how hard we try to be objective. However, I’m curious whether unsupervised machine learning can really get us out of that tight spot. Using either a flat cluster or a hierarchical cluster will often lead to different results. For flat clusters, humans have to pre-specify the numbers of clusters. If those decisions are down to the researcher, how much objectivity have we really won? Is there a second-order algorithm to pick the best algorithm objectively?
The EM algorithm for flat clustering is related to the Naive Bayes approach for classification (p. 339). Prof. Evans mentioned that the Naive Bayes approach is no longer considered a state-of-the-art technique for classification. Is this also true for using the EM algorithm for clustering? My second question is how to choose the distribution for the algorithm. For example, how to choose between a multinomial model and a Gaussian model?
What is the major difference between EM algorithm mentioned in the chapter 16 and topic modelling? They are both algorithms to assign documents to some categories/topics, and they both allow one document to have several categories (and one categories to have several documents). They are both based on bayes theorem, a prior topic/cluster distribution and prior word distribution in each topic/cluster. I even find their cost functions and estimation methods quite similar. Are they the same method (or very similar method from the same family)?
Comparing the network analysis based on document-document matrix, the clustering of documents and the classification of documents based on topic-modeling, though they rest on different approaches and algorithms, the could all serve to cluster documents based on the proportion of words shared by the documents. Should we expect the clustering results produced by them similar to each other? If so, could they work as validation/robustness check for each other?
Can we go over evaluation procedures/metrics for these clustering methods? As with LDA and dynamic topic models, I'm not quite sure how you'd go about evaluating the results of flat or hierarchical clusters, as evaluation requires some second standard to compare results with and I'm not sure what that standard would be.
In addition to the algorithms mentioned in the two chapters, the link clustering methods make me wondering whether it is also possible to construct edges among documents based on whether they share keywords, and identify the network communities using methods such as weighted random walk, so that each community represents a cluster. Does it also work?
In the section of 18.3, it demonstrates the low-rank matrix approximations. Basically, it use the Frobenius norm of X to measure the discrepancy between the original matrix and the approximated matrix with lower rank. But how large the benchmark of the value of Frobenius norm can be regarded as a good approximation? Besides, when we conduct SVD, how we can determine the k mentioned in the book? In other words, for all the singular values on the diagonal, how we can determine how large the singular values we should keep and replace other singular values which are smaller than it with zeros?
The author briefly introduced the first story detection in 17.5, and it seems like a useful method to detect how media users develop new terms to describe certain events. If we are interested in applying it, what clustering method should we apply and how effective this technique is? On the other hand, if we are interested in detecting a series of new words generated in a period of time, is it possible to loop the code? Does the size of data affect its performance?
Chapter 17 introduces clustering in two different order, the bottom-up method-agglomerative clustering-and, the top-down method-divisive cluster. As it is claimed that
...There is evidence that divisive algorithms produce more accurate hierarchies than bottom-up algorithms in some circumstances...
I wonder if the same set of documents is given to these two methods, what would the results be like (select proper criterion for the algorithms)? Is there going to be any significant difference? What are the sources of such differences, if any?
I like the idea of bottom-up and top-down. In analogy to human processing, when having top-down thinking, it is more likely that people would see FROM the big picture and vice versa for bottoms-up. And that's why I think it is good to have machine do this in an unsupervised way, so that it can do clustering without any assumptions like humans.
"Since our goal is to optimize an objective function, clustering is essentially a search problem. The brute force solution would be to enumerate all possible clusterings and pick the best. However, there are exponentially many partitions, so this approach is not feasible" (356).
In what contexts assuming conditional independence would be reasonable to reduce the number of upper bounds on the number of clusterings?
I am wondering whether it is necessary to perform a more complex method which is less interpretable like hierarchical clustering. It greatly adds to the difficulty when dealing with the interpretation of results and obscures the concepts by compelling researchers to differentiate between parent nodes and child ones, which are sometimes unnecessary in practical applications.
In the flat clustering chapter, the author mentioned the importance of ‘distance’ for clustering. Different measures of ‘distance’ could greatly influence the outcome of clustering. Therefore, I am wondering, except for Euclidean distance, what are the other common measures of distance in clustering and what are their possible different effects on clustering?
When evaluating k-means, the algorithm 'terminates' when RSS falls below a threshold. What are the factors that go into determining this threshold, case by case? What are potential tradeoffs that we should keep in mind when making this choice?
On page 359 of the chapter, the authors state that "Separating similar documents is sometimes worse than putting pairs of dissimilar documents in the same cluster", but does not explain why.
Of course, I'm sure it's more difficult to separate semantically similar documents and when creating clusters, we would rather have true pairs. But from an interpretability viewpoint, would it not be more confusing to see unrelated pairs forming and creating false associations?
I am wondering about the different labeling process mentioned by the authors in page. 397. As we can see from the table that, different hierarchical methods are emphasizing different aspects of the same document, hence differences arise when it comes to labeling. But how does this process go? What are their criterions in labeling?
I didn't quite get the definition of medoid. How is a medoid empirically calculated, and why specifically is it faster to for calculating distance scores than centroids?
The notes on computational complexities (p.398) is very beneficial. It suggests using subset of documents of size \sqrt{N} when the number of documents is more than 1,000,000. I wonder what is the appropriate or acceptable way to choose those subsets from the original dataset.
The distinction made in chapter 16 between hard and soft clustering is unclear to me in terms of what sort of data each method is best suited to. The chapter cites "latent semantic indexing" but does not explain what makes it especially suited to soft clustering, for instance.
The paper explains one of the clustering techniques, k-means. While this is a great tool to group many documents by their intra-group similarity and inter-group distance, it does not guarantee that the residual sum of squares converges to the global minimum. I can imagine this issue becomes more significant when the data gets larger and complex with many outliers. How can I check if the RSS converges to one of the local minima, stuck in one place, or to the global minimum?
Some of the above questions mentioned the comparison/difference of the clustering algorithm that directly applied to data and that is based on the word network. Personally, I am interested in the possibility to combine those two methods. For example, by not only considering vector distance but also the network path between words (like a path in vector space). This combination idea mainly comes from the interest of identifying geometry/relationships of words given hierarchical clustering has potential drawbacks of information loss.
I have a rather basic question about the intuition of clustering. Like @di-Tong's question, in the word-document matrix case, does the k-means algorithm cluster documents based on the words each document has in common? How should I interpret what the algorithm does to the data? To give an example, if we want to cluster Twitter users based on their tweets, how should we interpret the results from clustering? Twitter users can be different based on many aspects. How do we know which aspects are picked up by the model?
For those clustering methods, My interest is how it could be combined with networks. Generally, we would take into account the connectivity of the nodes in networks when performing segmentations to a network. But I also believe that clustering methods or their logic would be helpful in such tasks. So I hope to see if there are any examples to make the network attributes of nodes the features for clustering that yield ideal results related to network structures or flows?
This paper gives a great theoretical background for flat and hierarchical clustering. Although hierarchical clustering is not restricted by a predetermined number of clusters and non-deterministic characteristics of flat clustering, if we want to prioritize efficiency, we may want to use flat clustering. Moreover, K-means is one of the most popular clustering techniques. I am thrilled by the clustering applications in the business world such as web searching and news update. I want to learn more about other application scenarios.
My question from reading these two chapters is, according to the chapters, k-medoids are more efficient because it computes medics instead of centroids as cluster centers, but I’m not sure I understand when these two approaches are applied? Also, the authors introduce an approach to find the optimum number of clusters by estimating RSS minimum in 16.4.1. I wonder how this is done in empirical settings, i.e. how do researchers use this or other approaches to find the optimum number of K in social science research and whether this approach is effective?
Manning, Christopher, Prabhakar Raghavan and Hinrich Schütze. 2008. “Flat Clustering” and “Hierarchical Clustering.” Chapters 16 and 17 from Introduction to Information Retrieval.