UChicago-Computational-Content-Analysis / Readings-Responses-2023


3. Discovering Higher-Level Patterns - fundamental #41

JunsolKim opened this issue 2 years ago

JunsolKim commented 2 years ago

Post questions here for this week's fundamental readings: Grimmer, Justin, Molly Roberts, Brandon Stewart. 2022. Text as Data. Princeton University Press: Chapters 10, 12, 6, 13 —“Principles of Discovery”, “Clustering”, “The Multinomial Language Model”, “Topic Models”.

GabeNicholson commented 2 years ago

Repeatedly within the chapters, the authors mention that there is no "one correct way to organize text". Yet this seems to be more a statement of ignorance than a ground-level truth. For example, in the clustering chapter they mention that different clustering algorithms tend to converge on similar clusters. To me this indicates that there is some true, correct way to cluster the text, and the different algorithms are simply different approximations, some closer than others, but on average they all hover around the correct clusters. Given this straightforward interpretation, how can there not be a more correct way to organize the text, even if we can never know for sure what the global minimum is and can only approximate it?

pranathiiyer commented 2 years ago

The text mentions that in order to measure partition quality in clustering, we sum the dissimilarity of documents. However, it goes on to say that this does not account for the distinctiveness of the clusters themselves, which might have to be accounted for in the objective function. Wouldn't judging distinctiveness be a task of human interpretation? I couldn't understand how it could be built into the objective function.

isaduan commented 2 years ago

> Repeatedly within the chapters, the authors mention that there is no "one correct way to organize text". Yet this seems to be more a statement of ignorance than a ground-level truth. For example, in the clustering chapter they mention that different clustering algorithms tend to converge on similar clusters. To me this indicates that there is some true, correct way to cluster the text, and the different algorithms are simply different approximations, some closer than others, but on average they all hover around the correct clusters. Given this straightforward interpretation, how can there not be a more correct way to organize the text, even if we can never know for sure what the global minimum is and can only approximate it?

'Correctness' is a tricky notion! One way I make sense of it: people may not agree on which is the correct way to cluster a given set of texts for a research question. The purpose of clustering, as the authors try to show in this chapter, is to discover interesting patterns and inspire new theory development. You can imagine an economist and an anthropologist looking at the same clusters, one finding them interesting, the other finding them useless! I do see your point that there are cases where people agree which approximations are better approximations of reality, but I think there are also many cases where they disagree, and quite reasonably so. After all, how do we measure the distance between an approximation and 'raw reality', if all measurements of raw reality contain approximations themselves?

> The text mentions that in order to measure partition quality in clustering, we sum the dissimilarity of documents. However, it goes on to say that this does not account for the distinctiveness of the clusters themselves, which might have to be accounted for in the objective function. Wouldn't judging distinctiveness be a task of human interpretation? I couldn't understand how it could be built into the objective function.

I think you are right that judging how distinctive clusters are from each other is ultimately a task of human interpretation! But in the objective function we have to give some measure of how good the partition is. There are many ways of doing it, and the dissimilarity of documents within clusters, or between clusters, can be part of that measure. Only then can the algorithm iteratively improve the partition.
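For concreteness, here is a minimal toy sketch (my own construction, not the book's formula; the function name, the ratio form, and the data are all illustrative) of how distinctiveness could enter an objective function numerically, by rewarding tight clusters and penalizing clusters whose centroids sit close together:

```python
import numpy as np

def partition_quality(X, labels):
    """Higher is better: between-cluster spread divided by within-cluster spread."""
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])

    # within-cluster dissimilarity: distance of each document to its own centroid
    within = sum(
        np.linalg.norm(X[labels == c] - centroids[i], axis=1).sum()
        for i, c in enumerate(clusters)
    )

    # between-cluster dissimilarity: pairwise distances among centroids
    diffs = centroids[:, None, :] - centroids[None, :, :]
    between = np.linalg.norm(diffs, axis=-1).sum() / 2

    return between / within

# toy usage with random "document vectors" and an arbitrary 2-cluster partition
X = np.random.rand(20, 50)
labels = np.repeat([0, 1], 10)
print(partition_quality(X, labels))
```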

Hongkai040 commented 2 years ago

I got lost in the Dirichlet distribution... my brain overloaded. The authors say the Dirichlet distribution provides us with the tools to specify a data-generating process, and then they drop it straight into the multinomial language model. Why should we use it? What is the advantage of using it? Are there alternative solutions?
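To check my own understanding, here is a rough sketch (toy vocabulary and made-up numbers, not the book's example) of the data-generating process as I read it: draw a document's word probabilities from a Dirichlet prior, then draw word counts from a multinomial. One apparent advantage is that the prior keeps every word's probability above zero, which acts like smoothing.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["tax", "war", "health", "school", "trade"]

alpha = np.ones(len(vocab))           # symmetric Dirichlet prior over the vocabulary
theta = rng.dirichlet(alpha)          # one document's word-probability vector
counts = rng.multinomial(100, theta)  # word counts for a 100-word document

for word, count in zip(vocab, counts):
    print(word, count)
```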

Jasmine97Huang commented 2 years ago

I have a hard time understanding how the multinomial distribution is an appropriate language model beyond its statistical convenience. Is this just a limitation of bag-of-words models in general, or am I missing something here? An autoregressive model would do a much better job of capturing the context and characteristics of a corpus. Is the multinomial language model still relevant? In what situations do we prefer one over the other?

Jiayu-Kang commented 2 years ago

The authors mention that there is a lack of guidance on how to select a clustering method for a particular problem. While human intervention is necessary for detecting interesting organizations of texts, humans may not reach an agreement during this process, and of course they also carry potential biases. Is it then possible to have some "objective function that is easy to write down"? In other words, what do researchers need in order to develop "generally applicable theorems" that reduce the level of arbitrariness in choosing clustering methods?

mikepackard415 commented 2 years ago

I'm a little confused about what the authors were saying in Chapter 13 about the inclusion of 'upstream' and 'downstream' covariates in topic modeling. They write that "upstream covariate models allow for information to be shared, but unlike the downstream covariate models we will talk about below, the topics are not trying to explain the covariates." Then there is also a distinction between "upstream known content" and "upstream known prevalence" covariates. I guess I don't have a very specific question, I'm just looking to understand these distinctions a bit better.

MengChenC commented 2 years ago

> I have a hard time understanding how the multinomial distribution is an appropriate language model beyond its statistical convenience. Is this just a limitation of bag-of-words models in general, or am I missing something here? An autoregressive model would do a much better job of capturing the context and characteristics of a corpus. Is the multinomial language model still relevant? In what situations do we prefer one over the other?

I wanted to second this one. From my perspective, topic modeling relies heavily on assumptions that may or may not hold in a given case, even though the outcomes often perform well empirically. This is true for both static and dynamic topic models. How can we justify the generality of the findings when they rest on such strong assumptions?

facundosuenzo commented 2 years ago

> The authors mention that there is a lack of guidance on how to select a clustering method for a particular problem. While human intervention is necessary for detecting interesting organizations of texts, humans may not reach an agreement during this process, and of course they also carry potential biases. Is it then possible to have some "objective function that is easy to write down"? In other words, what do researchers need in order to develop "generally applicable theorems" that reduce the level of arbitrariness in choosing clustering methods?

I found this back-and-forth between humans and algorithms quite provocative to keep thinking about alongside the abduction process mentioned by Timmermans & Tavory. I'd say that, following the principles of a more traditional content analysis approach, researchers should also evaluate inter-coder reliability (e.g., Krippendorff's alpha).

I also wondered how we could compare probabilistic and algorithmic approaches to clustering using the BIC statistic. Would that be possible?
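Here is roughly what I have in mind, as a sketch with scikit-learn and toy stand-in data: BIC is straightforward for probabilistic clustering because a mixture model has a likelihood, whereas purely algorithmic methods like k-means don't define one unless you recast them as a restricted mixture model.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# stand-in for low-dimensional document representations
X = np.random.default_rng(0).normal(size=(200, 10))

# compare different numbers of clusters by penalized fit
for k in range(2, 6):
    gm = GaussianMixture(n_components=k, random_state=0).fit(X)
    print(k, gm.bic(X))  # lower BIC = better penalized fit
```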

Qiuyu-Li commented 2 years ago

In the first paragraph of chapter 10, the authors position data as the counterpart, and equal, to thinking. This is quite interesting, because intuitively we might put the word "observation" in the place of "data" in that sentence. It's amazing how computers and information technologies have transformed our way of seeing and feeling the world with their 0-1 binary logic, though I still feel reluctant to accept this logic. In the US Congress ideology example, I'm a little confused about why people would accept the "ideal point" in the spectrum pictured by the DW-NOMINATE algorithm. What is in the "gap" between Democrats and Republicans? If we can't make sense of it, how can we trust it, let alone use it as the foundation for further analysis?

sizhenf commented 2 years ago

The authors claim that we should "judge the concept but not the method". On the one hand, I agree that a legitimate theory should, and can, be tested via a variety of methods. On the other hand, when we are building our models there does seem to be a "right" or "wrong" way of doing so, and not "judging the method" seems like a pretty strong statement. How is this idea best understood?

ValAlvernUChic commented 2 years ago

> The text mentions that in order to measure partition quality in clustering, we sum the dissimilarity of documents. However, it goes on to say that this does not account for the distinctiveness of the clusters themselves, which might have to be accounted for in the objective function. Wouldn't judging distinctiveness be a task of human interpretation? I couldn't understand how it could be built into the objective function.

I agree with this! Unless I misunderstand clustering, I can't see how an objective function can capture distinctiveness if we're not also tracking the overall meaning of the words within each cluster. If it's just a numerical signal, it doesn't seem very useful from an interpretation standpoint!

hshi420 commented 2 years ago

For clustering, especially clustering documents in a social science context, should we follow a hypothesis? In other words, is treating K as an a priori hypothesis and checking whether the resulting clusters align with that hypothesis better than clustering first and then formulating a hypothesis? Also, is it possible that the clustering results align with some hypothesis but through different underlying mechanisms? How should we deal with situations like this?
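To make the first question concrete, here is a small sketch (scikit-learn, toy stand-in data, hypothetical labels of my own) of the "K as a priori hypothesis" route: fix K from the hypothesis, cluster, and compare the assignments against the hypothesized categories with the adjusted Rand index.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))               # stand-in document vectors
hypothesized = rng.integers(0, 3, size=100)  # categories my hypothesis predicts

k = 3  # K chosen a priori from the hypothesis
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)

# 1 = perfect agreement with the hypothesized categories, ~0 = chance
print(adjusted_rand_score(hypothesized, clusters))
```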

konratp commented 2 years ago

In chapter 10, the authors outline four core principles of discovering and conceptualizing data: 1. context relevance, 2. no ground truth, 3. judging the concept, not the method, and 4. separate data is best.

I'm a little concerned about the ethics behind the third principle, judging the concept, not the method. While it might not directly influence the results of the study, choosing a problematic method could produce problematic real-world side effects of the research. For example, I could see a 'group of online radicals' feeling legitimized and emboldened in their actions after a researcher uncritically engages in their discussions. There are so many examples of researchers performing actions that harm subjects, especially those from marginalized backgrounds. While I'm aware that this chapter is mostly about conceptualizing one's research, I'm a bit shocked that these statements aren't supplemented with at least a qualifying sentence or two.

It could also be that I myself am guilty of violating the principle of context relevance in even posting this question, so who truly knows!

NaiyuJ commented 2 years ago

My question targets Chapter 12, Clustering. I find that real-world datasets typically do not fall into obvious clusters when I first implement the algorithm. This is not to say that there is no such pattern in the dataset; it's possible that we just haven't found the right way to do the clustering. In my experience, I typically need a lot of back and forth to find the best one. One thing I'm really confused about is how to improve clustering quality across different contexts or research topics. For example, an improvement may work for a corpus about political representation but not for a corpus about ethnic conflict.

sudhamshow commented 2 years ago

I understand that when using k-means to cluster documents, the clusters formed depend very much on the initial placement of the cluster centers (the means or medians). Since they are placed randomly at the start, multiple centers might drift together over the iterations (thereby misrepresenting a single cluster as multiple partitioned ones) or converge in an area with no data points at all (in the case of geometrically similar distributions). How are such cases taken care of, and how are the centers initialized?
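From what I understand (sketch below with scikit-learn and toy data, not the book's example), the usual safeguards are k-means++ initialization, which spreads the initial centers apart, and multiple random restarts, keeping the run with the lowest within-cluster sum of squares, so a bad initialization is unlikely to survive.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 50))  # stand-in document vectors

km = KMeans(
    n_clusters=5,
    init="k-means++",  # spread-out initial centers
    n_init=20,         # 20 independent restarts, best one kept
    random_state=0,
).fit(X)

print(km.inertia_)  # within-cluster sum of squares of the best restart
```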

YileC928 commented 2 years ago

The authors mention smoothing and regularization for language models in Chapter 6. I am interested in learning more about other commonly used smoothing methods besides Laplace smoothing and the Dirichlet distribution, and about how to choose among them and tune them in practice.
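For reference, here is a toy comparison (made-up counts, my own illustration) of the two smoothers the chapter mentions as I understand them: maximum likelihood assigns zero probability to unseen words, while add-one (Laplace) and add-alpha (the Dirichlet-prior version) pull the estimates toward uniform.

```python
import numpy as np

counts = np.array([5, 3, 0, 2, 0])  # word counts in one document, vocabulary of 5

mle = counts / counts.sum()                                   # zero for unseen words
laplace = (counts + 1) / (counts.sum() + len(counts))         # add-one smoothing
alpha = 0.1
dirichlet = (counts + alpha) / (counts.sum() + alpha * len(counts))  # add-alpha smoothing

print(mle, laplace, dirichlet, sep="\n")
```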

Sirius2713 commented 2 years ago

In chapter 12 on clustering, the authors talk about probabilistic clustering. Is it based on Bayesian methods? From their description, it seems the procedure is to specify a prior distribution over how the data are generated and then update it with the observed data to obtain a posterior, but the authors don't make this clear.
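My current reading, as a sketch (scikit-learn, toy stand-in data): probabilistic clustering means fitting a mixture model, so every document gets a posterior probability of belonging to each cluster. The EM-fitted version is not fully Bayesian; scikit-learn's variational version adds a Dirichlet prior over the cluster weights, which is closer to what a Bayesian treatment would look like.

```python
import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

X = np.random.default_rng(0).normal(size=(200, 10))

# EM-fitted mixture: soft cluster memberships, but no prior on the parameters
em = GaussianMixture(n_components=3, random_state=0).fit(X)
print(em.predict_proba(X)[:3])

# Variational Bayesian mixture: Dirichlet prior over the mixture weights
bayes = BayesianGaussianMixture(
    n_components=3,
    weight_concentration_prior=1.0,
    random_state=0,
).fit(X)
print(bayes.weights_)  # estimated cluster weights under the prior
```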

LuZhang0128 commented 2 years ago

The authors emphasize at the end of the clustering chapter that there is "no one 'right' way to organize documents into categories." I wonder, when trying different clusterings, how we address questions like overfitting, p-hacking, or getting stuck in a local optimum. Or is it more about telling a good story based on the result? I also would like to learn more about chapter 6 :)

chentian418 commented 2 years ago

I understand that LDA is an extremely powerful tool for suggesting new organizations of documents. And just like clustering methods, LDA can depend strongly on arbitrary tuning parameters and the choice of inference method. Although the authors mention that this variation can be useful for learning different patterns in the documents, I am still curious how robust the LDA model is to these tuning parameters and inference methods. Moreover, how can we turn the patterns reflected in LDA into quantitative measurements, so that we could feed them into supervised-learning models?
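On the second question, one common route (sketched below with a toy corpus and illustrative parameter values, not a recommendation from the book) is to use each document's estimated topic proportions as a feature vector for a downstream supervised model; refitting with different random seeds or numbers of topics and checking whether the downstream results move is also a crude robustness check.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

# toy corpus and a made-up outcome to predict
docs = [
    "tax cuts and the federal budget",
    "war and foreign policy in the senate",
    "school funding and education reform",
    "trade policy and tariffs debate",
] * 10
labels = [0, 1, 0, 1] * 10

X_counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X_counts)  # each row sums to 1: topic proportions

clf = LogisticRegression().fit(doc_topics, labels)
print(clf.score(doc_topics, labels))
```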

Emily-fyeh commented 2 years ago

Among all the chapters, I am most interested in the parts about how humans interpret and label the clusters and topic models (Chapters 12 & 13), and also in how to determine the number of clusters and topics. I feel like a (latent) hypothesis or expectation stemming from the background analysis is still crucial in this process. I am also curious about the example of discovering the fire-grants topic in congressional press releases: are there other possible ways to discover unseen patterns in this kind of content?

zixu12 commented 2 years ago

I am also interested in how best to defend my way of clustering, given that there is no "correct" way of clustering. Also, from past learning experience, we always evaluate models with training and test datasets, assuming that we know some correct answers. In reality, if I am not sure about the correct answers, are there other ways to evaluate my model?
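One label-free option I know of is internal validation, e.g., the silhouette score, which uses only the data and the cluster assignments rather than any "correct answers"; a rough sketch with toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(150, 30))  # stand-in document vectors

# compare candidate numbers of clusters without any ground-truth labels
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))  # closer to 1 = tighter, better-separated clusters
```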

kelseywu99 commented 2 years ago

In chapter 12, the authors go over several styles of clustering methods, yet I still have questions about the criteria for determining the usefulness of a clustering. I was curious about the division between probabilistic and algorithmic models: what determines a good clustering, and what makes it the way it is?