lkcao opened 10 months ago
I would like to ask a question about Chapter 6. The chapter says, 'Under the multinomial distribution, each word is generated independently of all other words. It is equivalent to taking M independent draws from the categorical distribution with parameter µ, and summing the resulting one-hot encoding vectors.' However, I feel that in language, words are often interconnected. For instance, words related to 'sun' may be associated with 'heat' and 'sunburn.' Does this approach seem somewhat imprecise?
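For concreteness, here is a minimal numpy sketch of the generative process that sentence describes (the vocabulary, µ, and document length are made up for illustration; the independence across draws is exactly the assumption being questioned):

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = ["sun", "heat", "sunburn", "rain"]   # toy vocabulary (illustrative)
mu = np.array([0.4, 0.3, 0.2, 0.1])          # categorical parameter, sums to 1
M = 10                                       # document length in tokens

# M independent draws from the categorical distribution...
draws = rng.choice(len(vocab), size=M, p=mu)
# ...summed as one-hot vectors, i.e. the word-count vector:
counts = np.bincount(draws, minlength=len(vocab))

# This is distributionally the same as a single Multinomial(M, mu) draw:
counts_direct = rng.multinomial(M, mu)
print(dict(zip(vocab, counts)), counts_direct)
```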
It's cool to see the different ways clustering of text data can be done, even if the results don't differ across methods as much as one might expect. However, there are some methods whose point I don't understand (relative to the others). For instance, why would anyone ever use hard clustering over soft clustering? Soft clustering seems to provide valuable information (at least on the "human interpretation" level) at no cost.
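To make the contrast concrete, a small scikit-learn sketch on random stand-in document vectors (purely illustrative data; KMeans gives hard assignments, a Gaussian mixture gives soft ones):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.random.default_rng(0).normal(size=(100, 5))  # stand-in document vectors

# Hard clustering: each document gets exactly one label.
hard = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each document gets a probability over clusters.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
probs = gmm.predict_proba(X)   # shape (100, 3); each row sums to 1
print(hard[0], probs[0].round(2))
```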
Chapter 10 highlights the key role of computational methods in social science research, particularly their ability to formulate and discover concepts. Based on the "no ground truth" principle, I am thinking about how to choose the most appropriate way of organizing data in different research contexts. Are there general guidelines for such choices? Furthermore, does the adaptability of these organizational approaches affect the diversity of what we discover when addressing other problems?
I was very interested in the part where the authors relate the probabilistic and algorithmic approaches to clustering analysis, particularly how dissimilarity measures are related to the probability distributions behind them. The authors mention that there is a one-to-one mapping between a broad class of probability distributions and dissimilarity measures. Hence, I am wondering whether there are examples of dissimilarity measures and their associated probability distributions.
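One textbook pairing, offered as my own illustration rather than an example from the chapter: a spherical Gaussian likelihood corresponds to squared Euclidean distance, the measure behind k-means,

$$
p(x \mid \mu_k) \propto \exp\!\Big(-\tfrac{1}{2\sigma^2}\,\lVert x - \mu_k\rVert^2\Big)
\quad\Longrightarrow\quad
-\log p(x \mid \mu_k) = \tfrac{1}{2\sigma^2}\,\lVert x - \mu_k\rVert^2 + \text{const},
$$

so assigning a point to its most likely cluster is the same as assigning it to the nearest centroid. Analogous pairings exist across the exponential family (e.g., the multinomial pairs with a KL-divergence-style measure).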
Chapter 10, on the principles of discovery, can be a guiding pillar for our computational research. I like the authors' philosophy that there is no true conceptualization of a social phenomenon, and that it is entirely acceptable for researchers to "fish" for new conceptualizations of the social world. I'm interested in Chapter 13's section introducing variants of the LDA model, like incorporating upstream prevalence and content covariates into the model. Are there ready-to-use Python packages providing those LDA variants?
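On the package question: plain LDA is readily available in gensim (a minimal sketch below, on a toy corpus I made up); to my knowledge, the covariate-aware variant the chapter discusses, the structural topic model, is most mature in the R package stm rather than in Python.

```python
from gensim import corpora, models

# Toy tokenized documents (illustrative only).
texts = [["vote", "bill", "senate"], ["sun", "heat", "beach"],
         ["senate", "vote", "law"], ["beach", "sun", "sunburn"]]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

# Plain LDA; prevalence/content covariates are not part of this model.
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      random_state=0, passes=10)
print(lda.print_topics())
```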
I really like the authors' discussion of clustering, and I am curious how to settle on the right granularity of clustering. If the results of unsupervised learning are meaningless, can we adjust them manually, for example the number of clusters? Or do we just have to look at other corpora?
I think that the multinomial language model described in Chapter 6 is quite similar to the vector space model introduced in Chapter 7, as they both stem from the same data source, the distribution of words, and the order of words is neglected in both types of models. This similarity is also mentioned in the conclusion of Chapter 6. Thus, I wonder why we still need the multinomial language model, since it seems we don't need to pay additional attention to zero counts in the vector space model, which are a major issue when constructing multinomial language models. Meanwhile, can we regard the presence of zeros as a result of the rather small size of the Jay corpus, something that would naturally resolve once the corpus is expanded to a certain size?
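A tiny numeric illustration of why the zeros matter (toy counts of my own devising): under the unsmoothed multinomial model, a single unseen word drives the log-likelihood of the entire test document to negative infinity.

```python
import numpy as np

vocab = ["government", "people", "man"]
train_counts = np.array([30, 20, 0])      # "man" never appears in training
p_unsmoothed = train_counts / train_counts.sum()

test_counts = np.array([3, 2, 1])         # the test document uses "man" once

# One zero probability zeroes out the whole document likelihood.
with np.errstate(divide="ignore"):
    print(np.sum(test_counts * np.log(p_unsmoothed)))   # -inf
```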
Chapter 13 sparked a question for me: Can documents with identical word combinations have different themes if the order of words and sentences is altered? How would this affect our text analysis approach? I'm curious to know others' thoughts on the potential for identical words to convey varying meanings based on their arrangement.
In Chapter 10, the authors lay out some basic principles of computational discovery: context relevance, no ground truth, judge the concept not the method, and separate data is best. I especially agree with the first principle, thinking that algorithms are just tools rather than the central topic. Regarding content analysis, qualitative human analysis matters far more than the machine learning approach, since humans understand words better. Context implies presence, and presence carries the natural sentiments triggered by the environment, which are hard for an algorithm to simulate. Abductive analysis is an appealing concept, in which I see the significance of qualitative methods like grounded theory. How can we better combine grounded theory and content analysis, since they are inherently different approaches?
In Chapter 6, the authors employ Laplace smoothing as a regularization technique to address the issue of zero probabilities for rare word occurrences, such as the absence of the word 'man' in John Jay's texts. Given that Laplace smoothing can introduce bias, particularly in cases where the actual frequency of the word really is very low, how can we assess and mitigate the trade-off between eliminating zero probabilities and introducing potential bias? Furthermore, could you explore the impact of different values of the smoothing parameter (alpha) on the model's performance, particularly its ability to distinguish between authors in cases of rare word usage, so that we can better understand the role of the smoothing parameter?
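A small sketch of the trade-off (toy counts and hypothetical alpha values): larger alpha pulls the estimates toward the uniform distribution, shrinking the influence of the observed counts.

```python
import numpy as np

counts = np.array([30, 20, 0])   # toy word counts; the last word is unseen
V = len(counts)

for alpha in [0.0, 0.1, 1.0, 10.0]:
    p = (counts + alpha) / (counts.sum() + alpha * V)
    print(alpha, np.round(p, 3))  # alpha=0 keeps the zero; large alpha flattens
```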
In Chapter 12 the authors discuss the uncertainty and limitations of clustering -- namely, which partition is the "best" -- since any such result depends on the dataset and the evaluation criteria. However, as I was exploring the k-means method and trying out different cluster numbers manually, a question about text-processing intuition and efficiency arose: is there any general starting point for deciding the initial/default number of clusters, and how can this parameter be tuned efficiently? Since the authors also mention that it is almost impossible to distinguish between different partitions, does this mean the ultimate goal is to find the best parameter within a partitioning approach based on its objective function? In other words, how do we balance unsupervised learning against our presumptions, and "exclusivity" against "cohesiveness" (p. 138)?
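One common, if rough, starting point is to scan a small range of k and compare an internal criterion such as the silhouette score; a scikit-learn sketch on stand-in features (the data and the range of k are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.default_rng(0).normal(size=(200, 10))  # stand-in document features

for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
```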
In Chapter 6, Grimmer, Roberts, and Stewart (2022) argue that we ought to judge concepts on their own merits, not on the merits of their origins. However, especially when handling ML models, the origin can be relevant (as in the case of a model developed on a biased dataset). With that framing, is there an assertion that such issues will always be apparent in the resultant concept -- thus removing the need to think about where a given concept came from? If so, to what extent does that assertion hold water?
In Chapter 13 the authors describe Latent Dirichlet Allocation (LDA), the most famous topic model. I am curious why the Dirichlet distribution performs best among the family of possible multivariate distributions for topic modeling. What property makes it mimic our cognitive process of speech generation?
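Part of the answer, at least computationally, is conjugacy (a standard result, stated here from memory rather than quoted from the chapter): the Dirichlet is the conjugate prior of the multinomial, so the posterior is again a Dirichlet with the observed word counts simply added to the prior,

$$
\theta \sim \mathrm{Dir}(\alpha), \quad w_{1:M} \mid \theta \sim \mathrm{Mult}(\theta)
\;\Longrightarrow\;
\theta \mid w_{1:M} \sim \mathrm{Dir}(\alpha + n),
$$

where $n$ is the vector of observed word counts. In addition, with $\alpha < 1$ the Dirichlet concentrates mass near the corners of the simplex, encouraging the sparse topic mixtures we expect of real documents.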
Building on the "Clustering" chapter, we could apply clustering to longitudinal studies, particularly to tracking and analyzing the evolution of textual information over time. What strategies can be employed to mitigate the risk of over-interpreting the data, ensuring that observed changes are accurately represented and not artifacts of natural language evolution or dataset biases?
Reflecting on the content in Chapter 6 about the multinomial language model and its application in text analysis, particularly in the context of the Federalist Papers authorship debate, I'm curious: How do you think the introduction of regularization methods, like Laplace smoothing or the use of a Dirichlet distribution, impacts the accuracy and reliability of authorship predictions in text analysis, especially when dealing with limited data from certain authors?
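One hypothetical way to probe this (toy counts, not the actual Federalist data): score a disputed document under each author's smoothed model and check whether the comparison flips as alpha changes.

```python
import numpy as np

def loglik(test_counts, train_counts, alpha):
    """Log-likelihood of test counts under a Laplace-smoothed multinomial."""
    p = (train_counts + alpha) / (train_counts.sum() + alpha * len(train_counts))
    return float(np.sum(test_counts * np.log(p)))

# Toy counts over a shared vocabulary (illustrative only).
hamilton = np.array([40, 10, 5, 0])
madison = np.array([12, 30, 2, 6])
disputed = np.array([4, 1, 1, 1])

for alpha in [0.1, 1.0, 10.0]:
    diff = loglik(disputed, hamilton, alpha) - loglik(disputed, madison, alpha)
    print(alpha, round(diff, 3))  # a sign flip would mean a smoothing-sensitive call
```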
In Chapter 10.5 the authors talk about political corpora. They contend that your research question is going to drive what model you use (in this case, clusters). In Chapter 12, they talk about how your research question will drive what granularity of clustering is most interesting.
The political voting record seemed like a confusing choice of corpus; the authors seem to have made a series of assumptions. Why not use speeches? This choice seems to ignore how modern bills are passed: everything today is a compromise, and many bills are tacked onto each other. So, my question is, how do you make sure you're not just forcing your data into your question? How do you know your corpus can support the level of granularity that you have assigned based on your question?
Though I understand the basics of one-hot encoding, I'm still not entirely sure how we don't run into issues of multicollinearity. If we have n categories, then ought we not to have n − 1 dummy variables, so that each column remains linearly independent of the others? For example, say our variable of interest is sex assigned at birth, with categories Female and Male. If we had a single dummy variable where 0 is Female and 1 is Male, is this not preferred over one-hot encoding, which assigns dummyF a value of 1 if Female and 0 otherwise and dummyM a value of 1 if Male and 0 otherwise? Perhaps I am just fundamentally missing something, but this question has been bugging me since I reviewed a little of what I learned in my undergrad ML course.
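A quick pandas illustration of the two encodings (the variable and values follow your example):

```python
import pandas as pd

df = pd.DataFrame({"sex": ["Female", "Male", "Female"]})

# Full one-hot: the dummy columns sum to 1 in every row, so together with
# an intercept the design matrix is rank-deficient (the dummy-variable trap).
print(pd.get_dummies(df["sex"]))

# Reference coding: drop one level so the columns stay linearly independent.
print(pd.get_dummies(df["sex"], drop_first=True))
```

My understanding is that the n − 1 reference coding matters for unregularized regression with an intercept, while full one-hot is common in ML pipelines, where there is often no intercept, or where regularization breaks the exact collinearity.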
There seem to be too many choices to make around the clustering and topic modeling methods --- kinda intimidating. I'm listing here some questions I have regarding my dataset when reading:
This is my question after reading the chapter on clustering. What challenges could be faced when applying clustering to large, complex, and diverse datasets? And how can they be addressed?
I am legitimately confused about Figure 6.1 here. Is this a 3d embedding projected to 2d?
Regarding three components of clustering approaches mentioned in chapter 12, in what ways do the choices made in these components affect the final partitions obtained in a clustering task?
I also have a question about unsupervised learning methods such as clustering and topic modeling: when is it appropriate to choose unsupervised models, and in particular, how do we interpret the results? It seems the authors suggest interpreting mainly by reading documents in context to see whether the output is appropriate, without resorting to more objective measurements.
How does transforming data in various ways affect various clustering algorithms' results?
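A small scikit-learn experiment one could run to see this (synthetic data in which one feature dominates in scale; purely illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) * np.array([1.0, 100.0])  # second feature dominates

raw = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
scaled = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

# Low agreement indicates the transformation changed the partition.
print(adjusted_rand_score(raw, scaled))
```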
I am wondering whether there are established preferences between probabilistic clustering models and algorithmic clustering methods, and how supervised versus unsupervised approaches figure into that choice.
Although the chapter discusses various clustering methods and their respective characteristics in detail, are there specific empirical reference standards, based on data characteristics, clustering objectives, and processing capabilities, for choosing the best clustering method for a specific field of study such as sociology?
Chapter 10 is quite similar to our orienting reading, because both outline the value of untheorized data. However, Chapter 10 also insists on another crucial part of discovery, namely conceptualization. Before we start observing and exploring a new dataset, should we already have a concept? Or does the conceptualization originate from the discovery process itself?
Cluster analysis is truly amazing. I had always assumed its primary use was for dimension reduction in extensive, unstructured datasets. However, it's clear that clustering is more than just a tool for managing large data volumes; it's a great starting point for making new discoveries. I'm curious about the potential synergy between topic modeling and clustering. Would combining these two methods offer a better understanding of each cluster and its unique representations?
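One simple version of that synergy, sketched with scikit-learn on a random stand-in document-term matrix: fit LDA, then cluster documents in the space of their topic proportions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.cluster import KMeans

# Stand-in document-term count matrix (illustrative; use real counts in practice).
X = np.random.default_rng(0).integers(0, 5, size=(100, 50))

# Represent each document by its topic proportions, then cluster there.
theta = LatentDirichletAllocation(n_components=8, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(theta)
print(labels[:10])
```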
In the chapter on clustering, the authors discuss the threat of falling into a local optimum, and offer an optimistic view that such optima can still provide interesting insights. However, if we are interested in finding the global optimum, what methods exist for detecting whether you are in a local optimum? Are there methods that can detect when you are about to fall into a local optimum so you can skirt it? What different kinds of insights do local vs. global optima provide?
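The usual practical hedge, short of any guarantee, is many random restarts, keeping the run with the best objective; in this sketch (synthetic data), the spread of objective values across runs hints at how rugged the surface is.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.default_rng(0).normal(size=(300, 10))  # synthetic stand-in data

# Twenty single-start runs; keep the lowest inertia (the k-means objective).
inertias = [KMeans(n_clusters=5, n_init=1, random_state=s).fit(X).inertia_
            for s in range(20)]
print(min(inertias), max(inertias))
```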
Throughout the readings, and similarly with the prior chapters we've had assigned, I've been amazed at the application of mathematics to large corpora. Growing up in the school of thought that academic domains tend to be siloed off from each other, these readings certainly help to show that the current and future state of advanced research is more in the collaboration of domains than the isolation of them (e.g. synthetic biology, etc.). On that note, I generally come away from these readings with practical concerns about the application of these algorithms, equations, etc. and the technical skills required to implement them. Do there exist software packages that allow you to apply a multitude of distance, clustering, and foundational language models at once so that you can make the initial steps of corpora analysis less opaque? I'm thinking of some sort of job runner, or would using something like the currently existing NextFlow accomplish this task?
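Short of a dedicated pipeline tool, scikit-learn's shared fit_predict interface already gets part of the way there; a sketch with stand-in features and arbitrarily chosen hyperparameters:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

X = np.random.default_rng(0).normal(size=(200, 10))  # stand-in features

# One loop applies several clustering models to the same matrix.
models = {
    "kmeans": KMeans(n_clusters=5, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=5),
    "dbscan": DBSCAN(eps=2.5),
}
for name, model in models.items():
    labels = model.fit_predict(X)
    print(name, len(set(labels)))  # number of distinct labels found
```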
According to Chapter 10, is it also acceptable if we begin with no clear hypothesis and discover the theme and research questions along the way as we explore the data? (In a manner more similar to an anthropological approach?)
After topic modeling, we should label the topics to make them communicable. How can we justify that our labels are reasonable?
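One standard first check, sketched here with scikit-learn's LDA on a toy corpus I made up: read the top words per topic and ask whether the proposed label survives contact with them.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the senate passed the bill", "sun and heat cause sunburn",
        "the bill failed a senate vote", "beach sun heat summer"]
vec = CountVectorizer()
X = vec.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
words = np.array(vec.get_feature_names_out())

# Top words per topic: the raw material for (and check on) a human label.
for k, topic in enumerate(lda.components_):
    print(k, words[np.argsort(topic)[::-1][:4]])
```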
Given the iterative nature of optimization algorithms in clustering, how can these algorithms be redesigned to better handle the ambiguity and diversity of text data, ensuring they provide meaningful and stable solutions in varied research contexts?
Chapter 10 emphasizes principles of computational discovery, notably context relevance and the superiority of human analysis for understanding context and sentiment in content analysis. It argues that qualitative methods, such as grounded theory, are crucial for capturing the nuances of context that algorithms miss. Despite their methodological differences, I assume the challenge here is integrating grounded theory with content analysis to leverage both the depth of qualitative insights and the breadth of quantitative data analysis.
Post questions here for this week's fundamental readings:
Grimmer, Justin, Molly Roberts, Brandon Stewart. 2022. Text as Data. Princeton University Press: Chapters 10, 12, 6, 13 —“Principles of Discovery”, “Clustering”, “The Multinomial Language Model”, “Topic Models”.