5. Machine Learning to Classify and Relate Meanings - fundamental

lkcao commented 10 months ago

Post questions here for this week's fundamental readings:

Grimmer, Justin, Molly Roberts, Brandon Stewart. 2022. Text as Data. Princeton University Press: Chapters 17, 18, 20, 21, 22 —“An Overview of Supervised Classification”, “Coding a Training Set”, “Checking Performance”, “Repurposing Discovery Methods”, “Inference”.

XiaotongCui commented 10 months ago

Chapter 20: I've actually had a persistent question – I find it challenging to clearly distinguish between the validation set and the test set. Since both are used to assess how well the model performs, and sometimes people even skip using a validation set in real life, what is the fundamental difference between them?

yueqil2 commented 9 months ago

I’m interested in the example of Nielsen’s supervision with found data in Chapter 18 and his validation in Chapter 20. It seems like an effective way to start supervised learning, but those chapters just elaborate Nielson’s research as an example instead of giving a general instruction on how to supervise with found data. When we aim to complete supervision with found data, what should we concern about during finding data? Is this approach recommended or prohibited? How to design validation test? Do I need to complete the whole eight-part validation as Nielsen did, or should I adjust the validation according to my own found data and assumptions?

bucketteOfIvy commented 9 months ago

Grimmer, Roberts, and Stewart (2022) spend considerable time discussing human coding of supervised datasets, emphasizing (but not fully restricting themselves to) discussion of traditional methods of data classification. However, these methods typically consume decent amounts of time and money, which can make them less accessible as approaches to data classification. Similarly, alternative methods -- such as using Amazon's Mechanical Turk -- may be cheaper, but can be less reliable, morally questionable, and still time consuming to utilize. These barriers feel particularly high for MA students doing research, who seem unlikely to have grant funding behind them.

Thus, do you have any advice for MA students seeking to make an initial training data set for usage in (e.g.) MA theses? Is it best to just stick to projects with potential found data-sources or preexisting gold-standard datasets, or are the barriers to dataset labeling lower than they may initially seem?

sborislo commented 9 months ago

In Grimmer et al.'s (2022) discussion of validation, I sensed two problems: (i) openness to post-hoc rationalization of certain findings (even if the model parameters/codebook are established a priori), and (ii) partial reliance on external validation.

For largely subjective constructs like "extremism," is there a way to a priori establish predictions for known data points (e.g., through pre-registration)? And how does the apparent partial reliance on external validation allow for surprising findings?

In experimental research, surprising findings can easily be attributed to study design or changes over time; however, with complex machine learning algorithms, this seems less realistic, especially for laypeople.

ethanjkoz commented 9 months ago

Grimmer et al. provide an elucidating discussion on supervised and unsupervised learning validation techniques in chapter 20 and 21. My question relates directly to my own project. What is the process for validating semi-supervised techniques? If we only have a small amount of ground truth labels, how can we externally validate the rest of a given corpus?

yuzhouw313 commented 9 months ago

In chapter 20 and 21, Grimmer et al. introduced the alternative label for training and evaluating the model -- surrogate label. They describe it as less ideal than gold-standard label but close enough for approximation. They also caution us about the issue of such label's tendency for demonstrating the extreme cases. I have 2 questions regarding to this concept: (1) if we do not have the resource or expertise to form a set of gold-standard label, what is the process of defining or developing the surrogate labels? (2) To find such surrogate labels, can we perhaps go back to previous week's topic and use clustering or topic modeling which are essentially unsupervised to generate these surrogate label?

cty20010831 commented 9 months ago

Related to text classification and inference, I am wondering what are the ways researchers can do to ensure that the inferences they draw from text classification models to be generalizable across different datasets and domains? What are some of the methods that could be taken to improve generalizability?

erikaz1 commented 9 months ago

Hypothesis Validity is a process where we have a theory and evidence-backed "hypothesis" (I think theory is more relevant?) that is more accurate than our model so that we conclude that a model is reasonable when conforms to the "hypothesis". Overall, this seems to capture a validation process that centers comparison with previous knowledge and intuition.

How is this measure typically used/applied? (I'm imagining something like a comparison of results to previous literature in the discussion section).

naivetoad commented 9 months ago

How does the heterogeneity of data sources (text, images, video) affect the approaches and challenges in making predictions and drawing causal inferences in computational content analysis?

Marugannwg commented 9 months ago

I wonder if the prosper of different LLM can support the labeling and validation in the supervised learning task. At the coding level, it seems to me that those model can serve as an alternative/supplement coders to hand-coding or crowdsourcing. Also, is it a good idea to consider the model opinion as some "surrogated label" as well? (considering we know the context and how those LLM are trained)

Vindmn1234 commented 9 months ago

Causal inference is about estimating the specific impact of changing one variable on another, focusing solely on the effect of interventions that could realistically be implemented. By integrating generative AI into causal inference research design, i think researchers can overcome some traditional limitations, such as small sample sizes, confounding variables, and the inability to directly observe counterfactual outcomes. I'm curious about how and whether causal inference study begins to leverage these AI models or not?

anzhichen1999 commented 9 months ago

From supervision with found data, what methodologies or validation techniques can be employed to assess and enhance the representativeness and neutrality of the training sets derived from such found data, particularly when the data is curated by individuals with specific ideological perspectives?

donatellafelice commented 9 months ago

is it always better to use pairwise judgements where possible when using untrained human coders (not just mTurk workers)? for example: in exercise 5 we are asked to have our friends and family help - should we use pairwise in all human coding for those that are not experts? what are some other examples of how we can discretize data in those cases that pairwise judgements do not work?>

Caojie2001 commented 9 months ago

Various supervised and unsupervised machine learning methods for measurement are introduced in Chapters 20 and 21. For the supervised learning part, since the golden standard data is not always available, sometimes we have no choice but to use the surrogate labels. In this case, what extra measures can be taken to enhance the conclusions drawn from these data?

ana-yurt commented 9 months ago

I find "Instability in Result" an interesting point. Since unsupervised learning is sensitive to many parameters and degrees of randomness, what are some of the ways we can validate the 'truth value' of supervised findings?

Dededon commented 9 months ago

I'm curious about how can we draw clear definitions when the social categories are hard to capture by a binary category. For example, like the social class backgrounds and other demographic labels of the research targets, or the style of language use by a 19th century American and 21st century British. In these cases, even the human labeling might have some issues. Should we only use classification under very conservative research designs?

michplunkett commented 9 months ago

Most classifiers come built-in with standard hyper-parameter defaults. I believe in chapter 20 they talk about no necessarily needing to always spend time on hyper-parameter optimization. Generally speaking, is it safe to go with default values for classifiers, using the assumption that time spent on optimizing hyper parameters generally leads to diminishing returns with regard to time spent?

runlinw0525 commented 9 months ago

Given the complexities and potential challenges of obtaining gold standard data for validation in supervised learning, I wonder what alternative approaches or strategies researchers might use to ensure the reliability and accuracy of their measurements (especially in situations where gold standard data is incomplete or unavailable)?

Brian-W00 commented 9 months ago

How can researchers effectively address the challenge of dataset shift, where the statistical properties of the data change over time, in supervised text classification models to maintain accuracy and relevance?

QIXIN-LIN commented 9 months ago

Based on prior experience, crafting a codebook can be a complex task. The challenge lies in the fact that individuals may interpret a given sample differently, leading to a diversity of perspectives. It's crucial to reach a consensus on the codebook's content; however, the need often arises to refine and enhance the codebook throughout the research process. Are there any strategies to ensure that the codebook remains useful, enhancing both the efficiency and accuracy of the research?

muhua-h commented 9 months ago

Reading through the Inference section, I seem to understand authors' emphasis on how casual inference is more rigorous than prediction. However, I would like to learn more on how this casual relationship is established. Do you have any resources (e.g., textbooks, videos, online courses) to recommend? Thank you.

volt-1 commented 9 months ago

Grimmer et al. point out that the scarcity of labeling resources is a common challenge when constructing text classification training sets. They suggest using surrogate labels as an alternative but also note that low-quality surrogate labels can negatively impact models. In this situation, if we adopt a strategy of "weakening" class boundaries to some extent, such as using labels from similar classes to annotate the same sample and expand the training set size, what results would this approach produce?

ddlxdd commented 9 months ago

I'm interested in evaluating model performance when dealing with a small dataset, which becomes even smaller after dividing it into separate validation and test sets. Would cross-validation be an effective strategy in this situation?

chenyt16 commented 9 months ago

During these two weeks, I have always had a question: how to define classes. I know it is supposed to be changed according to different corpora, but for some dense text, I am not sure what the optimal solution is. Should it be divided by sentence, paragraph, or something else? In last week's assignment, I tested that dividing different classes would affect the results of clustering, and the division of classes would also affect the selection of the training set. So I want to know if anyone has any suggestions on how to divide classes?

HamsterradYC commented 9 months ago

In the application of causal inference, how do we employ multiprocess effect models in scenarios with multiple interventions, while also addressing the identification and interpretation of interaction terms in predictive models?

Twilight233333 commented 9 months ago

I wonder if we can cross-validate multiple times during the manual labeling and validation process to improve performance, or does this lead to overfitting and reduced predictive power?

floriatea commented 9 months ago

Considering unsupervised methods' tendency to capture the most prominent features of text data, how can these methods be fine-tuned to detect and measure more nuanced or less frequent but equally significant patterns within large text corpora?

joylin0209 commented 9 months ago

Are there other methods or techniques that can be applied to text classification and analysis beyond the supervised machine learning methods presented in the book? For example, how can deep learning models be leveraged for text classification, and how well do they perform on different types of text tasks?

Carolineyx commented 8 months ago

Diversity of Meeting Contexts and Circumstances: We anticipate encountering a diverse range of meeting contexts and circumstances among the couples' stories, reflecting the unique paths that led to their initial encounters.
Narrative Structures and Storytelling Techniques: there will be variations in narrative structures and storytelling techniques employed by the couples to recount their 'how they met' stories.although this is a confounding variable, need to be controlled.
Themes of Serendipity and Emotional Resonance: themes of serendipity, destiny, and emotional resonance will permeate the narratives, resonating with readers and evoking a sense of connection with the couples' experiences.

Dataset Description: The dataset comprises 206 'how they met' stories extracted from The New York Times wedding announcements, spanning from 2006 to 2010. it can be accessed in their website.

Carolineyx commented 8 months ago

The distinction between prediction and causal inference is highlighted in the text. What exactly the differences between these two tasks and why it's important to avoid conflating them, particularly in computational social science research? How might this distinction impact the validity and applicability of models used in various domains, such as assessing the effects of policies or interventions?

JessicaCaishanghai commented 8 months ago

Causal inference along with cross validation can be supplemented of each other. I'm curious that how we can coordinate both?

UChicago-Computational-Content-Analysis / Readings-Responses-2024-Winter

5. Machine Learning to Classify and Relate Meanings - fundamental #29