UChicago-Computational-Content-Analysis / Readings-Responses-2023


5. Classifying Meanings & Documents - fundamental #29

Open JunsolKim opened 2 years ago

JunsolKim commented 2 years ago

Post questions here for this week's fundamental readings: Grimmer, Justin, Molly Roberts, Brandon Stewart. 2022. Text as Data. Princeton University Press: Chapters 17, 18, 20, 21, 22 —“An Overview of Supervised Classification”, “Coding a Training Set”, “Checking Performance”, “Repurposing Discovery Methods”, “Inference”.

pranathiiyer commented 2 years ago

Chapter 21 briefly touches on this, but are there ways of validating the continuous outputs of unsupervised models such as topic models without coarsening those outputs to make them comparable to hand-coded sets?
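
A minimal sketch of one possibility (my own illustration, not from the chapter): validate the continuous topic proportions directly against a hand-coded continuous or ordinal score with a rank correlation, so neither side has to be coarsened. The data below are random placeholders.

```python
# Sketch: compare continuous topic proportions to hand-coded scores without
# coarsening either side. `doc_topic` and `hand_scores` are placeholders.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
doc_topic = rng.dirichlet(np.ones(10), size=200)   # stand-in for model output
hand_scores = rng.integers(1, 6, size=200)         # stand-in for hand codes (1-5)

topic_of_interest = 3
rho, p = spearmanr(doc_topic[:, topic_of_interest], hand_scores)
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```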

isaduan commented 2 years ago

(1) I would appreciate clarification on why Naive Bayes models tend to choose the right category but produce poorly calibrated predicted probabilities for each class (a rough sketch of how to check this appears below).

(2) If we repurpose discovery methods for measurement, how do we establish a null hypothesis? What would "no pattern" look like?
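
On (1): a rough sketch (not from the book; the corpus and categories are arbitrary) of how to see the miscalibration and re-calibrate with scikit-learn.

```python
# Sketch: Naive Bayes often ranks the right class first but pushes predicted
# probabilities toward 0 or 1, because the independence assumption multiplies
# many correlated word likelihoods. Platt-style recalibration is one fix.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X = CountVectorizer(min_df=5).fit_transform(data.data)
X_tr, X_te, y_tr, y_te = train_test_split(X, data.target, random_state=0)

nb = MultinomialNB().fit(X_tr, y_tr)
print("raw NB top-class probabilities:",
      nb.predict_proba(X_te).max(axis=1)[:5].round(3))

# Re-calibrate the probabilities (sigmoid / Platt scaling); the class ranking
# is largely preserved, but the probabilities become less extreme.
calibrated = CalibratedClassifierCV(MultinomialNB(), method="sigmoid", cv=5)
calibrated.fit(X_tr, y_tr)
print("calibrated top-class probabilities:",
      calibrated.predict_proba(X_te).max(axis=1)[:5].round(3))
```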

GabeNicholson commented 2 years ago

What do the authors mean by, "we emphasize trying to build in an evaluation loop - a process of validating and subsequently updating the measure at regular intervals throughout the research process and after deployment"? Because if they are updating the measure each time, then the p-values and model estimates will be invalid since they are making data-dependent choices. Or am I misunderstanding this?

konratp commented 2 years ago

A mentor of mine during my undergrad once told me that as political scientists, we don't predict things; we merely analyze events that have already happened. My understanding is that causal inference is much more common than prediction in the social sciences. In Chapter 22, the authors compare prediction and causal inference, which made me wonder: under what circumstances is prediction the preferred approach over causal inference?

melody1126 commented 2 years ago

In research, are the unsupervised learning methods we learned in previous weeks (clustering, topic modeling, word embeddings, etc.) favored over supervised learning methods that require hand-coding? If there are existing hand-codes of words and we can apply unsupervised methods to those, how would that approach compare with supervised learning methods?

ValAlvernUChic commented 2 years ago

Chapter 22 got me thinking about how we could reliably make causal inferences about the mechanisms of discourse and its effects on wider socio-cultural behavior, without having to go through the toil of individual experiments and then using those results to make broader inferences about society. Prediction is totally possible ('if we frame the discourse in one way, we can predict the population will react in this way'), but I can't (I guess for now?) see how we can claim 'framing discourse in one way will cause this effect via this mechanism' when applied to larger society, especially when behavior is aggregated and confounders are plentiful.

NaiyuJ commented 2 years ago

Re Chapter 19: Can I perform unsupervised learning followed by supervised learning? That is, can I classify documents using both supervised and unsupervised learning? For example, can I first cluster my data and then see whether the cluster information helps my classification task? Would the combination of supervised and unsupervised methods make the results more accurate?
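
A minimal sketch of what this could look like (my own illustration, assuming scikit-learn; the corpus and parameters are arbitrary): add the k-means cluster assignment as extra features and check with cross-validation whether it helps.

```python
# Sketch of "cluster first, then classify": append one-hot k-means cluster
# assignments to the TF-IDF features and compare cross-validated accuracy.
from scipy.sparse import hstack
from sklearn.cluster import KMeans
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import OneHotEncoder

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X = TfidfVectorizer(min_df=5).fit_transform(data.data)

clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)
cluster_feats = OneHotEncoder().fit_transform(clusters.reshape(-1, 1))
X_aug = hstack([X, cluster_feats]).tocsr()

clf = LogisticRegression(max_iter=1000)
print("words only:      ", cross_val_score(clf, X, data.target, cv=5).mean())
print("words + clusters:", cross_val_score(clf, X_aug, data.target, cv=5).mean())
```

In a real analysis the clustering would be fit inside each training fold so the cluster features don't leak information into the evaluation.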

Sirius2713 commented 2 years ago

For human-coded documents, is there a way to automatically label the remaining documents based on partially hand-coded data?
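
One common approach is semi-supervised label propagation. A rough sketch (my own, assuming scikit-learn's LabelSpreading and an arbitrary corpus), where uncoded documents are marked with -1:

```python
# Sketch: spread labels from a small hand-coded subset to the rest of the
# corpus over a nearest-neighbor graph of reduced TF-IDF features.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelSpreading

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X = TruncatedSVD(n_components=100, random_state=0).fit_transform(
    TfidfVectorizer(min_df=5).fit_transform(data.data))

y = np.full(len(data.target), -1)                       # -1 = not hand-coded
coded = np.random.default_rng(0).choice(len(y), size=100, replace=False)
y[coded] = data.target[coded]                           # pretend only 100 docs are coded

model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y)
acc = (model.transduction_ == data.target).mean()
print(f"accuracy of propagated labels: {acc:.2f}")
```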

sudhamshow commented 2 years ago

1) While classification of documents using labelled data provides robust results on unseen data (albeit with a significant degree of bias/error), the caveat is that one usually can't anticipate the number of topics/subtopics present in the corpus if the subject is not fully explored. This matters especially when trying to reason causally about why a document was assigned to a particular category: even if there is some truth to the classification, the actual mechanics could remain hidden if a latent topic is merely correlated with the assigned label. Is it possible to explore the ground truth by clustering into finer subtopics (using unsupervised methods like k-means) or through semi-supervised learning (with label propagation)?

2) How are topics and words (text data in particular) usually spread in high-dimensional space? Simple supervised classifiers like Naive Bayes and logistic regression can detect sentiment with reasonable accuracy. Does this mean text is organized in a way that is easily separable by a hyperplane? Can it be organized in complicated manifolds in some settings (due to context, language, etc.)?

Jasmine97Huang commented 2 years ago

There seems to be a lot of software used in industry that relies on human-assisted classification/tagging, where the ML algorithms tag entities of interest and display them for human subject-matter experts to annotate further. I think it would also be interesting for social scientists to incorporate this kind of training method into crowdsourcing platforms like MTurk.

hshi420 commented 2 years ago

What is the usual threshold for the size of a labeled dataset, i.e., how large does it need to be to train a model that can do automatic labeling accurately?
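
There is probably no fixed threshold; one practical check (my own suggestion, with an arbitrary dataset) is to plot a learning curve and see where held-out performance stops improving as the labeled set grows:

```python
# Sketch: estimate how accuracy changes with the number of labeled documents.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
X = TfidfVectorizer(min_df=5).fit_transform(data.data)

sizes, _, test_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, data.target,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)
for n, s in zip(sizes, test_scores.mean(axis=1)):
    print(f"{n:5d} labeled docs -> CV accuracy {s:.3f}")
```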

Jiayu-Kang commented 2 years ago

> Re Chapter 19: Can I perform unsupervised learning followed by supervised learning? That is, can I classify documents using both supervised and unsupervised learning? For example, can I first cluster my data and then see whether the cluster information helps my classification task? Would the combination of supervised and unsupervised methods make the results more accurate?

I think you can do so! But I'm concerned that this would make the results less meaningful: if the clusters don't make much sense and/or are not very informative (which is quite likely), then you might not get very useful results for interpretation. I'm also curious: if a supervised model uses features produced by unsupervised learning (the evaluation and selection of those features may still require human effort), is it still considered "supervised learning"?

LuZhang0128 commented 2 years ago

I wonder whether supervised learning, with ground-truth labels, would always perform better than unsupervised learning. For instance, suppose we have a set of documents that we would like to sort into 4 categories. We could achieve this by running a k-means algorithm, or we could hand-code some of the documents and train a supervised classifier. Will supervised learning produce a better result?

Qiuyu-Li commented 2 years ago

1. In practice, what is the common number of coding rounds?
2. What algorithms do Outlook and Gmail use for spam identification?

mikepackard415 commented 2 years ago

I'm interested in this concept of using discovery techniques (topic models, word embeddings) and then applying classification/inference techniques with the discovered concepts. How does that work in practice?
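
One common pattern is to use the per-document topic proportions as predictors in a downstream classifier or regression. A rough sketch (my own, assuming scikit-learn and an arbitrary corpus):

```python
# Sketch: fit a topic model, then feed the document-topic shares into a
# supervised classifier as low-dimensional, interpretable features.
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

data = fetch_20newsgroups(subset="train", categories=["sci.space", "rec.autos"])
counts = CountVectorizer(min_df=5, stop_words="english").fit_transform(data.data)

lda = LatentDirichletAllocation(n_components=20, random_state=0)
doc_topics = lda.fit_transform(counts)          # (n_docs, n_topics) proportions

clf = LogisticRegression(max_iter=1000)
print("CV accuracy on topic shares:",
      cross_val_score(clf, doc_topics, data.target, cv=5).mean())
```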

YileC928 commented 2 years ago

> I'm interested in this concept of using discovery techniques (topic models, word embeddings) and then applying classification/inference techniques with the discovered concepts. How does that work in practice?

I second Mike's question. How might unsupervised and supervised methods be combined to produce better results?

kelseywu99 commented 2 years ago

In Chapter 18, the authors touch on coding a training set. I am curious about the pros and cons of hand-coding versus adapting existing codebooks. Under what circumstances does one outweigh the other?

chentian418 commented 2 years ago

I have a question about inference in Chapter 22. When we try to predict some target outcome, we may want to use high-dimensional inputs like the word vectors from word-embedding models. How, then, do we do inference on these high-dimensional vectors?

Moreover, as the prediction model becomes more complex, how do we do inference on the independent variables of a machine learning model in order to interpret their effects on the outcome?

Emily-fyeh commented 2 years ago

I am also curious about how exactly to repurpose discovery models, and how we measure the results of repurposing. Perhaps there are common practices in each field, or perhaps the categorizations are usually fixed, so there is no need to use the unsupervised method.

Hongkai040 commented 2 years ago

Repurposing discovery methods reminds me of the papers we read in Week 3, especially the one using topic modeling to identify the flow of ideas/word patterns among parliament members. In that case, I think the authors didn't perform validation, yet it was quantitative research that yielded many meaningful results. Is that a kind of measurement or just discovery? I'm somewhat confused about where the boundary between discovery and measurement lies. Is it necessary to split our dataset?

ttsujikawa commented 2 years ago

I was wondering how we could confidently label datasets in the supervised setting. By its nature, supervised learning assumes we know the dividing lines among data points before running the model, but this process can be highly inductive and lead to pseudo-scientific research. To avoid this, labeling should be evidence-based to some extent. How should we approach this issue?
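
One standard piece of evidence is chance-corrected intercoder agreement on a shared subset before any model is trained. A tiny sketch (my own; the coder labels below are made up) using Cohen's kappa:

```python
# Sketch: report chance-corrected agreement between two coders on the same
# documents; the label vectors here are hypothetical placeholders.
from sklearn.metrics import cohen_kappa_score

coder_a = [1, 0, 1, 1, 0, 2, 1, 0, 2, 2]
coder_b = [1, 0, 1, 0, 0, 2, 1, 1, 2, 2]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa = {kappa:.2f}")  # values around 0.7+ are often treated as acceptable
```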