UChicago-Computational-Content-Analysis / Readings-Responses-2024-Winter

5. Machine Learning to Classify and Relate Meanings - [E1] Hopkins, Daniel J. and Gary King. 2010. #33

Open lkcao opened 9 months ago

lkcao commented 9 months ago

Post questions here for this week's exemplary readings:

  1. Hopkins, Daniel J. and Gary King. 2010. A Method of Automated Nonparametric Content Analysis for Social Science. American Journal of Political Science 54(1): 229-247.

yuzhouw313 commented 8 months ago

While reading about the problems with existing approaches to estimating the true aggregate proportion of population documents that fall into each category, I was confused by the second problem: the data generation process is assumed to be P(D|S), but the true target is P(S|D). As I understand it, the goal is not to predict each individual document's category D but to predict the aggregate proportion of D, which to me still means predicting D. So I am not sure why modeling P(D|S) is a problem here. Hopkins and King give the running example of bloggers, who do not discover their opinions after posting but first form an opinion and then express it, which makes sense. But I am not sure how predicting the word features S fits into this example, since we are not predicting which words the blogger used.
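
For reference, the identity this question turns on can be written out as below. This is a sketch of my reading of the paper, not a quotation: S is a document's word-stem profile over K stems, D is one of J categories, and the superscript h (my notation for the hand-coded set) marks quantities estimated from labeled documents. S is never predicted; its observed distribution is the left-hand side of a regression whose unknown coefficients are the category proportions.

```latex
% Law of total probability over the J categories, for each word-stem profile s:
P(S = s) \;=\; \sum_{j=1}^{J} P(S = s \mid D = j)\, P(D = j)

% Stacked over all 2^K profiles, this is a linear system in the unknown P(D):
\underbrace{P(S)}_{2^K \times 1}
  \;=\; \underbrace{P(S \mid D)}_{2^K \times J}\;
        \underbrace{P(D)}_{J \times 1}

% Assumption needed: language use within a category is the same in the
% hand-coded set as in the population, while P(D) itself is free to differ:
P^{h}(S \mid D) \;=\; P(S \mid D)
```

A classifier that models P(D|S) instead needs the labeled set to match the population in P(D|S), which generally fails when the category proportions shift between the two.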

muhua-h commented 8 months ago

It is an amazing paper. I have seen a few papers using the method introduced in this one. I would like to learn more about how to adjust for inter-rater reliability before applying supervised or semi-supervised machine learning models.

cty20010831 commented 8 months ago

I think this paper echoes the main argument of the orienting paper, namely the inclusion of a human component in machine learning. However, one question I have about using word frequencies to infer the distribution of documents over categories is how we can be sure that word frequency alone captures deeper, more abstract thematic or sentiment nuances in the text. Why wouldn't word embedding models be a better option here?

yueqil2 commented 8 months ago

I like this paper, and most of it makes sense to me! But I wonder why they use both a simulated data set and an empirical data set to test the new method. Can the simulated data set provide stronger validation than the empirical one?

floriatea commented 7 months ago

Given the method's efficacy in categorizing text documents into predefined categories, how might this approach be adapted to handle data sets with inherent ambiguities or fluid categories, such as social media posts where new slang and expressions constantly emerge?

JessicaCaishanghai commented 7 months ago

How does the new method developed for automated content analysis ensure unbiased estimates of category proportions across large datasets, and how does it perform when compared to traditional classifiers that focus on the accuracy of individual document classification?
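
One way to explore this question is a small simulation. The sketch below is not the authors' implementation: it uses a made-up 6-profile, 3-category P(S|D) matrix, a balanced labeled set, and a skewed population, all of which are illustrative assumptions. It compares "classify and count" with solving P(S) = P(S|D) P(D) by least squares; the function names (`simulate`, `direct_estimate`, `classify_and_count`, `run_demo`) are hypothetical.

```python
import numpy as np

# Hypothetical P(S | D): rows are 6 word profiles, columns are 3 categories.
P_S_GIVEN_D = np.array([
    [0.40, 0.10, 0.05],
    [0.25, 0.10, 0.10],
    [0.10, 0.40, 0.10],
    [0.10, 0.25, 0.10],
    [0.10, 0.10, 0.25],
    [0.05, 0.05, 0.40],
])

def simulate(n, class_props, rng):
    """Draw a category D for each document, then a word profile S given D."""
    n_profiles, n_cats = P_S_GIVEN_D.shape
    d = rng.choice(n_cats, size=n, p=class_props)
    s = np.empty(n, dtype=int)
    for k in range(n_cats):
        idx = np.flatnonzero(d == k)
        s[idx] = rng.choice(n_profiles, size=idx.size, p=P_S_GIVEN_D[:, k])
    return s, d

def profile_by_category_counts(s_lab, d_lab):
    """Cross-tabulate labeled documents: profile (row) by category (column)."""
    counts = np.zeros(P_S_GIVEN_D.shape)
    np.add.at(counts, (s_lab, d_lab), 1.0)
    return counts

def direct_estimate(s_lab, d_lab, s_unl):
    """Estimate P(D) by regressing population P(S) on labeled P(S | D)."""
    counts = profile_by_category_counts(s_lab, d_lab)
    X = counts / counts.sum(axis=0, keepdims=True)           # P^h(S | D)
    y = np.bincount(s_unl, minlength=X.shape[0]) / len(s_unl)  # P(S)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    beta = np.clip(beta, 0.0, None)   # crude projection back to valid proportions
    return beta / beta.sum()

def classify_and_count(s_lab, d_lab, s_unl):
    """Label each profile with its most common labeled category, then tally."""
    counts = profile_by_category_counts(s_lab, d_lab)
    predicted = counts.argmax(axis=1)[s_unl]
    return np.bincount(predicted, minlength=counts.shape[1]) / len(s_unl)

def run_demo(seed=0):
    rng = np.random.default_rng(seed)
    true_props = np.array([0.7, 0.2, 0.1])           # population, unknown in practice
    s_lab, d_lab = simulate(6000, [1/3, 1/3, 1/3], rng)  # balanced hand-coded set
    s_unl, _ = simulate(20000, true_props, rng)          # unlabeled population
    return (true_props,
            direct_estimate(s_lab, d_lab, s_unl),
            classify_and_count(s_lab, d_lab, s_unl))

if __name__ == "__main__":
    true_props, direct, counted = run_demo()
    print("true              ", true_props)
    print("direct estimate   ", direct.round(3))
    print("classify and count", counted.round(3))
```

In this setup the per-document classifier can be reasonably accurate and still miss the aggregate, because its errors do not cancel once the population proportions differ from the labeled set. The direct estimate only relies on P(S|D) being shared between the two sets, which holds here by construction and is the assumption one would need to defend on real data.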