Open lkcao opened 9 months ago
While reading about the problems with existing approaches to estimating the true aggregate proportion of population documents that fall into each category, I was confused by the second problem: the data-generating process runs from D to S (i.e., P(S|D)), but standard classifiers model P(D|S). As I understand it, the goal is not to predict each individual document's category D but to predict the aggregate proportion of documents in each D, which to me still means predicting D. So I am not sure why modeling P(D|S) is a problem here. Hopkins and King give the running example of bloggers who do not discover their opinions after posting, but rather first form an opinion and then express it, which makes sense. But I am not sure how predicting the word features S fits into this example, since we are not predicting which words the blogger used.
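One way to see why the direction of conditioning matters: Hopkins and King never predict S. They use the identity P(S) = P(S|D) P(D), estimate P(S|D) from the hand-labeled sample and P(S) from the full corpus, then invert for the proportions P(D). A minimal numerical sketch (toy numbers, not from the paper) with two categories and three word-stem profiles:

```python
import numpy as np

# Hypothetical P(S|D): rows are word-stem profiles S, columns are
# categories D, as would be estimated from a hand-labeled sample.
P_S_given_D = np.array([
    [0.6, 0.1],
    [0.3, 0.3],
    [0.1, 0.6],
])

# True (normally unknown) category proportions in the population.
true_P_D = np.array([0.7, 0.3])

# What we actually observe in the unlabeled corpus is P(S),
# which by the accounting identity equals P(S|D) @ P(D).
P_S = P_S_given_D @ true_P_D

# Invert the relation: solve P(S) = P(S|D) P(D) for P(D) by
# least squares (exactly identified in this toy example).
est_P_D, *_ = np.linalg.lstsq(P_S_given_D, P_S, rcond=None)
print(np.round(est_P_D, 3))  # recovers [0.7, 0.3]
```

The point is that P(S|D) appears as a fixed matrix of conditional probabilities to be estimated, not as a prediction target for any blogger's words.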
It is an amazing paper. I have seen a few papers using the method introduced in this one. I would like to learn more about how to adjust for inter-rater reliability before applying supervised or semi-supervised machine learning models.
I think this paper echoes the main argument of the orienting reading, namely, keeping the human part in machine learning. However, one question I have about using word-frequency analysis to infer the distribution of documents over categories is: how can we ensure that word frequencies alone capture the deeper, more abstract thematic or sentiment nuances in the text? In that case, why wouldn't word embedding models be a better option here?
I like this paper, and most of it makes sense to me! But I wonder why the authors use both a simulated dataset and an empirical dataset to test the new method. Does the simulated dataset provide stronger validation than the empirical one?
Given the method's efficacy in categorizing text documents into predefined categories, how might this approach be adapted to handle datasets with inherent ambiguity or fluid categories, such as social media posts where new slang and expressions constantly emerge?
How does the new method developed for automated content analysis ensure unbiased estimates of category proportions across large datasets, and how does it perform when compared to traditional classifiers that focus on the accuracy of individual document classification?
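On the comparison with traditional classifiers: a classifier can be quite accurate on individual documents and still give biased aggregate proportions whenever the category mix in the target corpus differs from the training set. A hypothetical binary example (toy numbers, and a simple misclassification correction in the same aggregate spirit as Hopkins and King's approach, not their exact estimator):

```python
# Hypothetical classifier with 80% sensitivity and 80% specificity.
sens, spec = 0.8, 0.8
true_prop = 0.7  # true share of "positive" documents in the corpus

# Classify-and-count: expected observed positive rate is pulled
# toward 0.5 by the misclassification errors.
observed = sens * true_prop + (1 - spec) * (1 - true_prop)
print(round(observed, 2))  # 0.62, not 0.7

# Correcting with the known error rates recovers the true proportion:
corrected = (observed - (1 - spec)) / (sens - (1 - spec))
print(round(corrected, 2))  # 0.7
```

This is why optimizing per-document accuracy and producing unbiased category proportions are genuinely different goals.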
Post questions here for this week's exemplary readings: